Thing 21: Tools of the (dirty data) trade

Overview

Teaching: min
Exercises: min
Questions
  • What is dirty data and why we should care?

Objectives
  • Dig in to dirty data. Try your hand at using an open source data cleansing tool.

  • Getting started: Pregnant men? The data says so! Getting down and dirty with data.

  • Learn more: Turn a PDF ‘tabula rasa’ into usable data with Tabula

  • Challenge me: OpenRefine is a powerful tool for cleaning up lots of dirty data

Getting started: Dirty data stories

Data horror stories: how does it happen!?

Why is ”clean” data important? Public policy, changes to medical protocols and economic decisions all depend on accurate and complete data. Thing 21 looks at the why and what of “dirty data.”

Exploring Dirty Data

  1. Pregnant men? Yes! Read some real data horror stories about dirty data
  2. How does data get dirty in the first place? This short video (5mins) from the ASPCA (American Society for Prevention of Cruelty to Animals) shows simply how incomplete, inaccurate data can occur. You can imagine how the resultant problems would multiply exponentially the bigger the dataset. 3.Browse down the Bad Data Guide list of commonly encountered data quality issues (with possible solutions). This list is aimed at journalists but it shows who is responsible for cleaning up dirty data. Click into a few of the causes and solutions to dirty data - many of us contribute information to reports or do our home account.
  3. If you have time: For a quick guide to working with spreadsheets, check out one of the School of Data’s Data Fundamentals course. The modules use real data from eg the World Bank.ts in spreadsheets, and maybe it’s time to think about how clean our own data is!

Impacts of Dirty Data

Reflect on and discuss the wide ranging implications of how dirty data can impact on your life.

Learn more: Extracting and scraping data

How often have you found data that looks interesting, but is in a PDF or webpage… how do you get the data into spreadsheets so you can work with them?

The School of Data has fantastic, easy to follow tutorials working with real data.

Option 1

Let’s start extracting tabular data from text-based PDFs. The Extracting Data From PDFs module provides a brief overview of the different techniques used to extract data from PDFs, with a focus on introducing Tabula, a free open-source tool build for this specific task.

  1. Get ready: go to Extracting Data From PDFs
    • Download the correct version of Tabula for your operating system, and java runtime if required
    • note this tutorial doesn’t work on scanned pdfs
  2. Work through as much of the Tabula tutorial as you can and remember this tutorial for the next time you get a PDF with valuable (and hard-to-extract) data!
  3. If you have time or just love data dabbling: Extracting data from PDFs will inevitably result in some dirty data creeping into your dataset. The School of Data have some really interesting Data Cleansing modules.

Option 2:

As much as we wish everything was available in CSV or the format of our choice – most data on the web is published in different forms. How do you extract data from HTML? Use a Scraper!

  1. Go to Making data on the web useful: scraping and follow the two ‘recipes’ to learn code-free Scraping in 5-10 minutes using Google Spreadsheets & Google Chrome (Note: Use the Google Chrome Extension “Scraper, by dvhtn”)

Consider: strategies for encouraging data to be published in more re-usable forms rather than PDF.

Roll up your sleeves: clean up dirty data

OpenRefine (formerly Google Refine) is a valuable open source tool that is similar to Excel but more powerful. You can use it to: record data; manipulate data; clean up dirty data; and to transform datasets.

Option 1: If you are new to OpenRefine

1.Start by watching introduction to OpenRefine (6.48mins) to learn how it can be used to clean up messy data. 2.Now get hands on! You will need to download OpenRefine and the dataset for this activity. Work your way through as much as this tutorial as you can. The sections covering Facets and Transforming Values will give you a flavor for what the OpenRefine tool offers.

Option 2: if you’re already familiar with OpenRefine and might be ready to share what you know

Take a look through the lesson materials used to teach OpenRefine in either: OpenRefine in Library Carpentry OpenRefine in Data Carpentry for Ecology.

Consider: Carpentry course Trainers and Helpers are often in high demand. Would you consider becoming a trainer to teach tools such as Open Refine.

Key Points

  • First key point.