Library Carpentry OpenRefine

Key Points

Introduction to OpenRefine
  • OpenRefine is ‘a tool for working with messy data’

  • OpenRefine works best with data in a simple tabular format

  • OpenRefine can help you split data up into more granular parts

  • OpenRefine can help you match local data up to other data sets

  • OpenRefine can help you enhance a data set with data from other sources

Importing data into OpenRefine
  • Use the ‘Create Project’ option to import data

  • You can control how data imports using options on the import screen

Layout of OpenRefine, Rows vs Records
  • OpenRefine uses rows and columns to display data

  • Most options to work with data in OpenRefine are accessed through a drop-down menu at the top of a data column

  • When you select an option in a particular column (e.g. to make a change to the data), it will affect all the cells in that column

  • OpenRefine has a Records mode which links together multiple rows into a single record

  • Splitting and joining multi-valued cells lets you clean the individual values within them

  • When creating multi-valued cells in your data, choose a separator that will not appear in the data values
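
The points above about multi-valued cells can be illustrated outside OpenRefine with a minimal Python sketch. This is not how OpenRefine implements the feature; the function names `split_cell` and `join_values` and the sample author string are made up for illustration. It shows why the separator matters: the names contain commas, so a comma would split them in the wrong places, while `|` does not occur in the values.

```python
def split_cell(cell, separator="|"):
    """Split a multi-valued cell into a list of values, trimming stray spaces."""
    return [value.strip() for value in cell.split(separator)]


def join_values(values, separator="|"):
    """Join individual values back into a single multi-valued cell."""
    return separator.join(values)


# Hypothetical cell holding several author names in "Surname, Forename" form.
authors_cell = "Smith, Jane| Doe, John |Lee, Kim"

authors = split_cell(authors_cell)        # one entry per author, whitespace cleaned
rejoined = join_values(authors)           # back to a single cell after cleaning

# Splitting on a comma instead cuts each name in half,
# because the comma also appears inside the data values.
wrong_split = authors_cell.split(",")

print(authors)
print(rejoined)
print(len(wrong_split), "pieces when splitting on a comma")
```

In OpenRefine itself the equivalent steps are the split and join options for multi-valued cells in the column menu; the sketch only mirrors the idea.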

Faceting and filtering
  • You can use facets and filters to explore your data

  • You can use facets and filters to work with a subset of data in OpenRefine

  • You can easily correct common data issues from a Facet
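
As a rough analogy for the facet points above, the following Python sketch counts the distinct values in a column, selects the rows that share one value, and corrects them. It assumes a tiny made-up data set; `text_facet` and `select_rows` are hypothetical helper names, not OpenRefine functions.

```python
from collections import Counter

# Made-up sample rows; "english" is an inconsistent variant of "English".
rows = [
    {"title": "A", "language": "English"},
    {"title": "B", "language": "english"},
    {"title": "C", "language": "French"},
    {"title": "D", "language": "English"},
]


def text_facet(rows, column):
    """Count how often each distinct value occurs in a column."""
    return Counter(row[column] for row in rows)


def select_rows(rows, column, value):
    """Return only the rows whose column matches the chosen facet value."""
    return [row for row in rows if row[column] == value]


before = text_facet(rows, "language")

# Work with a subset: pick the odd value from the facet and fix it.
for row in select_rows(rows, "language", "english"):
    row["language"] = "English"

after = text_facet(rows, "language")

print(before)
print(after)
```

The counts make the stray variant easy to spot, which is the same reason a facet is a convenient place to correct common data issues.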

Clustering
  • Clustering is a way of finding variant forms of the same piece of data within a dataset (e.g. different spellings of a name)

  • There are a number of different Clustering algorithms that work in different ways and will produce different results

  • The best clustering algorithm to use will depend on the data

  • Using clustering you can replace varying forms of the same data with a single consistent value
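
To give a feel for how clustering can group variant forms, here is a simplified Python sketch of a key-collision approach: each value is reduced to a normalized key, and values sharing a key are treated as one cluster. The normalization below (trim, lowercase, strip ASCII punctuation, sort unique words) is a simplification chosen for illustration and should not be read as OpenRefine's exact algorithm. The function names and the publisher list are made up.

```python
import string
from collections import Counter, defaultdict


def fingerprint(value):
    """Reduce a value to a normalized key (simplified, for illustration only)."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def find_clusters(values):
    """Group values by key and keep groups with more than one distinct form."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


def merge_clusters(values):
    """Replace every member of a cluster with its most frequent form."""
    replacement = {}
    for group in find_clusters(values):
        canonical = Counter(group).most_common(1)[0][0]
        for value in group:
            replacement[value] = canonical
    return [replacement.get(value, value) for value in values]


# Made-up sample values with variant forms of the same publisher name.
publishers = [
    "Oxford University Press",
    "oxford university press.",
    "Press, Oxford University",
    "Oxford University Press",
    "Wiley",
]

merged = merge_clusters(publishers)
print(merged)
```

A different keying function would group different values, which is one way to see why the choice of algorithm depends on the data. In OpenRefine you review each suggested cluster before merging; the sketch merges automatically only to stay short.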

Working with columns and sorting
  • You can reorder, rename and remove columns in OpenRefine

  • Sorting in OpenRefine always sorts all rows

  • The original order of rows in OpenRefine is maintained during a sort until you use the option to Reorder Rows Permanently
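
The difference between a sort and a permanent reorder can be loosely compared to the following Python sketch. It is only an analogy with made-up titles, not a description of OpenRefine's internals: a sort changes the order in which rows are shown, while the stored order changes only when you reorder permanently.

```python
# Made-up rows in their original (stored) order.
rows = ["Zebra Quarterly", "Apple Review", "Mango Letters"]

# Like applying a sort: a sorted view is produced, the stored order is untouched.
sorted_view = sorted(rows)

# Like "Reorder Rows Permanently": the stored order itself is replaced.
reordered = list(rows)
reordered.sort()

print(rows)         # original order still intact
print(sorted_view)  # what the sort displays
print(reordered)    # stored order after a permanent reorder
```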

Transformations
  • You can alter data in OpenRefine based on specific instructions

  • You can expand the data editing functions that are built into OpenRefine by building your own

Advanced OpenRefine functions
  • OpenRefine can look up custom URLs to fetch data based on what’s in an OpenRefine project

  • Such API calls can be custom built, or one can use existing Reconciliation services to enrich data

  • OpenRefine can be further enhanced by installing extensions
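
The idea of building a custom URL from each cell value can be sketched in Python as follows. The endpoint `https://api.example.org/lookup`, the `q` parameter, the helper name `build_lookup_url` and the sample titles are all hypothetical placeholders; a real service defines its own URL pattern. The sketch only constructs the URLs and does not send any requests.

```python
from urllib.parse import urlencode

# Hypothetical endpoint used purely for illustration.
BASE_URL = "https://api.example.org/lookup"


def build_lookup_url(cell_value, base_url=BASE_URL):
    """Build one lookup URL from a cell value, encoding it safely for a query string."""
    return base_url + "?" + urlencode({"q": cell_value.strip()})


# Made-up cell values from a column of journal titles.
titles = ["Journal of Examples", " Data & Libraries "]

urls = [build_lookup_url(title) for title in titles]
for url in urls:
    print(url)
```

Encoding the value matters because cells often contain spaces or characters such as `&` that would otherwise break the URL.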