Overview
Teaching: 15 min
Exercises: 20 minQuestions
How can we bring our data into OpenRefine?
How can we sort and summarize our data?
How can we find and correct errors in our raw data?
Objectives
Create a new OpenRefine project from a CSV file.
Recall what facets are and how they are used to sort and summarize data.
Recall what clustering is and how it is applied to group and edit typos.
Manipulate data using previous steps with undo/redo.
Employ drop-downs to split values from one column into multiple columns.
Employ drop-downs to remove white spaces from cells.
Start the program. (Double-click on the openrefine.exe file (or google-refine.exe if using an older version). Java services will start on your machine, and OpenRefine will open in your browser).
Launch OpenRefine (see Getting Started with OpenRefine).
OpenRefine can import a variety of file types, including tab separated (tsv
), comma separated (csv
), Excel (xls
, xlsx
), JSON, XML, RDF as XML, Google Spreadsheets. See the OpenRefine Importers page for more information.
In this first step, we’ll browse our computer to the sample data file for this lesson. In this case, we modified the idigbio_rodents
CSV file. Some data in weight
, length
, are contrived and some spaces have been added to some text fields and these anomalies are in no way related to the original dataset.
If you haven’t already, download the data from:
https://figshare.com/s/6fe692e2883347b4c15f
Once OpenRefine is launched in your browser, the left margin has options to Create Project
, Open Project
, or Import Project
. Here we will create a new project:
Create Project
and select Get data from
This Computer
.Choose Files
and select the file idigbio_rodents.csv
. Click Open
or double-click on the filename.Next>>
under the browse button to upload the data into OpenRefine.Update Preview
(bottom left).UTF-8
. In this case, it will insure that your text, no matter what language, is translated properly so that you don’t see garbled output like Å
for Å, for example.
<<Start Over
(upper left).Create Project>>
(upper right).Note that at step 1, you could upload data in a standard form from a web address by selecting Get data from
Web Addresses (URLs)
. However, this won’t work for all URLs.
Easily reorder columns to simplify work
OpenRefine makes it very simple to organize your columns, in comparison to a spreadsheet program.
Use the scroll bar at the bottom of your screen to have a look across all the fields in this dataset. Let’s move the genus
, specificEpithet
, and scientificName
columns to just after catalogNumber
. Also, move the decimalLongitude
column to be before decimalLatitude
.
All
column header (the very first column), note the drop-down arrow.Edit columns
> Re-order
\remove columns
.genus
and click and hold to drag it up just after catalogNumber
.specificEpithet
, dragging it up to just after genus
, and finish by moving scientificName
just after specificEpithet
and decimalLongitude
column to before decimalLatitude
.OK
to complete moving the columns.Exploring data by applying multiple filters
OpenRefine supports faceted browsing as a mechanism for
Typically, you create a facet on a particular column. The facet summarizes the cells in that column to give you a big picture of that column, and allows you to filter to some subset of rows for which the cells in that column satisfy some constraint. That’s a bit abstract, so let’s jump into some examples.
Here we will use faceting to look for potential errors in data entry in the scientificName
column.
scientificName
column.Facet
> Text facet
.scientificName
column
along with a number representing how many times that value occurs in the column.Facet
list. You should see that you have an edit
function available.Solution
There will be several near-identical entries in
scientificName
. For example, there is one entry forDipodomys agilia
and one entry forDipodomis agilis
. These are both misspellings ofDipodomys agilis
. We will see how to correct these misspelled and mistyped entries in a later exercise.
Exercise
Using faceting, find out how many years are represented in this dataset.
Is the column formatted as Number, Date, or Text? How does changing the format change the faceting display?
Which years have the most and least vouchered specimens and how many?
Solution
- For the column
year
doFacet
>Text facet
. A box will appear in the left panel showing that there are 70 unique entries in this column.- Try
Edit cells
>Common transforms
>To text
, and you will note theyear
values are black and no longer green and the text is left-justified. Now the columnyear
is formatted as Text. You can change the format again by doingEdit cells
>Common transforms
>To number
and see the values are green again and right-justified. DoingFacet
>Numeric facet
creates a box in the left panel that shows a histogram of the number of entries per year. Notice that the data is shown as a number, not a date. If you instead transform the column to a date, the program will assume all entries are on January 1st of the year (yikes!).- After creating a facet, click
Sort by count
in the facet box. The year with the most specimens collected (vouchered) is 2007 with 711. The least is 1973 with one.
In OpenRefine, clustering means “finding groups of different values that might be alternative representations of the same thing”. For example, the two strings New York
and new york
are very likely to refer to the same concept and just have capitalization differences. Likewise, Gödel
and Godel
probably refer to the same person. Clustering is a very powerful tool for cleaning datasets which contain misspelled or mistyped entries. OpenRefine has several clustering algorithms built in. Experiment with them, and learn more about these algorithms and how they work.
scientificName
Text Facet we created in the step above, click the Cluster
button.Method
and the Distance Function
. Try different combinations to
see what different mergers of values are suggested.nearest neighbor
method and levenshtein
distance function. It should identify nine clusters.Merge?
box beside the first five, then click Merge Selected and Recluster
to apply the corrections to the dataset.Methods
and Distance Functions
again, to see what new merges are suggested. You may find there are
still improvements that can be made, but don’t Merge
again; just Close
when you’re done. We’ll now
see other operations that will help us detect and correct the remaining problems, and that have other, more general uses.Important: If you Merge
using a different method or distance function, or more times than described in the instructions above,
your solutions for later exercises will not be the same as shown in those exercise solutions.
If data in a column needs to be split into multiple columns, and the parts are separated by a common separator (say a comma, or a space), you can use that separator to divide up the pieces into their own columns.
scientificName
column into separate columns.scientificName
column. Choose Edit Column
> Split into several columns...
Separator
box, replace the comma with a space.Remove this column
.OK
. You’ll get some new columns called scientificName 1
, scientificName 2
, and so on.scientificName 1
and scientificName 2
are empty, as are scientificName4
and scientificName5
. (HINT: Use Facet
> Text facet
to help find the problems). Why is this? What do you think we can do to fix this?Solution
For example, the entry that has data in
scientificName 2
andscientificName 3
but not thescientificName1
column had an extra space at the beginning of the entry. Leading white spaces are very difficult to notice when cleaning data manually. This is another advantage of using OpenRefine to clean your data. We’ll look at how to fix leading and trailing white spaces in a later exercise. HINT: Why do you think there are now columnsscientificName4
andscientificName5
?
Exercise
Try to change the name of the second new column to “specificEpithet”. How can you correct the problem you encounter?
Solution
- On the
scientificName 2
column, click the down arrow and thenEdit column
>Rename this column
.- Type
specificEpithet
into the box that appears. A pop-up will appear that saysAnother column already named specificEpithet
. This is because there is another column where we’ve recorded thespecificEpithet
already.- You can choose another name like
specificEpithet_test
for this column or change the otherspecificEpithet
column to another name likespeciesEpithet_original
.
It’s common while exploring and cleaning a dataset to discover after you’ve made a change that you really should have done something else first. OpenRefine provides Undo
and Redo
operations to make this easy.
Undo / Redo
on the left side of the screen. All the changes you have made so far are listed here.scientificNames
were clustered, but not yet split.Important: If you skip this step, your solutions for later exercises will not be the same as shown in those exercise solutions.
Words with spaces at the beginning or end are particularly hard for we humans to tell from strings without, but the blank characters will make a difference to the computer. We usually want to remove these. OpenRefine provides a tool to remove blank characters from the beginning and end of any entries that have them.
scientificName
, choose Edit cells
> Common transforms
> Trim leading and trailing whitespace
.Split
step has now disappeared from the Undo / Redo
pane on the left and is replaced with a Text transform on 3 cells
Split
operation on scientificName
that you undid earlier. Now you should only get three new columns. Why?Facet
> text facet
on scientificName2
column. Scroll down the facet
results to discover the last entry. Why is there a blank cell for scientificName2
in the second record row? Can you figure out how to fix this?Solution
For part 3 above, removing the leading white spaces means that each entry in this column has exactly one space (between the
genus
, speciesspecific epithet
, and theinfraspecificEpithet
names). Therefore, when you split with space as the separator, you will get only three columns.For the Challenge (part 4), you will have discovered 2 consecutive whitespaces in between the
genus
andspecificEpithet
in the originalscientificName
column. To fix this you will be glad to note that in addition to leading and trailing whitespaces, Open Refine gives you a way to find and fix these too.
- First, use
Undo/Re-do
to go back one step to remove theSplit 15000 cell(s) in column scientificName into several columns by separator
.- On the
scientificName
column drop-down, selectEdit cells
>Common transforms
> selectCollapse consecutive whitespace
.- Now you can repeat the
Split
operation onscientificName
and you will get 3 new columns but no unexpected blanks. You can test this out bytext facet
of each resulting column and reviewing thefacet
values in the left panel.
Important: Undo
the splitting step before moving on to the next lesson. If you skip this step, your solutions
for later exercises will not be the same as shown in those exercise solutions.
Key Points
Faceting and clustering approaches can identify errors or outliers in data.