Authors: Christie Bahlai, Aleksandra Pawlik
Contributors: Jennifer Bryan, Alexander Duryee, Jeffrey Hollister, Daisie Huang, Owen Jones,
Ben Marwick and Sebastian Kupny.
The most common mistake made is treating the program like it is a lab notebook - that is, relying on context, notes in the margin, spatial layout of data and fields to convey information. As humans, we can (usually) interpret these things, but computers are dumb, and unless we explain to the computer what every single thing means (and that can be hard!), it will not be able to see how our data fit together.
Using the power of computers, we can manage and analyze data in much more effective and faster ways, but to use that power, we have to set up our data for the computer to be able to understand it (and computers are very literal).
This is why it’s extremely important to set up well-formatted tables from the outset - before you even start entering data from your very first preliminary experiment. Data organization is the foundation of your research project. It can make it easier or harder to work with your data throughout your analysis, so it's worth thinking about when you're doing your data entry or setting up your experiment. You can set things up in a different way in spreadsheets, but it limits your ability to work with the data in other programs or have the you-of-6-months-from-now or your collaborator work with the data.
Note: the best layouts/formats (as well as software and interfaces) for data entry and data analysis might be different. It is important to take this into account, and ideally automate the conversion from one to another.
When you're working with spreadsheets, during data clean up or analyses, it's very easy to end up with a spreadsheet that looks very different from the one you started with. In order to be able to reproduce your analyses or figure out what you did when Reviewer #3 asks for a different analysis, you must
This might be an example of a spreadsheet setup:
Put these principles in to practice today during your Exercises.
The cardinal rules of using spreadsheet programs for data:
For instance, we have data from a survey of small mammals in a desert ecosystem. Different people have gone to the field and entered data in to a spreadsheet. They keep track of things like species, plot, weight, sex and date collected.
If they were to keep track of the data like this:
the problem is that species and sex are in the same field. So, if they wanted to look at all of one species or look at different weight distributions by sex, it would be hard to set up the data to do this. If instead we put sex and species in different columns, you can see that it would be much easier.
The rule of thumb, when setting up a datasheet, is columns = variables, rows = observations, cells = data (values).
So, instead we should have:
We're going to take a messy version of the survey data and clean it up.
Download the data by clicking here
Open up the data in a spreadsheet program
You can see that there are four tabs. Data was collected from different sources by different people. Now you're the person in charge of this project and you want to be able to start doing statistics with the data.
With the person next to you, look at the messy data and identify some things wrong with this spreadsheet for good computer organization.
Important Do not forget of our first piece of advice, the create a new file (or tab) for the cleaned data, never modify the original (raw) data.
After you go through this exercise, we'll discuss as a group what you think was wrong with this data.
An excellent reference, in particular with regard to R scripting is
Hadley Wickham, Tidy Data, Vol. 59, Issue 10, Sep 2014, Journal of Statistical Software. http://www.jstatsoft.org/v59/i10.
Previous: Introduction Next: Formatting problems