Tidy data - in draft - Escape from Spreadsheet Hell - Library Guides at Iowa State University

Introduction to Tidy Data

Like families, tidy datasets are all alike but every messy dataset is messy in its own way. Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning).

Hadley Wickham (2014). "Tidy Data" Journal of Statistical Software. 59(10) http://dx.doi.org/10.18637/jss.v059.i10

Tidy data is a way of structuring data so that it can be easily understood by people and analyzed by machines. Because all tidy datasets are structured the same way, tabular data that follow this standard are easy to explore, understand, use, and update.

This page summarizes the tidy data standard, but we highly recommend that you also read the original paper (up to section 3.4), as it contains excellent examples and more detailed descriptions.

Vocabulary

Tabular data, or a data table, is a systematic representation of information arranged in a grid (often in a spreadsheet). Data tables are used to convey information to people and to format information for machines to process.

Variable

"A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units." There are two types of variables:

Fixed variables (often called "dimensions" in computer science): variables that describe the experimental design and are known or set in advance.

Measured variables: what is measured or observed during the study, experiment, or process.

Observation

"An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes."

3 Principles of Tidy Data

Tidy data is based on three core principles:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a data table.

The principles can be easily visualized through a simple table. The table below contains 3 columns (variables), 4 rows of data (observations), and 1 header row that names the variables. The color coding may help you visualize how the three principles work together to create explicit relationships between data values.

Subject ID	Height	Length
A	5.6	4.0
B	3.5	5.2
C	6.0	6.5
D	5.9	5.8

In this example, the header row is blue and names the variables. The first column is yellow and contains the fixed variable Subject ID. The values for Height and Length are green to indicate that they are both 1) measurements of the variable named by their column and 2) observations of the subject whose Subject ID shares their row. Together, the values in each row describe one observational unit.

Tidy data example
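
One practical benefit of this layout is that variables and observations can be pulled out of the table mechanically. The sketch below is only an illustration, not part of the tidy data standard: it loads the example table above using Python's standard csv module and reads one variable (a column) and one observation (a row). The names TABLE, rows, heights, and subject_c are our own.

```python
import csv
import io

# The example table from this page, written as tab-separated text.
TABLE = (
    "Subject ID\tHeight\tLength\n"
    "A\t5.6\t4.0\n"
    "B\t3.5\t5.2\n"
    "C\t6.0\t6.5\n"
    "D\t5.9\t5.8\n"
)

# Each data row becomes one dict: one observation, keyed by variable name.
rows = list(csv.DictReader(io.StringIO(TABLE), delimiter="\t"))

# A variable is a whole column: every Height value across subjects.
heights = [float(row["Height"]) for row in rows]

# An observation is a whole row: every value recorded for subject C.
subject_c = next(row for row in rows if row["Subject ID"] == "C")

print(heights)
print(subject_c)
```

Because every tidy table follows the same rules, the same two access patterns (pick a column, pick a row) work without any table-specific code.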

Untidy and Messy Data

Data that does not conform to the tidy data principles can still be valuable but may be harder to use. As identified by Dr. Wickham, the most common "messy data" problems are:

Column headers are values, not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same table.
A single observational unit is stored in multiple tables.
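
To make the first problem in the list above concrete, here is a minimal sketch in plain Python. The data are invented for illustration, and the helper name melt and its parameters are our own choices, not taken from the paper. The messy table uses the years 2019 and 2020 as column headers even though they are values of a "year" variable; the helper turns each of those columns into rows.

```python
# Hypothetical messy table: the headers "2019" and "2020" are values of a
# "year" variable, not variable names.
messy = [
    {"subject": "A", "2019": 5.1, "2020": 5.6},
    {"subject": "B", "2019": 3.2, "2020": 3.5},
]


def melt(rows, id_var, var_name, value_name):
    """Turn every non-id column into its own row (wide to long)."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_var:
                continue  # the identifier is repeated on each new row instead
            tidy_rows.append(
                {id_var: row[id_var], var_name: column, value_name: value}
            )
    return tidy_rows


tidy = melt(messy, id_var="subject", var_name="year", value_name="height")
for row in tidy:
    print(row)
```

After melting, each row holds one subject, one year, and one height, so each variable has its own column. Data analysis libraries provide the same operation ready-made; pandas, for example, has a melt function.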

Examples for each are given in Wickham's paper, along with examples of how to tidy such data. Below we've included a small, messy dataset and its tidied version. The messy elements are highlighted to help you identify them.