In this tutorial, we focus using the
tidyr package to organize data into a format that can be easily used in analysis.
In R, it is easiest to work with data that follow five basic rules:
A data set satisfying these rules is said to be tidy, a term popularized by Hadley Wickham.
Remark: Most of the time data that violate rules 4 and 5 are obviously not tidy, and there are easy ways to exclude footnotes and titles in spreadsheets by simply omitting the offending rows. This tutorial focuses on the “sneakier” form of untidiness that violates at least one of the first three rules.
This tutorial will describe the following
tidyr commands, which can be thought of as verbs for tidying data:
||collapses multiple columns into two columns|
||creates multiple columns from two columns|
||splits compound variables into individual columns|
UBS (derived from the Union Bank of Switzerland) is an international bank that reports prices of various staples in major cities every three years. The data set in the
UBSprices data set contains prices of a 1 kg bag of rice in 2009 and 2003 in major world cities.
This data set is not tidy because each row contains two cases: the city in 2003 and the city in 2009. Additionally, one of our variables of interest, year, is given in the column names
2009. Instead, year should provided in just one column. In order to tidy these data, we need to
To do this, we will use the
gather function in the
gather collapses multiple columns into two columns: a key column and a value column. The key will be a new variable containing the old column names and the value will contain the information recorded in the cells of the collapsed columns.
In our example, we want to collapse the original two columns,
rice2009, into the key-value pair
price. To do this, we use the following command:
tidy_ubs <- gather(data = UBSprices, key = year, value = price, rice2003, rice2009) head(tidy_ubs)
gathershould be the data frame being tidied. This is true for all of the
tidyrfunctions we discuss in this tutorial.
On Your Own
gathercommand? Use the
dim(UBSprices)command to determine how many rows are in the UBSprices data set and
dim(tidy_ubs)to determine how many are in the tidy_ubs data set).
Finally, we will modify the year column by removing the word rice from each cell. To do this, we can use the
extract_numeric function in the
tidyr package. We now have a data set that we can call tidy.
tidy_ubs$year <- extract_numeric(tidy_ubs$year) head(tidy_ubs)
This data set started in a relatively tidy form, so it may be difficult to see the benefit of tidying it. Tidy data is typically required for summarizing and plotting data in R. For example, consider making a side-by-side boxplot using
ggformula (see the tutorial An Introduction to ggformula).
tidy_ubs %>% gf_boxplot(price ~ factor(year)) %>% gf_labs(x = "year")
This was straightforward since tidy_ubs was already tidy, but would have required extra manipulation in the original format.
separate + gather)
The Polls data set contains the results of various presidential polls conducted during July 2016, and was scraped from RealClear Politics.
Here, the data set is not tidy because
Datecolumn contains both the beginning and end dates. These should be stored in separate columns.
Samplecolumn contains two variables: the number of people in the sample and the population that was sampled (likely voters or registered voters). These should be stored in separate columns.
partyvariables, which should be stored in their own columns.
To break a single character column into multiple new columns we use the
separate function in the
To begin, let’s break the
Date column into
Polls2 <- separate(data = Polls, col = Date, into = c("Begin", "End"), sep = " - ") head(Polls2)
col, specifies the name of the column to be split.
into, specifies the names of the new columns. Note that since these are specific column names we are creating, they should be given in quotes.
sep = " - ", then R would erroneously use
\as the separator. To manually specify the separator between columns we can place the character(s) in quotes.
sep = " - ", the spaces around
-avoid excess whitespace in the resulting cell values.
In the space below, separate the
Sample column into
population columns. Here we will use
sep = " " to identify that a blank space is used as the separator. If you have difficulty, click the hint button.
Polls3 <- separate(data = Polls2, col = Sample, into = c("size", "population"), sep = " ") head(Polls3)
Next, we need to
gather the last four columns into a
Polls4 <- gather(data = Polls3, key = candidate, value = percentage, 7:10) head(Polls4)
Notice that instead of writing out the column names (
Trump..R., etc.) we can simply specify the column numbers—here
7:10 specifies that we are gathering columns 7 through 10.
Finally, we need to
separate the candidate names from the political party.
Polls5 <- separate(Polls4, candidate, into= c("candidate", "party")) head(Polls5)
In the last command we let R guess which separator to use. This worked, but resulted in a warning message—we’re lucky that it worked! There are many situations where the separator is too complex for R to guess correctly and it cannot be specified using a simple character in quotes. In such cases we need to use regular expressions (see the tutorial Manipulating Text) to aid our data tidying, but that’s a topic for another tutorial. The important thing to note here is that you should always check that
separate worked as you expected, don’t blindly trust it!
gather + separate + spread)
The airlines data set contains the raw data behind the article Should Travelers Avoid Flying Airlines That Have Had Crashes in the Past? that appeared on fivethirtyeight.com.
In this example, a case is best described as an airline in a specific time frame, so these data are not tidy because each case is not its own row. Additionally, the last six column names contain the time frame, which is a value. In order to tidy this data set we must
yearscolumn to specify the time frame,
gather the last six columns into a common
accidents column. This will allow us to easily create the
Airlines2 <- gather(Airlines, key = accidents, value = count, 3:8) head(Airlines2)
separate the values of the new
accidents column into
var (short for variable) and
years. The default guessing scheme fails here, so we must specify
sep = "[.]" to denote that the period is the separator. If you want to learn more about why we need brackets around the period you need to delve into regular expressions (see the tutorial Manipulating Text).
Airlines3 <- separate(Airlines2, accidents, into = c("var", "years"), sep = "[.]") head(Airlines3)
Finally, we need to ensure that each row corresponds to a case. (Don’t worry, this will also make each column a variable!) Currently, there are six rows for each airline: one for each
var in each time frame. To solve this problem, we need to
spread out the
var column so that each variable has its own column.
Airlines4 <- spread(data = Airlines3, key = var, value = count) head(Airlines4)
Notice that the first argument given to
spread is the data frame, followed by the key-value pair. The key is the name of the column whose values will be used as column headings and the value is the name of the column whose values will populate the cells of the new columns. In this example, we use
var as the key and populate the cells with the