Data wrangling withdplyr

1. Introduction

Data manipulation is central to data analysis and is often the most time consuming portion of an analysis. The dplyr package contains a suite of functions to make data manipulation easier. The core functions of the dplyr package can be thought of as verbs for data manipulation.

Verb(s) Meaning
filter and slice pick specific observations (i.e. specific rows)
arrange reorder the rows
select pick variables by their names (i.e. specific columns)
mutate add new calculated columns to a data frame
summarize aggregate many rows into a single row
group_by groups data by one or more variables

Data: In this tutorial, we will use the Colleges data frame which contains information on predominantly bachelor’s-degree granting institutions from 2015. The variables and corresponding descriptions are listed below:

Variable Description
unitid A unique ID number for each school
college School name
type School type: public or private
city City
state State abbreviation
region Region of the U.S.
admissionRate Proportion of undergraduate applicants admitted
ACTmath Median ACT math score
ACTenglish Median ACT english score
undergrads Undergraduate enrollment
cost Total cost of attendance
gradRate Proportion of students graduating within six years
FYretention Proportion of first year students returning for a second year
fedloan Proportion of students with federal student loans
debt Median principal of student loans entering repayment

To view the first six rows of the available data, we use the head() function. Select the Run Code button on the right to run the code in the section below.

head(Colleges,6)

On Your OWn

  • Modify the code below to apply the tail() and str() functions to the Colleges data.
head(Colleges)
  1. How would you use the ‘tail’ function to view the last 10 rows of the Colleges data? If you are not sure, try modifying the code chunk above.“,
    • tail(Colleges, ‘10’)
    • head(Colleges, tail = ‘10’)
    • tail(Colleges, 10)
    • head(Colleges, tail = 10)
  2. How many colleges are listed in this data frame?

2. Filtering rows

To extract the rows only for colleges and universities in a specific state we use the filter function. For example, we can extract the colleges in Wisconsin from the Colleges data frame using the following code:

Wi <- filter(Colleges, state == "WI")
head(Wi)

Remarks

  • The first argument given to filter is always the data frame (this is true for all the core functions in dplyr), followed by logical tests that the returned cases must pass. In our example, the test was whether the school was in Wisconsin, which is written as state == "WI".

  • We have to use == to indicate equality because = is equivalent to <-.

  • When testing character variables, be sure to use quotes to specify the value of the variable that you are testing.

  • To specify multiple tests, use a comma to separate the tests (think of the comma as the word “and”). For example, to filter the data to only those rows corresponding to schools in Wisconsin with fewer than 2,000 undergraduate students:

smallWI <- filter(colleges, state == “WI”, undergrads < 2000)
  • To specify that at least one test must be passed, use the | character instead of the comma. For example, the below test checks whether a college is in Wisconsin or Minnesota or Iowa, so it returns all of the colleges in Wisconsin, Minnesota, and Iowa.
smallWI <- filter(colleges, state == “WI”, undergrads < 2000)
  • You can use both | and , to specify multiple tests. For example, we can return all colleges with fewer than 2,000 undergraduate students in Wisconsin, Minnesota, and Iowa.
smallWIM <- filter(Colleges, state == "WI" | state == "MN" | state == "IA", undergrads < 2000)
head(smallWIM)
  • Common comparison operators for the tests include: >, >=, <, <=, != (not equal), and == (equal).

  • To remove rows with missing values, use the R command na.omit. Running the code below will reduce the data frame to only rows with no missing values.

Colleges <- na.omit(Colleges)
  • Running the code below will eliminate only rows with NA in the cost column.
Colleges <- filter(Colleges, !is.na(cost))

On Your Own

Modify the code below to answer the following questions

  1. How many Maryland colleges are in the Colleges data frame? (The abbreviation for Maryland is MD.)

  2. How many private Maryland colleges with under 5000 undergraduates are in the Colleges data frame?

smallWIM <- filter(Colleges, state == "WI" | state == "MN" | state == "IA", undergrads < 2000)
head(smallWIM)

3. Slicing rows

To extract rows 10 through 16 from the Colleges data frame, we use the slice() function.

slice(Colleges, 10:16)

Remarks

  • To select consecutive rows, create a vector of the row indices by separating the first and last row numbers with a :.

  • To select non-consecutive rows, create a vector manually by concatenating the row numbers using c(). For example, to select the 2nd, 18th, and 168th rows use slice(Colleges, c(2, 18, 168)).

On Your Own

Modify the code above to answer the following question:

  1. What is the name of the college in row 1200 of the Colleges data frame?

4. Arranging rows

To sort the rows by total cost, from the least expensive to the most expensive, we use the arrange() function.

costDF <- arrange(Colleges, cost)
head(costDF)

Remarks

  • By default, arrange() assumes that we want the data arranged in ascending order by the specified variable(s).

  • To arrange the rows in descending order, wrap the variable name in the desc() function. For example, to arrange the data frame from most to least expensive we would use the following command:

costDF <- arrange(Colleges, desc(cost))
  • To arrange a data frame by the values of multiple variables, list the variables in a comma separated list. The order of the variables specifies the order in which the data frame will be arranged.

For example, running the code below reorders Colleges first by the median ACT math score (in descending order) and then by the ACT english score (in descending order).

actDF <- arrange(Colleges, desc(ACTmath), desc(ACTenglish))

On Your Own

Modify the code below to answer the following quesitons:

Cost1 <- filter(Colleges, !is.na(cost))
Cost1 <- filter(Cost1, state == "CA")
Cost1 <- arrange(Cost1, -cost)
head(Cost1,10)
  1. What school is most expensive?
  2. What school has the least expensive tuition in Wisconsin?
  3. What is the most expensive college in your state?

5. Selecting columns

Suppose that you are only interested in a subset of the columns in the data frame—say, college, city, state, undergrads, and cost—and want to create a data frame with only these columns.

To do this, we select the desired columns using the select() function:

lessCols <- select(Colleges, college, city, state, undergrads, cost)
head(lessCols)

Remarks

  • After specifying the data frame, list the variable names to select from the data frame separated by commas.

  • In some cases you may want to drop a small number of variables from a data frame. In this case, putting a negative sign - before a variable name tells select() to include all but the negated variables. For example, if we only wished to drop the unitid variable we run the following command:

drop_unitid <- select(Colleges, -unitid)
head(drop_unitid)

On Your Own

In the space below, modify the Colleges data frame:

  • First, use the filter() function to remove any rows that have missing values in either the ACTmath or ACTenglish columns.
  • Then, limit the columns to college, state, ACTmath, ACTenglish, cost, and gradRate.
  • Finally, sort the data by the ACTmath, then ACTenglish scores, from the highest to lowest scores.
Col1 <- filter(Colleges, !is.na(ACTmath), !is.na(ACTenglish))
Col1 <- select(Col1, college, state, ACTmath, ACTenglish, cost, gradRate)
Col1 <- arrange(Col1, -ACTmath, -ACTenglish)
dim(Col1)
  1. What college has the hightest ACTmath score?
  2. How many rows are in this new data frame?
  3. How many colleges and an average ACTmath score of 34?

6. Mutating data

Data frames often do not contain the exact variables we need, but contain all of the information necessary to calculate the needed variables. In this case, we can use the mutate() function to add a new column to a data frame that is calculated from other variables.

For example, we may wish to report percentages rather than proportions for the admissions rate:

Colleges <- mutate(Colleges, admissionPct = 100 * admissionRate)
head(Colleges)

Make sure you use the arrow key to view the columns at the end of this data frame.

Remarks

  • After specifying the data frame, give the name of the new variable and it’s definition. Notice that we need to use = to assign the value of the new variable.

  • To add multiple variables once, separate the list of new variables by commas. For example, we can also add percentage versions of FYretention and gradRate.

Colleges <- mutate(Colleges, 
                   FYretentionPct = 100 * FYretention,
                   gradPct = 100 * gradRate)
head(Colleges)

8. Summarizing rows

To create summary statistics for columns within the data frame we must aggregate all of the rows using the summarize() function. Note that you can also use summarise(), which is the British spelling of “summarize”:

For example, to calculate the average cost of all 1776 colleges in our data frame we run the following command:

summarize(Colleges, medianCost = mean(cost, na.rm = TRUE))

Remarks

  • As with all of the functions we have seen, the first argument should be the name of the data frame.

  • We add the argument na.rm = TRUE here to remove any missing values in the cost column before the calculation. Many functions, including the summarize() function, will return an error if there are missing values (blanks, NAs or NaNs) in your data.

  • summarize() returns a data frame, with one row and one column.

  • We can ask for multiple aggregations in one line of code by simply using a comma separated list.

For example, we can calculate the five number summary of cost for all 1776 colleges in our data frame:

summarize(Colleges, 
          min = min(cost, na.rm = TRUE), 
          Q1 = quantile(cost, .25, na.rm = TRUE), 
          median = median(cost, na.rm = TRUE), 
          Q3 = quantile(cost, .75, na.rm = TRUE), 
          max = max(cost, na.rm = TRUE))
  • Notice that even when multiple statistics are calculated, the result is a data frame with one row and the number of columns correspond to the number of summary statistics.

On Your Own

  1. What happens if we remove na.rm = TRUE from the code above?

9. Groupwise manipulation

Often it is of interest to manipulate data within groups. For example, we might be more interested in creating separate summaries for each state, or for private and public colleges. To do this we must first tell R what groups are of interest using the group_by() function, and then we can use any of the functions previously mentioned in this tutorial. Most often group_by() is paired with the summarize() or mutate() functions.

Let’s first consider comparing the cost of private and public colleges. First, we must specify that the variable type defines the groups of interest.

Colleges_by_type <- group_by(Colleges, type)
str(Colleges_by_type)

Remarks

  • After specifying the data frame, list the categorical variable(s) defining the groups.

  • While the dimensions of the data frame does not change, the output near the end of the str(Colleges_by_type) tells us that groups are a Factor with 2 levels “private”, “public”. Be sure to pay attention to this to ensure it matches your expectations. For example, if there were any typos in the column or if just one value is capitalized (such as Public) we would be told there are more than two groups.

  • Multiple variables can be used to specify the groups. For example, to specify groups by state and type, we would run the following command:

Colleges_state_type <- group_by(Colleges, state, type)
dim(Colleges_state_type)

Combining group_by with other commands

Once we have a grouped data frame, we can obtain summaries by group via summarize(). For example, the five number summary of cost by institution type is obtained below

Colleges_by_type <- group_by(Colleges, type)
summarize(Colleges_by_type, 
          min = min(cost, na.rm = TRUE), 
          Q1 = quantile(cost, .25, na.rm = TRUE), 
          median = median(cost, na.rm = TRUE), 
          Q3 = quantile(cost, .75, na.rm = TRUE), 
          max = max(cost, na.rm = TRUE))

We can also calculate new variables within groups, such as the standardized cost of attendance within each state:

Colleges_by_state <- group_by(Colleges, state)
Colleges_by_state <- mutate(Colleges_by_state, 
                            mean.cost = mean(cost, na.rm = TRUE), 
                            sd.cost = sd(cost, na.rm = TRUE),
                            std.cost = (cost - mean.cost) / sd.cost)
Colleges_by_state <- arrange(Colleges_by_state, state)
Colleges_by_state

Remarks

  • mutate() allows you to use variables defined earlier to calculate a new variable. This is how std.cost was calculated.

  • The group_by() function returns an object of class c("grouped_df", "tbl_df", "tbl", "data.frame"), which looks confusing, but essentially allows the data frame to be printed neatly. To convert the results back to a data.frame using the as.data.frame() function.

Colleges_by_state_df <- as.data.frame(Colleges_by_state)

On Your Own

Modify the code below to answer the following quesitons:

Colleges_by_type <- group_by(Colleges, type)
summarize(Colleges_by_type, 
          min = min(cost, na.rm = TRUE), 
          Q1 = quantile(cost, .25, na.rm = TRUE), 
          median = median(cost, na.rm = TRUE), 
          Q3 = quantile(cost, .75, na.rm = TRUE), 
          max = max(cost, na.rm = TRUE))
Colleges_by_type <- group_by(Colleges, region)
summarize(Colleges_by_type, 
          min = min(cost, na.rm = TRUE), 
          Q1 = quantile(cost, .25, na.rm = TRUE), 
          median = median(cost, na.rm = TRUE), 
          Q3 = quantile(cost, .75, na.rm = TRUE), 
          max = max(cost, na.rm = TRUE),
          counts = n())
  1. Group the Colleges data frame by region. How many regions are in this data frame?

  2. Calculate the five number summery for the cost of each region. Which region tends to have the lowest cost?

  3. Calculate the number of rows for each region. Which region has the most colleges?

10. References and resources

  • Find a complete list of tutorials at Stat2Labs.

  • RStudio’s data wrangling cheat sheet provides a nice summary of the functions in the dplyr package, including those covered in this tutorial.

  • The introductory vignette to dplyr provides an example of wrangling a data frame consisting of 336,776 flights that departed from New York City in 2013.

  • Roger Peng’s video overview of the dplyr package.