It is often necessary to create graphs to effectively communicate key patterns within a dataset. Many software packages allow the user to make basic plots, but it can be challenging to create plots that are customized to address a specific idea. While there are numerous ways to create graphs, this tutorial will focus on the R package ggformula, created by Danny Kaplan and Randy Pruim.
Data: In this tutorial, we will use the AmesHousing data, which provides information on the sales of individual residential properties in Ames, Iowa from 2006 to 2010. The data set contains 2930 observations, and a large number of explanatory variables involved in assessing home values. A full description of this dataset can be found here.
To start, we will focus on just a few variables:
SalePrice: The sale price of the home
GrLivArea: The above ground living area in the home (in square feet)
Fireplaces: The number of fireplaces in the home
KitchenQuality: The quality rating of the kitchen (Excellent, Good, Average, Fair, or Poor)
Select the Run Code button on the right to run the code in the section below.
# The code below uses the head() function to view the first four lines of the AmesHousing data.
head(AmesHousing, 4)
The ggformula package is based on another graphics package called ggplot2. It provides an interface that makes coding easier for people new to coding in R. One primary benefit is that it follows the same intuitive structure provided by the creators of the mosaic package.
For example, if our goal is to create a scatterplot in ggformula, we use the function gf_point() to create a graph with points. We then use SalePrice as our y variable and GrLivArea as our x variable. The ... in a function's argument list indicates that we have an option to add additional arguments, but they are not required.
# Create a scatterplot of above ground living area by sales price
gf_point(SalePrice ~ GrLivArea, data = AmesHousing)
It is easy to make modifications to the color, shape and transparency of the points in a scatterplot.
# Create a scatterplot with log transformed variables, using a fixed color, shape, and transparency
gf_point(log(SalePrice) ~ log(GrLivArea), data = AmesHousing, color = "navy", shape = 15, alpha = .2)
On Your Own
Set alpha to any value between 0 and 1. How does changing alpha modify the points on a graph?
What happens to the points when shape = 1?
Set color = ~ KitchenQuality. What color corresponds to Kitchen Quality = Good?
Notice that fixed color names are given in quotes, color = "navy". However, if we select colors based upon a variable from our data frame, we treat it as an explanatory variable, ~ x, in our model.
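The difference between a fixed and a mapped color can be checked on a small example. The sketch below uses a hypothetical toy data frame (the values are made up and only borrow the AmesHousing column names) and inspects the colors that each plot actually draws, assuming the ggformula and ggplot2 packages are installed.

```r
library(ggformula)

# Hypothetical toy data standing in for AmesHousing (values are made up)
toy <- data.frame(
  SalePrice      = c(120000, 150000, 210000, 305000, 180000, 260000),
  GrLivArea      = c(900, 1100, 1600, 2400, 1300, 2000),
  KitchenQuality = c("Fair", "Average", "Good", "Excellent", "Average", "Good")
)

# Fixed attribute: a quoted color name applies to every point
p_fixed <- gf_point(SalePrice ~ GrLivArea, data = toy, color = "navy")

# Mapped attribute: a one-sided formula ties color to a variable
p_mapped <- gf_point(SalePrice ~ GrLivArea, data = toy, color = ~ KitchenQuality)

# Collect the distinct colors used for the points in each plot
fixed_colors  <- unique(ggplot2::layer_data(p_fixed)$colour)
mapped_colors <- unique(ggplot2::layer_data(p_mapped)$colour)

length(fixed_colors)   # one color shared by all points
length(mapped_colors)  # one color per kitchen quality level
```

With the fixed color every point shares one value, while the mapped version assigns one color to each of the four kitchen quality levels in the toy data.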
The scatterplots above suffer from overplotting, that is, many values are being plotted on top of each other many times. We can use the alpha argument to adjust the transparency of points so that higher density regions are darker. By default, this value is set to 1 (non-transparent), but it can be changed to any number between 0 and 1, where smaller values correspond to more transparency. Another useful technique is to use the facet option to render scatterplots for each level of an additional categorical variable, such as kitchen quality. In ggformula, this is easily done using the gf_facet_grid() layer.
We can use the pipe operator %>% to add a new layer into a graph. This pipe operator is an easy way to create a chain of processing actions by allowing an intermediate result (left of the %>%) to become the first argument of the next function (right of the %>%). Below we start with a scatterplot and then pass that scatterplot to the gf_facet_grid() function to create distinct panels for each type of kitchen quality. Then the result is again passed to the gf_labs() function, which adds titles and labels to the graph.
# Create distinct scatterplots for each type of kitchen quality
gf_point(SalePrice/100000 ~ GrLivArea, data = AmesHousing) %>%
  gf_facet_grid(KitchenQuality ~ . ) %>%
  gf_labs(title = "Figure 3: Housing Prices in Ames, Iowa",
          y = "Sale Price (in $100,000)",
          x = "Above Ground Living Area")
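To see what the pipe operator does on its own, apart from any graphics, it can help to try it on plain numbers. The following is a minimal sketch, assuming the magrittr package (which provides %>%) is installed; the vector named values is made up for illustration.

```r
library(magrittr)  # provides the %>% operator

values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(values))

# The same computation written as a left-to-right chain
piped <- values %>% sqrt() %>% sum()

# The left-hand side becomes the FIRST argument of the next function,
# so this is the same as round(3.14159, 2)
rounded <- 3.14159 %>% round(2)

nested   # 6
piped    # 6
rounded  # 3.14
```

The chained form does the same work as the nested form, but each step is read in the order it is carried out, which is why it suits plots that are built up layer by layer.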
Figure 3 facets the scatterplot by Kitchen Quality. In Figure 4, we overplot these graphs, and use color and shape to identify the Kitchen Quality. Both graphs allow us to look at Sale Price by Above Ground Living Area and Kitchen Quality at the same time. Often, researchers create multiple graphs to determine which best shows patterns within the data. Figure 4 allows us to more clearly see the effect of Kitchen Quality on the Sale Price. However, it is still difficult to see many of the points.
gf_point(SalePrice/100000 ~ GrLivArea, data = AmesHousing,
         shape = ~ KitchenQuality, color = ~ KitchenQuality) %>%
  gf_lm() %>%
  gf_labs(title = "Figure 4: Housing Prices in Ames, Iowa",
          y = "Sale Price (in $100,000)",
          x = "Above Ground Living Area")
gf_lm() adds a linear model to the graph in Figure 4. If you use gf_smooth(), a smooth curve will be fit to the data.
The gf_facet_grid and gf_facet_wrap commands are used to create multiple plots. Try incorporating gf_facet_wrap( ~ KitchenQuality) or gf_facet_grid(Fireplaces ~ KitchenQuality) into a scatterplot to see how it separates each graph by a categorical variable. Note that gf_facet_grid(KitchenQuality ~ . ) indicates that we should facet along the y axis (one row of panels for each kitchen quality) and not along the x axis.
When characteristics are fixed (such as color or size), fixed names are given in quotes, such as color = "green". When characteristics are dependent on the data, the command should occur without quotes, such as color = ~ KitchenQuality.
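The two facet commands can be compared on a small example. This is a sketch only: the toy data frame below is hypothetical (made-up values that contain every combination of kitchen quality and fireplace count), and it assumes the ggformula, ggplot2, and magrittr packages are installed.

```r
library(ggformula)
library(magrittr)  # loaded explicitly so that %>% is available

# Hypothetical toy data: every combination of kitchen quality and fireplaces
toy <- expand.grid(
  KitchenQuality = c("Average", "Good", "Excellent"),
  Fireplaces     = c(0, 1),
  GrLivArea      = c(1000, 2000)
)
toy$SalePrice <- 100000 + 60 * toy$GrLivArea + 15000 * toy$Fireplaces

base_plot <- gf_point(SalePrice ~ GrLivArea, data = toy)

# One panel per kitchen quality
p_wrap <- base_plot %>% gf_facet_wrap(~ KitchenQuality)

# Rows defined by Fireplaces, columns defined by KitchenQuality
p_grid <- base_plot %>% gf_facet_grid(Fireplaces ~ KitchenQuality)

# Count the panels that contain points in each plot
n_wrap <- length(unique(ggplot2::layer_data(p_wrap)$PANEL))
n_grid <- length(unique(ggplot2::layer_data(p_grid)$PANEL))

n_wrap  # 3 panels, one per kitchen quality
n_grid  # 6 panels, 2 fireplace rows by 3 kitchen quality columns
```

Printing p_wrap and p_grid shows the layouts side by side: the wrap version has a single faceting variable, while the grid version crosses two variables into rows and columns.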
The previous examples focused on scatterplots. The ggformula description lists multiple graphs and gives detailed examples of how they are used. You can also use the tab completion feature in RStudio (type gf_ and hit the Tab key on your console) to see options for most graphs.
In this section we will build upon the same y ~ x format for every ggformula plot and demonstrate several additional types of graphs that can be made.
The formula y ~ x | z can be read as: y is modeled by (or depends on) x differently for each z. This is just like using the facet commands. For example, the following code creates a jitter plot that incorporates two explanatory variables, KitchenQuality and Fireplaces. We then use the pipe operator to add boxplots and labels.
gf_jitter(log(SalePrice) ~ KitchenQuality | Fireplaces, data = AmesHousing, color = "lightblue") %>%
  gf_boxplot(alpha = .05) %>%
  gf_labs(title = "Figure 5: Housing Prices in Ames, Iowa",
          subtitle = "Faceted by Number of Fireplaces",
          y = "Log(Sale Price)")
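One way to convince yourself that the | part of a formula behaves like a facet command is to build the same plot both ways and compare the panels. The sketch below does this on a hypothetical toy data frame (made-up values with AmesHousing-style column names), assuming ggformula, ggplot2, and magrittr are installed.

```r
library(ggformula)
library(magrittr)  # loaded explicitly so that %>% is available

# Hypothetical toy data (values are made up)
toy <- data.frame(
  SalePrice  = c(110000, 135000, 160000, 190000, 220000, 250000),
  GrLivArea  = c(900, 1200, 1400, 1700, 2100, 2500),
  Fireplaces = c(0, 0, 1, 1, 2, 2)
)

# Conditioning variable written inside the formula
p_formula <- gf_point(SalePrice ~ GrLivArea | Fireplaces, data = toy)

# The same panels requested with a separate facet layer
p_layer <- gf_point(SalePrice ~ GrLivArea, data = toy) %>%
  gf_facet_wrap(~ Fireplaces)

# Count the panels drawn by each version
panels_formula <- length(unique(ggplot2::layer_data(p_formula)$PANEL))
panels_layer   <- length(unique(ggplot2::layer_data(p_layer)$PANEL))

panels_formula  # 3
panels_layer    # 3
```

Both versions split the toy data into one panel per fireplace count, so the choice between them is mostly a matter of which reads more clearly in your code.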
When we are working with univariate graphs, we still keep our structure of y ~ x | z. For example, if we want a histogram of SalePrice, we only have one x variable, so it is coded gf_histogram(~ SalePrice, data = AmesHousing) to indicate that there is no y variable in our graph. In the code below we create a histogram for SalePrice, colored for each level of KitchenQuality.
gf_histogram(~ SalePrice, data = AmesHousing, color = ~ KitchenQuality, fill = ~ KitchenQuality)
On Your Own
How would you modify the histogram code to create distinct histograms of the sale price for each level of kitchen quality?
You may notice in the earlier graphs that there are very few houses with more than two fireplaces and only one house that was categorized as having poor kitchen quality. Before going further we will remove these rare points using the dplyr package. These functions are described in a separate tutorial (see examples in the Stat2Labs Website).
AmesHousing2 <- AmesHousing %>%
  filter(Fireplaces < 3) %>%
  filter(KitchenQuality != "Poor")
tally(Fireplaces ~ KitchenQuality, data = AmesHousing2)
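If the filter() step is unfamiliar, the sketch below shows how it behaves on a tiny, hypothetical data frame (made-up values), assuming the dplyr package is installed. It also shows that two chained filters keep the same rows as one filter with both conditions, and uses base R's table() as a stand-in for the cross-tabulation.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  Fireplaces     = c(0, 1, 2, 3, 1, 4, 0),
  KitchenQuality = c("Good", "Poor", "Average", "Good", "Excellent", "Poor", "Fair")
)

# Two chained filters, in the same style as the tutorial code
chained <- toy %>%
  filter(Fireplaces < 3) %>%
  filter(KitchenQuality != "Poor")

# One filter call: conditions separated by commas are combined with AND
combined <- toy %>%
  filter(Fireplaces < 3, KitchenQuality != "Poor")

# Cross-tabulate the rows that remain
counts <- table(chained$Fireplaces, chained$KitchenQuality)

nrow(chained)   # 4 rows remain
nrow(combined)  # 4 rows remain
```

Only rows that satisfy every condition are kept, so the houses with three or more fireplaces and the houses with a poor kitchen rating are both dropped.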
By removing rare points, Figure 5 is now easier to read. We no longer have an entire panel representing just one data point. Modify the code below to have Fireplaces on the x-axis and facet by Kitchen Quality.
# Starting code: KitchenQuality on the x-axis, faceted by Fireplaces
gf_jitter(log(SalePrice) ~ KitchenQuality | Fireplaces, data = AmesHousing2, color = "lightblue") %>%
  gf_boxplot(alpha = .05) %>%
  gf_labs(title = "Figure 5B: Housing Prices in Ames, Iowa",
          subtitle = "Faceted by Number of Fireplaces",
          y = "Log(Sale Price)")
# Modified code: Fireplaces on the x-axis, faceted by KitchenQuality
gf_jitter(log(SalePrice) ~ Fireplaces | KitchenQuality, data = AmesHousing2, color = "lightblue") %>%
  gf_boxplot(alpha = .05) %>%
  gf_labs(title = "Figure 5B: Housing Prices in Ames, Iowa",
          subtitle = "Faceted by KitchenQuality",
          y = "Log(Sale Price)")
A graph using Fireplaces as our x-variable no longer creates separate boxplots for each level of the Fireplaces variable. The reason for this is that in this data frame, each element of the Fireplaces column is considered an integer value. When a variable is quantitative, some graphs cannot identify its specific levels. Before we can create distinct boxplots based upon the number of fireplaces, we need to treat Fireplaces as a factor (a qualitative variable).
If you use the str command, you will notice that each variable is assigned one of the following types of data:
dbl: doubles or real numbers
chr: character values (strings of letters or symbols)
lgl: logical values that contain only TRUE or FALSE
fctr: factors (categorical variables with fixed values)
In the following code, we use the dplyr package to create a new variable, called Fireplace2. The as.factor command creates a factor, which is a variable that contains a set of numeric codes with character-valued levels. Try recreating Figure 5 below, using Fireplace2 on the x-axis and faceting by KitchenQuality.
AmesHousing2 <- mutate(AmesHousing2, Fireplace2 = as.factor(Fireplaces))
str(AmesHousing2)
AmesHousing2 <- mutate(AmesHousing2, Fireplace2 = as.factor(Fireplaces))
gf_jitter(log(SalePrice) ~ Fireplace2 | KitchenQuality, data = AmesHousing2, color = "lightblue") %>%
  gf_boxplot(alpha = .05) %>%
  gf_labs(title = "Figure 5B: Housing Prices in Ames, Iowa",
          subtitle = "Faceted by KitchenQuality",
          y = "Log(Sale Price)")
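The description of a factor as numeric codes with character-valued levels can be seen directly in base R. The short sketch below uses a made-up vector of fireplace counts (not the AmesHousing data) and needs no extra packages.

```r
# Hypothetical fireplace counts, stored as integers
fireplaces <- c(0L, 1L, 2L, 1L, 0L)
class(fireplaces)            # "integer"

# Convert to a factor: the distinct values become character levels
fireplace2 <- as.factor(fireplaces)
class(fireplace2)            # "factor"
levels(fireplace2)           # "0" "1" "2"

# Internally the factor stores codes 1, 2, 3 that point to the levels,
# so these are positions in levels(), not the original counts
codes <- as.integer(fireplace2)
codes                        # 1 2 3 2 1

# To get the original numbers back, go through the level labels
recovered <- as.numeric(as.character(fireplace2))
recovered                    # 0 1 2 1 0
```

Because the factor has a fixed set of levels, graphing functions can draw one boxplot or one fill color per level, which is not possible when the same column is treated as a continuous number.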
Modify the code below to create a density plot using the new categorical variable Fireplace2 instead of the integer valued term, Fireplaces. Notice that the fill command now works properly for Fireplace2, but not for Fireplaces.
AmesHousing2 <- mutate(AmesHousing2, Fireplace2 = as.factor(Fireplaces))
gf_density(~ SalePrice, data = AmesHousing2, color = ~ Fireplaces, fill = ~ Fireplaces)
Typing in a function will show you what attributes can be used. For example, if you type gf_histogram() in one of the code chunks above, a summarized output will be displayed showing that, in addition to fill, additional attributes such as size can be used.
Specific details can be found for any function in ggformula. For example, typing ?gf_histogram() in one of the code chunks will bring you to a new website with examples and more details about this function.
In the above examples, only a few functions are listed. The ggformula description lists each function and gives detailed examples of how they are used. You can also use the tab completion feature in RStudio (type gf_ and hit the Tab key on your console) to see options for most graphs.
In order to complete this activity, you will need to use the dplyr package to manipulate the dataset before making any graphics. If you have not yet worked with the dplyr package, hints are provided to help with the first several items listed below.
Restrict the AmesHousingFull data to only sales under normal conditions. In other words, Condition1 == Norm. How many homes were sold under normal conditions?
Create a new variable TotalSqFt = GrLivArea + TotalBsmtSF and remove any homes with more than 5000 total square feet. How many homes sold under normal conditions were greater than 5000 total square feet?
Create a new variable, Fireplace, where No indicates no fireplaces in the home and Yes indicates at least one fireplace in the home. How many of the remaining houses have at least one fireplace?
tally(~ Condition1, data = AmesHousingFull)
AmesHousingFull <- filter(AmesHousingFull, Condition1 == "Norm")
dim(AmesHousingFull)

AmesHousingFull <- mutate(AmesHousingFull, TotalSqFt = GrLivArea + TotalBsmtSF)
AmesHousingFull <- filter(AmesHousingFull, TotalSqFt <= 5000)
dim(AmesHousingFull)

AmesHousingFull <- mutate(AmesHousingFull, Fireplace = ifelse(Fireplaces == 0, "No", "Yes"))
tally(~ Fireplace, data = AmesHousingFull)
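The ifelse() recoding used in the last hint can be tried on a few made-up values before applying it to the full dataset. This is a minimal base R sketch; the vector below is hypothetical and does not come from AmesHousingFull.

```r
# Hypothetical fireplace counts for five homes
fireplaces <- c(0, 2, 1, 0, 3)

# ifelse() checks each element: "No" where the count is 0, "Yes" otherwise
fireplace <- ifelse(fireplaces == 0, "No", "Yes")
fireplace          # "No" "Yes" "Yes" "No" "Yes"

# Count how many homes fall in each group
counts <- table(fireplace)
counts[["No"]]     # 2
counts[["Yes"]]    # 3
```

The same pattern works inside mutate(), where the test is applied to every row of the Fireplaces column at once.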
https://www.rdocumentation.org/packages/ggformula/versions/0.7.0: A full description of the ggformula package.
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf: Data Visualization with ggplot2 Cheat Sheet. Note that this sheet lists several shape scales and color scales that can be used within ggformula.