This tutorial focuses on creating graphs that effectively communicate key patterns within a dataset. Many software packages allow the user to make basic plots, but it can be challenging to create plots that are customized to address a specific idea. While there are numerous ways to create graphs, this tutorial will focus on the R package ggplot2, created by Hadley Wickham.
The key function used in ggplot2 is ggplot(), the grammar of graphics plot. This function is different from other graphics functions because it uses a particular grammar inspired by Leland Wilkinson’s landmark book, The Grammar of Graphics, which focuses on thinking about, reasoning with, and communicating with graphics. It enables layering of independent components to create custom graphics.
Data: In this tutorial, we will use the AmesHousing data, which provides information on the sales of individual residential properties in Ames, Iowa from 2006 to 2010. The data set contains 2930 observations, and a large number of explanatory variables involved in assessing home values. A full description of this dataset can be found here.
To start, we will focus on just a few variables:
SalePrice: The sale price of the home
GrLivArea: The above ground living area in the home (in square feet)
Fireplaces: The number of fireplaces in the home
KitchenQuality: The quality rating of the kitchen (Excellent, Good, Average, Fair, or Poor)
# The code below uses the head() function to view the first four lines of the AmesHousing data.
head(AmesHousing, 4)
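ggplot functions must have at least three components: a data frame, aesthetic mappings (aes) that link variables to visual properties, and at least one geom that determines the type of plot. Thus the structure of a graphic made with ggplot() could have the following form: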
# Create a histogram of housing prices
ggplot(data=AmesHousing, mapping = aes(SalePrice)) +
  geom_histogram()
In the above code, the terms data= and mapping= are not required, but are used for clarification. For example, the following code will produce identical results: ggplot(AmesHousing, aes(SalePrice)) + geom_histogram().
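The claim that the argument names are optional can be checked directly. The sketch below is only an illustration: it assumes ggplot2 is installed and uses a small simulated data frame (demo, a hypothetical stand-in for AmesHousing). It builds the histogram both ways and compares the bins that each plot computes, using layer_data() to extract the data a layer draws.

```r
library(ggplot2)

# Hypothetical stand-in for AmesHousing: simulated sale prices only
set.seed(1)
demo <- data.frame(SalePrice = round(rnorm(200, mean = 180000, sd = 40000)))

# Named arguments
p_named <- ggplot(data = demo, mapping = aes(SalePrice)) +
  geom_histogram(bins = 20)

# Positional arguments: the data frame first, then the aesthetic mapping
p_short <- ggplot(demo, aes(SalePrice)) +
  geom_histogram(bins = 20)

# Compare the bins and counts computed for each plot
same_bins <- isTRUE(all.equal(layer_data(p_named), layer_data(p_short)))
print(same_bins)
```

Naming the arguments costs a few keystrokes but makes longer calls easier to read, which is why this tutorial keeps them in most examples.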
# Create a scatterplot of above ground living area by sales price
ggplot(data=AmesHousing, mapping= aes(x=GrLivArea, y=SalePrice)) +
  geom_point()
Create a scatterplot using ggplot with Fireplaces as the x-axis and SalePrice as the y-axis.
ggplot(data, aes(x, y)) + geom_line()
ggplot(data) + geom_line(aes(x, y))
In the first case, the aes is set as the default for all geoms. In essence, the same x and y variables are used throughout the entire graphic. However, as graphics get more complex, it is often best to create local aes mappings for each geom, as shown in the second line of code.
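To make the difference between default and local mappings concrete, here is a minimal sketch with two layers. It assumes ggplot2 is installed; demo is a simulated, hypothetical stand-in for AmesHousing. In the first plot both geoms inherit the mapping from ggplot(); in the second, each geom states its own.

```r
library(ggplot2)

# Hypothetical stand-in for AmesHousing (simulated values)
set.seed(2)
demo <- data.frame(GrLivArea = round(runif(50, 500, 4000)))
demo$SalePrice <- round(50000 + 90 * demo$GrLivArea + rnorm(50, sd = 20000))

# Default mapping: every geom inherits x and y from ggplot()
p_global <- ggplot(demo, aes(x = GrLivArea, y = SalePrice)) +
  geom_point() +
  geom_line()

# Local mappings: each geom lists its own x and y
p_local <- ggplot(demo) +
  geom_point(aes(x = GrLivArea, y = SalePrice)) +
  geom_line(aes(x = GrLivArea, y = SalePrice))

# Layer 1 is the points, layer 2 is the line; compare what each plot draws
points_match <- isTRUE(all.equal(layer_data(p_global, 1), layer_data(p_local, 1)))
lines_match  <- isTRUE(all.equal(layer_data(p_global, 2), layer_data(p_local, 2)))
print(c(points_match, lines_match))
```

Local mappings become useful when layers need different variables, for example when only the points should be colored by a categorical variable.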
Modify the scatterplot code above so that the aes is listed within the geom. The resulting graph should look identical to the one above.
It is easy to make modifications to the color, shape and transparency of the points in a scatterplot.
ggplot(data=AmesHousing) +
  geom_point(mapping = aes(x=log(GrLivArea), y=log(SalePrice), color=KitchenQuality),
             alpha = .5, shape = 1)
On Your Own
Set alpha to any value between 0 and 1. How does changing alpha modify the points on a graph?
shape = 1?
geom_point(aes(x=log(GrLivArea), y=log(SalePrice)), color = "violet", alpha = .5,shape = 10). What color corresponds to Kitchen Quality = Good?
Notice that fixed color names are given in quotes, as in color = "navy". However, if we select colors based upon a variable from our data frame, we treat it as an explanatory variable and place it within the aes part of our code.
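The distinction between a fixed color and a mapped color can be inspected in the data that ggplot2 prepares for drawing. The following is a sketch under stated assumptions: ggplot2 is installed, and demo is a simulated, hypothetical stand-in for AmesHousing with three kitchen quality levels.

```r
library(ggplot2)

# Hypothetical stand-in for AmesHousing (simulated values)
set.seed(3)
demo <- data.frame(
  GrLivArea = round(runif(60, 500, 4000)),
  KitchenQuality = rep(c("Excellent", "Good", "Average"), length.out = 60)
)
demo$SalePrice <- round(50000 + 90 * demo$GrLivArea + rnorm(60, sd = 20000))

# Fixed color: a quoted name outside aes() applies to every point
p_fixed <- ggplot(demo) +
  geom_point(aes(x = GrLivArea, y = SalePrice), color = "navy")

# Mapped color: a variable inside aes() gets one color per level and a legend
p_mapped <- ggplot(demo) +
  geom_point(aes(x = GrLivArea, y = SalePrice, color = KitchenQuality))

# The colour column of the layer data shows what will be drawn
fixed_colours  <- unique(layer_data(p_fixed)$colour)
mapped_colours <- unique(layer_data(p_mapped)$colour)
print(fixed_colours)
print(length(mapped_colours))
```

A common slip is to write aes(color = "navy"): ggplot2 then treats "navy" as a one-level variable rather than as a color name.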
The scatterplot above suffers from overplotting: many values are plotted on top of each other. We can use the alpha argument to adjust the transparency of points so that higher density regions are darker. By default, this value is set to 1 (non-transparent), but it can be changed to any number between 0 and 1, where smaller values correspond to more transparency. Another useful technique is to use facets to render a separate scatterplot for each level of an additional categorical variable, such as kitchen quality. In ggplot2, this is easily done using the facet_grid() layer.
# Create distinct scatterplots for each type of kitchen quality
ggplot(data=AmesHousing) +
  geom_point(aes(x=log(GrLivArea), y=log(SalePrice)), alpha = .5, shape = 10) +
  facet_grid(. ~ KitchenQuality)
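As a small illustration of faceting, the sketch below compares facet_grid() with facet_wrap(), which lays panels out in a wrapped sequence instead of a single row. It assumes ggplot2 is installed, and demo is a simulated, hypothetical stand-in for AmesHousing. Counting the panels confirms that each kitchen quality level gets its own scatterplot.

```r
library(ggplot2)

# Hypothetical stand-in for AmesHousing (simulated values)
set.seed(4)
demo <- data.frame(
  GrLivArea = round(runif(90, 500, 4000)),
  KitchenQuality = rep(c("Excellent", "Good", "Average"), length.out = 90)
)
demo$SalePrice <- round(50000 + 90 * demo$GrLivArea + rnorm(90, sd = 20000))

base_plot <- ggplot(demo) +
  geom_point(aes(x = log(GrLivArea), y = log(SalePrice)), alpha = 0.5)

# One row of panels, one column per kitchen quality level
p_grid <- base_plot + facet_grid(. ~ KitchenQuality)

# Panels wrapped into two columns
p_wrap <- base_plot + facet_wrap(~ KitchenQuality, ncol = 2)

# Each point is assigned to a panel; count the distinct panels
panels_grid <- length(unique(layer_data(p_grid)$PANEL))
panels_wrap <- length(unique(layer_data(p_wrap)$PANEL))
print(c(panels_grid, panels_wrap))
```

facet_grid() suits comparisons along one shared axis, while facet_wrap() tends to use space better when a variable has many levels.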
In the following code, we layer additional components onto the two graphs shown in the previous section.
ggplot(data=AmesHousing) +
  geom_histogram(mapping = aes(SalePrice/100000),
                 breaks=seq(0, 7, by = 1), col="red", fill="lightblue") +
  geom_density(mapping = aes(x=SalePrice/100000, y = (..count..))) +
  labs(title="Figure 1: Housing Prices in Ames, Iowa (in $100,000)",
       x="Sale Price of Individual Homes")
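geom_density overlays a density curve on top of the histogram.
Because the histogram is drawn on a count scale, we use y = (..count..) to modify the density so that both layers share the same scale. Alternatively, we could specify aes(x = SalePrice/100000, y = (..density..)) in the histogram geom.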
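The rescaling can be checked numerically. The sketch below assumes ggplot2 version 3.3.0 or later, where after_stat(count) is the current spelling of (..count..); demo is a simulated, hypothetical stand-in for AmesHousing. The count variable computed by the density layer equals the estimated density multiplied by the number of observations, which puts the curve on the same scale as a histogram with bins of width 1.

```r
library(ggplot2)

# Hypothetical stand-in for AmesHousing: simulated sale prices only
set.seed(5)
demo <- data.frame(SalePrice = round(rnorm(300, mean = 180000, sd = 40000)))

p <- ggplot(demo) +
  geom_histogram(aes(x = SalePrice / 100000),
                 breaks = seq(0, 7, by = 1), color = "red", fill = "lightblue") +
  # after_stat(count) is the newer notation for (..count..)
  geom_density(aes(x = SalePrice / 100000, y = after_stat(count)))

# Layer 2 is the density curve; inspect the values it computes
dens <- layer_data(p, 2)
count_is_scaled_density <- isTRUE(all.equal(dens$count, dens$density * nrow(demo)))
curve_uses_count        <- isTRUE(all.equal(dens$y, dens$count))
print(c(count_is_scaled_density, curve_uses_count))
```

With a bin width other than 1, the count curve would also need to be multiplied by the bin width to line up with the bars.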
In the code below we create three scatterplots of the log of the above ground living area by the log of sales price.
ggplot(data=AmesHousing, aes(x=log(GrLivArea), y=log(SalePrice))) +
  geom_point(shape = 3, color = "darkgreen") +
  geom_smooth(method=lm, color="green") +
  labs(title="Figure 2: Housing Prices in Ames, Iowa")
ggplot(data=AmesHousing) +
  geom_point(aes(x=log(GrLivArea), y=log(SalePrice), color=KitchenQuality),
             shape=2, size=2) +
  geom_smooth(aes(x=log(GrLivArea), y=log(SalePrice), color=KitchenQuality),
              method=loess, size=1) +
  labs(title="Figure 3: Housing Prices in Ames, Iowa")
ggplot(data=AmesHousing) +
  geom_point(mapping = aes(x=log(GrLivArea), y=log(SalePrice), color=KitchenQuality)) +
  geom_smooth(mapping = aes(x=log(GrLivArea), y=log(SalePrice), color=KitchenQuality),
              method=lm, se=FALSE, fullrange=TRUE) +
  facet_grid(. ~ Fireplaces) +
  labs(title="Figure 4: Housing Prices in Ames, Iowa")
geom_point is used to create a scatterplot. As shown in Figure 2, multiple shapes can be used as points. The Data Visualization Cheat Sheet lists several shape options.
geom_smooth adds a fitted line through the data.
method=lm specifies a linear regression line.
method=loess creates a smooth fit curve.
se=FALSE removes the shaded confidence regions around each line.
fullrange=TRUE extends each regression line across the full range of the plot, so that all lines have the same length.
facet_grid and facet_wrap commands are used to create multiple plots. In Figure 4, we have created separate scatterplots based upon the number of fireplaces.
When assigning fixed characteristics (such as color, shape, or size), the commands occur outside the aes, as in Figure 2, color="green". When characteristics are dependent on the data, the command should occur within the aes, such as in Figure 3, color=KitchenQuality.
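To see what geom_smooth(method = lm) draws, the sketch below compares the smoothed line with a model fitted directly by lm(). It assumes ggplot2 is installed; demo, logArea, and logPrice are hypothetical names for simulated data standing in for the logged AmesHousing variables.

```r
library(ggplot2)

# Hypothetical stand-in for AmesHousing (simulated values, already logged)
set.seed(6)
demo <- data.frame(logArea = log(runif(120, 500, 4000)))
demo$logPrice <- 6 + 0.8 * demo$logArea + rnorm(120, sd = 0.2)

p <- ggplot(demo, aes(x = logArea, y = logPrice)) +
  geom_point(shape = 3, color = "darkgreen") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "green")

# Fit the same simple linear regression by hand
fit <- lm(logPrice ~ logArea, data = demo)

# Layer 2 holds the x positions and fitted y values of the smooth line
smooth_line <- layer_data(p, 2)
by_hand <- as.numeric(predict(fit, newdata = data.frame(logArea = smooth_line$x)))

line_matches_lm <- isTRUE(all.equal(smooth_line$y, by_hand))
print(line_matches_lm)
```

Setting se = FALSE only hides the confidence band; the fitted line itself is unchanged.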
In the above examples, only a few geoms are listed. The ggplot2 website lists each geom and gives detailed examples of how they are used.
Create a histogram of the above ground living area, GrLivArea. You may get a warning related to bin size. Try specifying the breaks for the histogram, breaks=seq(0, 6000, by = 1000) or breaks=seq(0, 6000, by = 100).
# A histogram of above ground living area
ggplot(data=AmesHousing) +
  geom_histogram(mapping = aes(GrLivArea))
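The effect of the two suggested breaks settings can be sketched as follows. The example assumes ggplot2 is installed, and demo is a simulated, hypothetical stand-in for AmesHousing. Supplying breaks explicitly also avoids the message about the default number of bins.

```r
library(ggplot2)

# Hypothetical stand-in for AmesHousing: simulated living areas only
set.seed(7)
demo <- data.frame(GrLivArea = round(runif(200, 500, 4000)))

# Wide bins: one bar per 1000 square feet
p_wide <- ggplot(demo) +
  geom_histogram(aes(GrLivArea), breaks = seq(0, 6000, by = 1000))

# Narrow bins: one bar per 100 square feet
p_narrow <- ggplot(demo) +
  geom_histogram(aes(GrLivArea), breaks = seq(0, 6000, by = 100))

# Each row of the layer data is one bin with its count
bins_wide   <- layer_data(p_wide)
bins_narrow <- layer_data(p_narrow)
print(c(nrow(bins_wide), nrow(bins_narrow)))
print(c(sum(bins_wide$count), sum(bins_narrow$count)))
```

Both versions account for every observation; only the level of detail changes. Wider bins give a smoother outline, while narrower bins reveal local structure along with more noise.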
Create a scatterplot using Fireplaces as the explanatory variable and SalePrice as the response variable. Include a regression line, a title, and labels for the x and y axes.
# Create a scatterplot of sale price by number of fireplaces
ggplot(data=AmesHousing, aes(x=Fireplaces, y=SalePrice)) +
  geom_point() +
  geom_smooth(method=lm) +
  labs(title="Housing Prices in Ames, Iowa", x="Fireplaces", y = "Sale Price")
If you use the str command after reading data into R, you will notice that each variable is assigned one of the following types: Character, Numeric (real numbers), Integer, Complex, or Logical (TRUE/FALSE). In particular, the variable Fireplaces is considered an integer. In the code below we try to color and fill a density graph by an integer value. Notice that the color and fill commands appear to be ignored in the graph.
# str(AmesHousing)
ggplot(data=AmesHousing) +
  geom_density(aes(SalePrice, color = Fireplaces, fill = Fireplaces))
In the following code, we use the dplyr package to modify the AmesHousing data; we first restrict the dataset to only houses with fewer than three fireplaces and then create a new variable, called Fireplace2. The as.factor command creates a factor, which is a variable that contains a set of numeric codes with character-valued levels. Notice that the color and fill commands now work properly.
# Load dplyr for filter() and mutate()
library(dplyr)
# Create a new data frame with only houses with fewer than 3 fireplaces
AmesHousing2 <- filter(AmesHousing, Fireplaces < 3)
# Create a new variable called Fireplace2
AmesHousing2 <- mutate(AmesHousing2, Fireplace2 = as.factor(Fireplaces))
# str(AmesHousing2)
ggplot(data=AmesHousing2) +
  geom_density(aes(SalePrice, color = Fireplace2, fill = Fireplace2), alpha = 0.2)
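The reason the factor version works can be seen in how ggplot2 groups the data. The sketch below is an illustration under assumptions: ggplot2 is installed, demo is a simulated, hypothetical stand-in for AmesHousing, and the factor is added with base R assignment instead of dplyr's mutate() to keep the example dependency-free. An integer variable does not split the data into groups, so a single density curve is drawn; a factor produces one curve per level.

```r
library(ggplot2)

# Hypothetical stand-in for AmesHousing (simulated values)
set.seed(8)
demo <- data.frame(
  Fireplaces = rep(0:2, each = 50),  # stored as integer
  SalePrice  = round(rnorm(150, mean = 180000, sd = 40000))
)
print(class(demo$Fireplaces))

# Base R equivalent of mutate(): add a factor version of the variable
demo$Fireplace2 <- as.factor(demo$Fireplaces)
print(levels(demo$Fireplace2))

p_int <- ggplot(demo) +
  geom_density(aes(SalePrice, fill = Fireplaces))
p_fac <- ggplot(demo) +
  geom_density(aes(SalePrice, fill = Fireplace2), alpha = 0.2)

# Count the groups (separate curves) in each plot.
# suppressWarnings(): some ggplot2 versions warn that the integer fill is dropped.
groups_int <- length(unique(suppressWarnings(layer_data(p_int))$group))
groups_fac <- length(unique(layer_data(p_fac)$group))
print(c(groups_int, groups_fac))
```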
Customizing graphs: In addition to using a data frame, geoms, and aes, several additional components can be added to customize each graph, such as: stats, scales, themes, positions, coordinate systems, labels, and legends. We will not discuss all of these components here, but the materials in the references section provide detailed explanations. In the code below we provide a few examples on how to customize graphs.
ggplot(AmesHousing2, aes(x = Fireplace2, y = SalePrice, color = KitchenQuality)) +
  geom_boxplot(position = position_dodge(width = 1)) +
  coord_flip() +
  labs(title="Housing Prices in Ames, Iowa") +
  theme(plot.title = element_text(family = "Trebuchet MS", color = "blue",
                                  face="bold", size=12, hjust=0))
position is used to address geoms that would take the same space on a graph. In the above boxplot, position_dodge(width = 1) adds a space between each box. For scatterplots, position = position_jitter() puts spaces between overlapping points.
theme is used to change the style of a graph, but does not change the data or geoms. The above code is used to modify only the title in a boxplot. A better approach for beginners is to choose among themes that were created to customize the overall graph. Common examples are theme_bw() and theme_minimal(). You can also install the ggthemes package for many more options.
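As a brief sketch of both ideas, the code below jitters a scatterplot of sale price by number of fireplaces and applies a complete theme. It assumes ggplot2 is installed; demo is a simulated, hypothetical stand-in for AmesHousing, and the jitter width of 0.2 and the seed are arbitrary choices made for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for AmesHousing (simulated values)
set.seed(9)
demo <- data.frame(
  Fireplaces = rep(0:2, each = 40),
  SalePrice  = round(rnorm(120, mean = 180000, sd = 40000))
)

p <- ggplot(demo, aes(x = Fireplaces, y = SalePrice)) +
  # Shift points horizontally by at most 0.2; the seed makes the jitter repeatable
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 42),
             alpha = 0.5) +
  labs(title = "Housing Prices in Ames, Iowa") +
  # A complete theme restyles the whole graph without touching data or geoms
  theme_minimal()

# Compare the plotted x positions with the original values
jittered  <- layer_data(p)
max_shift <- max(abs(jittered$x - demo$Fireplaces))
print(max_shift <= 0.2)
```

Setting height = 0 keeps the sale prices exact, so only the horizontal position is altered to separate overlapping points.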
Use theme_bw() instead of the given theme command in the boxplot code above. Explain how the graph changes.
Explore the available themes (type theme_ and press the Tab key to see various options) to determine what theme is the default for most graphs in ggplot.
See more tutorials at Stat2Labs.
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf: Data Visualization with ggplot2 Cheat Sheet
http://docs.ggplot2.org/current/: A well-documented list of ggplot2 components with descriptions
http://www.statmethods.net/advgraphs/ggplot2.html: Quick-R introduction to graphics
http://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf: Formal documentation of the ggplot2 package
http://www.ceb-institute.org/bbs/wp-content/uploads/2011/09/handout_ggplot2.pdf: A tutorial on ggplot2 by Hadley Wickham.
http://stackoverflow.com/tags/ggplot2: Stackoverflow, an online community to share information.