There are several different systems for creating data visualizations in R. We will introduce ggplot2, which is based on Leland Wilkinson's Grammar of Graphics. The learning curve is a bit steep, but ultimately you'll be able to produce complex graphs more quickly and easily.
You first need to install the ggplot2 package:
install.packages("ggplot2")
You then need to load the package:
library(ggplot2)
We'll consider the gapminder data from the last lesson. If it's not within your R workspace, load it again with
gapminder <- read.csv("~/Desktop/gapminder.csv")
An initial bit of code, to make a scatterplot:
ggplot(gapminder, aes(x=gdpPercap, y=lifeExp)) + geom_point()
Two key concepts in the grammar of graphics: aesthetics map features of the data (for example, the lifeExp variable) to features of the visualization (for example, the y-axis coordinate), and geoms concern what actually gets plotted (here, each data point becomes a point in the plot).
Another key aspect of ggplot2: the ggplot() function creates a graphics object; additional controls are added with the + operator. The actual plot is made when the object is printed.
The following is equivalent to the code above. The actual plot isn't created until the p2 object is printed. (When you type an object's name at the R prompt, it gets printed, and that's the usual way that these plots get created.)
p1 <- ggplot(gapminder, aes(x=gdpPercap, y=lifeExp))
p2 <- p1 + geom_point()
print(p2)
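To see that the plot really is built up as an object, you can inspect it before printing. The following is a minimal sketch; the toy data frame is made up for illustration (it is not the gapminder data), and peeking at the layers element is just one convenient way to look inside the object.

```r
library(ggplot2)

# Toy stand-in for the gapminder data (made-up values, for illustration only)
toy <- data.frame(
  gdpPercap = c(400, 1200, 5000, 20000),
  lifeExp   = c(45, 55, 68, 78)
)

p1 <- ggplot(toy, aes(x = gdpPercap, y = lifeExp))  # data and aesthetics, no layers yet
p2 <- p1 + geom_point()                             # a new object with one layer; p1 is unchanged

length(p1$layers)  # number of layers stored in p1
length(p2$layers)  # number of layers stored in p2

print(p2)          # the plot is only drawn at this point
```

Printing p1 on its own would give an empty set of axes, since it has aesthetics but no geoms.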
GDP per capita spans several orders of magnitude, so it's best to put the x-axis on a log scale.
ggplot(gapminder, aes(x=gdpPercap, y=lifeExp)) + geom_point() + scale_x_log10()
We could also have used the following:
p2 + scale_x_log10()
Scales control the aesthetics (the mapping between a column in the data and a feature on the plot). More on this later.
Make a scatterplot of lifeExp vs gdpPercap using only the data for China.
library(dplyr)
gm_china <- filter(gapminder, country=="China")
ggplot(gm_china, aes(x=gdpPercap, y=lifeExp)) + geom_point() + scale_x_log10()
For a scatterplot, additional aesthetics include shape, size, and color.
For example, we might make our scatterplot for all countries, with data from 1952, and then color the points according to the continent.
gm_1952 <- filter(gapminder, year==1952)
ggplot(gm_1952, aes(x=gdpPercap, y=lifeExp)) + geom_point() + scale_x_log10() + aes(color=continent)
Note that we could have put color=continent within the call to ggplot(): the following is equivalent to the above.
ggplot(gm_1952, aes(x=gdpPercap, y=lifeExp, color=continent)) + geom_point() + scale_x_log10()
Try out the color aesthetic, both with categorical variables (such as continent) and with numeric variables.
You can use geom_line to make a line plot, for example, for China:
p <- ggplot(filter(gapminder, country=="China"), aes(x=gdpPercap, y=lifeExp))
p + geom_line()
You can use both geom_line and geom_point to make a line plot with points at the data values.
p + geom_line() + geom_point()
This brings up another important concept with ggplot2: layers. A given plot can have multiple layers of geometric objects, plotted one on top of the other.
If we make the lines and points different colors, we can see that the points are placed on top of the lines, since they are in the second layer.
p + geom_line(color="lightblue") + geom_point(color="violetred")
If we switch the order of geom_point() and geom_line(), we'll reverse the layers.
p + geom_point(color="violetred") + geom_line(color="lightblue")
Note that aesthetics that are included in the call to ggplot() (or added completely separately) become the defaults for all layers, but we can separately control the aesthetics for each layer. For example, we could color the points by year:
p + geom_line() + geom_point(aes(color=year))
Compare that result to the following:
p + geom_line() + geom_point() + aes(color=year)
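One way to see the difference between the two versions is to look at where the color mapping ends up in each plot object. This is a sketch on made-up toy data (not the gapminder values); note that ggplot2 stores the aesthetic under its British spelling, colour.

```r
library(ggplot2)

# Toy data (made-up values, for illustration only)
toy <- data.frame(
  year      = c(1952, 1977, 2007),
  gdpPercap = c(400, 740, 4900),
  lifeExp   = c(44, 64, 73)
)
p <- ggplot(toy, aes(x = gdpPercap, y = lifeExp))

# Color mapped in the point layer only: the line stays black
p_layer <- p + geom_line() + geom_point(aes(color = year))

# Color added as a plot-wide default: it applies to the line as well
p_global <- p + geom_line() + geom_point() + aes(color = year)

names(p_layer$mapping)              # plot-wide aesthetics of the first version
names(p_layer$layers[[2]]$mapping)  # aesthetics specific to its point layer
names(p_global$mapping)             # plot-wide aesthetics of the second version
```

In the first version the color mapping lives in the second layer; in the second it sits with x and y among the plot-wide defaults, so every layer inherits it.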
Make a plot of lifeExp vs gdpPercap for China and India, with lines in black but points colored by country.
One's first attempt at the previous challenge may look like this:
india_china <- filter(gapminder, country=="India" | country=="China")
p <- ggplot(india_china, aes(y=lifeExp, x=gdpPercap))
p + geom_line() + geom_point(aes(color=country))
With no grouping, all of the points get connected left-to-right in a single line, regardless of country, which is not what we want.
If we were to make the color=country aesthetic global, we wouldn't have this problem.
p + aes(color=country) + geom_line() + geom_point()
Alternatively, we can use the group aesthetic, which indicates that certain data points go together. This way the lines can be a constant color.
p + geom_line(aes(group=country)) + geom_point(aes(color=country))
We could also make the group aesthetic global.
p + aes(group=country) + geom_line() + geom_point(aes(color=country))
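To check what grouping does, you can look at the data that ggplot2 computes for the line layer. The sketch below uses made-up toy data with two hypothetical countries, "A" and "B", and uses layer_data() to count the groups that the line layer sees.

```r
library(ggplot2)

# Toy data for two hypothetical countries (made-up values, for illustration only)
toy <- data.frame(
  country   = rep(c("A", "B"), each = 3),
  gdpPercap = c(400, 700, 1500, 500, 900, 2000),
  lifeExp   = c(40, 50, 60, 45, 55, 70)
)
p <- ggplot(toy, aes(x = gdpPercap, y = lifeExp))

ungrouped <- p + geom_line() + geom_point(aes(color = country))
grouped   <- p + geom_line(aes(group = country)) + geom_point(aes(color = country))

# layer_data() returns the data computed for one layer; layer 1 is the line layer
n_groups_ungrouped <- length(unique(layer_data(ungrouped, 1)$group))
n_groups_grouped   <- length(unique(layer_data(grouped, 1)$group))

n_groups_ungrouped  # a single group: one line through all of the points
n_groups_grouped    # one group per country: a separate line for each
```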
We've focused so far on scatterplots, but one can also create one-dimensional summaries, such as histograms or boxplots.
For a histogram, you want only the x aesthetic, and then use binwidth to define the width of the bins. Here's a histogram of lifeExp for 2007.
gm_2007 <- filter(gapminder, year==2007)
ggplot(gm_2007, aes(x=lifeExp)) + geom_histogram(binwidth=2)
If you want to compare the distributions for the different continents, you might look at density estimates rather than histograms.
The alpha argument sets the opacity (alpha=1 is completely opaque).
ggplot(gm_2007, aes(x=lifeExp)) + geom_density(alpha=0.5) + aes(fill=continent)
Alternatively, we can look at boxplots, for which you need to define a continuous variable for y and a categorical variable for x.
ggplot(gm_2007, aes(y=lifeExp, x=continent)) + geom_boxplot()
I actually prefer a scatterplot for these data:
ggplot(gm_2007, aes(y=lifeExp, x=continent)) + geom_point()
I think it's better to include a bit of horizontal jittering.
ggplot(gm_2007, aes(y=lifeExp, x=continent)) + geom_point(position=position_jitter(width=0.1, height=0))
Explore the distribution of the population sizes of countries, by continent.
A particularly valuable feature of ggplot2 is faceting: the ability to make a series of plots, conditional on the values of some selected variables.
For example, rather than coloring points by continent, one might separate the continents into separate panels. There are two functions for this, facet_grid() and facet_wrap().
Here's the data for 2007, separated by continent.
p <- ggplot(filter(gapminder, year==2007), aes(x=gdpPercap, y=lifeExp)) + geom_point() + scale_x_log10()
p + facet_grid(~ continent)
Or we could split vertically. Note the need for the dot.
p + facet_grid(continent ~ .)
Or we could "wrap" the panels.
p + facet_wrap(~ continent)
The facet_wrap function is particularly useful if you have a lot of facets, for example, plotting by year.
p <- ggplot(gapminder, aes(x=gdpPercap, y=lifeExp)) + geom_point() + scale_x_log10()
p + aes(color=continent) + facet_wrap(~ year)
We could also facet by both year and continent.
p + facet_grid(continent ~ year)
Perhaps we might drop Oceania in this.
gm_no_oceania <- filter(gapminder, continent != "Oceania")
p <- ggplot(gm_no_oceania, aes(x=gdpPercap, y=lifeExp)) + geom_point() + scale_x_log10()
p + facet_grid(continent ~ year)
Select five countries of interest (e.g., China, India, US, France, Nigeria) and plot gdpPercap across time (with geom_line), faceting by country.
If you want to save a plot, to share with others, use the ggsave() function.
The default is to save the last plot that you created, but I think it's safer to first save the plot as an object and pass that to ggsave. Also give the height and width in inches.
p <- ggplot(gapminder, aes(x=gdpPercap, y=lifeExp)) + geom_point() + scale_x_log10()
ggsave("scatter.png", p, height=6, width=8)
The image file type is taken from the file name extension. To make a PDF instead:
ggsave("scatter.pdf", p, height=6, width=8)
Use scale to adjust the sizes of things, for example for a talk/poster versus a paper/report. Use scale < 1 to make the various elements bigger relative to the plotting area.
ggsave("scatter_2.png", p, height=6, width=8, scale=0.8)
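If you'd like to try ggsave() without cluttering your working directory, here is a small sketch that writes to R's temporary directory and checks that the files were created. The toy data and the file names are made up for illustration, and it assumes that PNG output is available on your system.

```r
library(ggplot2)

# Toy data (made-up values, for illustration only)
toy <- data.frame(
  gdpPercap = c(400, 1200, 5000, 20000),
  lifeExp   = c(45, 55, 68, 78)
)
p <- ggplot(toy, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10()

# Hypothetical file names, placed in the session's temporary directory
png_file <- file.path(tempdir(), "scatter_toy.png")
pdf_file <- file.path(tempdir(), "scatter_toy.pdf")

# The file type follows the extension; height and width are in inches
ggsave(png_file, p, height = 6, width = 8, units = "in")
ggsave(pdf_file, p, height = 6, width = 8, units = "in")

file.exists(c(png_file, pdf_file))  # confirm that both files were written
```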
Revisit a few of the plots you've made and try saving PNG and PDF files. Try different heights and widths. Try out the scale argument.
When faceting, the different panels are given common x- and y-axis limits. If we were to create separate plots (say one for each country), we would need to do a bit extra to ensure that common axis limits are used.
Recall the scale_x_log10() function that we used to create the log scale for the x-axis. It can take an argument limits (a vector of length 2) defining the minimum and maximum values plotted.
There is also a scale_y_log10() function, but if you want to change the y-axis limits without going to a log scale, you would use scale_y_continuous(). (Similarly, there's a scale_x_continuous() function.)
For example, to plot the data for China, using axis limits defined by the full data, we'd do the following:
gm_china <- filter(gapminder, country=="China")
xrange <- range(gapminder$gdpPercap)
yrange <- range(gapminder$lifeExp)
p <- ggplot(gm_china, aes(y=lifeExp, x=gdpPercap)) + geom_line()
p + scale_x_log10(limits=xrange) + scale_y_continuous(limits=yrange)
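If you want one plot per country, you might wrap this up in a small function so that every plot uses the same limits. The following is just a sketch: plot_country() is a hypothetical helper (not part of ggplot2), and the toy data with countries "A" and "B" is made up for illustration.

```r
library(ggplot2)

# Toy data for two hypothetical countries (made-up values, for illustration only)
toy <- data.frame(
  country   = rep(c("A", "B"), each = 3),
  gdpPercap = c(400, 700, 1500, 5000, 9000, 20000),
  lifeExp   = c(40, 50, 60, 65, 72, 80)
)

# Axis limits computed once, from the full data
xrange <- range(toy$gdpPercap)
yrange <- range(toy$lifeExp)

# Hypothetical helper: line plot for one country, using the shared limits
plot_country <- function(data, which_country) {
  ggplot(data[data$country == which_country, ], aes(x = gdpPercap, y = lifeExp)) +
    geom_line() +
    scale_x_log10(limits = xrange) +
    scale_y_continuous(limits = yrange) +
    ggtitle(which_country)
}

# One plot object per country, all on the same scale
plots <- lapply(unique(toy$country), function(cn) plot_country(toy, cn))
```

Printing plots[[1]] and plots[[2]] gives two separate figures that can be compared directly, since the axes match.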
Create a plot with lines for the United States and Sweden, but using the axis limits defined with the full data.
gm_us_sweden <- filter(gapminder, country %in% c("United States", "Sweden"))
xrange <- range(gapminder$gdpPercap)
yrange <- range(gapminder$lifeExp)
p <- ggplot(gm_us_sweden, aes(y=lifeExp, x=gdpPercap, color=country)) + geom_line()
p + scale_x_log10(limits=xrange) + scale_y_continuous(limits=yrange)
If you don't like the choices for point colors, you can customize them in a number of ways. First, you can use scale_color_manual() with a vector of your preferred choices. (If it's fill rather than color that you want to change, you'll need to use scale_fill_manual().)
gm_2007 <- filter(gapminder, year==2007)
p <- ggplot(gm_2007, aes(y=lifeExp, x=gdpPercap, color=continent)) + geom_point() + scale_x_log10()
colors <- c("black", "blue", "green", "red", "orange")
p + scale_color_manual(values=colors)
You can also use RGB hex values.
hexcolors <- c("#001F3F", "#0074D9", "#01FF70", "#FF851B", "#B10DC9")
p + scale_color_manual(values=hexcolors)
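With an unnamed vector, the colors are assigned to the categories in order. If you'd rather tie each color to a particular category, you can give the vector names, and scale_color_manual() will match on them. Here's a sketch with made-up toy data (one point per continent); the object names are just for illustration.

```r
library(ggplot2)

# Toy data, one made-up point per continent (for illustration only)
toy <- data.frame(
  continent = c("Africa", "Americas", "Asia", "Europe", "Oceania"),
  gdpPercap = c(1500, 9000, 5000, 25000, 30000),
  lifeExp   = c(55, 73, 70, 78, 80)
)

# Named vector: the names are matched to the continents, so the order doesn't matter
continent_colors <- c(Oceania = "orange", Europe = "red", Asia = "green",
                      Americas = "blue", Africa = "black")

p_named <- ggplot(toy, aes(y = lifeExp, x = gdpPercap, color = continent)) +
  geom_point() +
  scale_x_log10() +
  scale_color_manual(values = continent_colors)

# Look at the colors that were actually assigned to the points
built <- layer_data(p_named, 1)
built$colour
```

This can be safer when a subset of the data is missing some categories, since the remaining ones keep their colors.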
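Alternatively, you can use the color palettes from the RColorBrewer package. Install the package as follows:
install.packages("RColorBrewer")
Then display the available palettes this way:
library(RColorBrewer)
display.brewer.all()
I like the "Dark2" palette, which you can use this way: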
p + scale_color_brewer(palette="Dark2")
"Set1" is better in this case:
p + scale_color_brewer(palette="Set1")
Another source of palettes is the wesanderson package. Install the package as follows:
install.packages("wesanderson")
Here's an example using one.
library(wesanderson)
# current versions of wesanderson name this palette "Darjeeling1"
p + scale_color_manual(values=wes_palette("Darjeeling1"))
Make a set of histograms of lifeExp for the year 2007, using facet_grid to split by continent. Use different colors for the five continents, chosen manually.
The gapminder package includes a country_colors object that colors countries by population size and continent.
I'm going to use that and add a bit of fanciness.
gm_2007 <- gapminder %>% filter(year==2007) %>% arrange(-pop)
# country_colors comes from the gapminder package, which must be installed
ggplot(gm_2007, aes(y=lifeExp, x=gdpPercap, fill=country, size=sqrt(pop))) +
    geom_point(shape=21) +
    scale_x_log10() +
    scale_fill_manual(values=gapminder::country_colors) +
    facet_wrap(~continent) +
    scale_size_continuous(range=c(1, 20)) +
    guides(fill="none", size="none")
Note the use of:
shape=21 in the call to geom_point(); this makes circles, but allowing separate colors for the interior (fill) and the border (color).
dplyr::arrange() to reorder the points, so that the larger points are plotted first and the smaller ones on top.
range=c(1,20) to adjust the size of the points.
guides() to get rid of the legends.
Themes allow you to change the overall appearance of a plot.
The default theme is theme_gray(); you might also consider theme_bw() or theme_linedraw().
p <- ggplot(gapminder, aes(y=lifeExp, x=gdpPercap, color=continent)) + geom_point() + scale_x_log10() + facet_wrap(~year)
p + theme_bw()
p + theme_linedraw()
The ggthemes package includes additional themes as well as additional color scales. Install it as follows.
install.packages("ggthemes")
library(ggthemes)
p + theme_igray()
p + theme_solarized_2()
Personally, I like the default ggplot theme, but I want to remove the tick marks and add a black border. You create this as follows. (Tomorrow afternoon, we'll talk more about creating functions.)
theme_karl <- function(...) {
    theme(panel.border=element_rect(fill=NA, color="black"),
          strip.background=element_rect(fill="gray80", color="black"),
          axis.ticks.length = grid::unit(0, "cm"),
          ...)
}
Now, put it to use:
p + theme_karl()