In this notebook you'll learn how to use ggplot2 to make publication-quality figures.
With Jupyter Notebook you can get a nice popup of function definitions just like you can in RStudio. Simply navigate to a cell or start a new one, and enter in ?function like you would normally. A popup will appear.
You should see an Insert dropdown menu and a Run button at the top, which let you add cells and run code or render Markdown in them. There are also very useful keyboard shortcuts for the same functions:
library(tidyverse)
library(gridExtra)
library(ggrepel)
library(maps)
A core feature of exploratory data analysis is asking questions about data and searching for answers by visualizing and modeling the data. Most questions concern what type of variation occurs within a variable or what type of covariation occurs between variables.
Base R comes with some functions to visualize your data -- base R plots might look something like this:
options(repr.plot.width=10, repr.plot.height=7)
# regular plot functions in R
plot(x=mpg$displ,y=mpg$hwy)
You can also use ggplot2 for your visualizations -- here's an example of default parameters in ggplot2:
# ggplot!
ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy))
You can also make publication-quality visualizations using ggplot2:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
labs(x="Engine displacement (L)",y="Highway fuel economy (mpg)",
title = "Fuel efficiency generally decreases with engine size",
caption = "Data from fueleconomy.gov",
subtitle = "Two seaters (sports cars) are an exception because of their light weight",
colour = "Car type"
) + theme_classic()
All plots in ggplot follow the same syntax:
ggplot(data=<DATA>) +
<GEOM_FUNCTION>(mapping=aes(<MAPPINGS>))
Let's use the head() function to look at the data we plotted in the above examples:
head(mpg) # automatically loaded when you load tidyverse
Let's break down the components of ggplot. First, note that ggplot(data=<DATA>) on its own will not actually plot any data; it only draws an empty canvas.
ggplot(mpg)
This is because we need the <GEOM_FUNCTION>(mapping=aes(<MAPPINGS>)) to tell ggplot what exactly to plot using our data. However, ggplot(data=<DATA>) + <GEOM_FUNCTION>() without any mappings doesn't work either: geom_point() fails with an error because its required x and y aesthetics are missing.
ggplot(mpg) + geom_point()
So in fact we need all of the components described in the ggplot syntax.
ggplot(mpg) + geom_point(mapping=aes(x=displ,y=hwy))
<MAPPINGS>
ggplot(data=<DATA>) +
<GEOM_FUNCTION>(mapping=aes(<MAPPINGS>))
Mappings refer to the visual properties (aesthetics) of objects in the plot, e.g. size, shape, and color. You can display points in different ways, here according to another variable such as class, by changing the values of their aesthetic properties. These values are called levels, to distinguish aesthetic values from data values.
Let's try using geom_point to make some scatter plots, modifying the mappings to change how we represent the class categories.
p1 <- ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy,color=class))
p2 <- ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy,shape=class))
p3 <- ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy,size=class))
p4 <- ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy,alpha=class))
grid.arrange(p1,p2,p3,p4,nrow=2)
So, we can represent the class data as the color, shape, size, or alpha (transparency) scales. As you can see, not all mappings lend themselves to all data: by default only 6 shape options are available (we would need 7), and alpha and size aren't recommended for discrete data.
ggplot2 automatically assigns a unique level of an aesthetic to a unique value of the variable. This process is known as scaling. It will also automatically select a scale to use with the aesthetic (i.e. continuous or discrete) as well as add a legend explaining the mapping between levels and values. That's why in the shape mapping there's no shape for suv, and why the following two pieces of code do different things:
For the color property, the constant 'blue' is mapped inside aes(), so ggplot2 treats it as a data value, assigns a single level to all of the points, and colors that level with the first default color, which is red:
ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy,color='blue'))
Here, color is placed outside the aesthetic mapping, so ggplot2 understands that we want the color of the points to be blue:
ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy),color='blue')
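One way to check the difference between mapping and setting a color is to inspect the values ggplot2 computes for each layer. The sketch below assumes only that ggplot2 is installed; the names p_mapped and p_set are illustrative and not part of the workshop code.

```r
library(ggplot2)

# Constant inside aes(): treated as a data value and sent through the colour scale
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Constant outside aes(): applied directly as the point colour
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# layer_data() returns the computed aesthetics for a layer
unique(layer_data(p_mapped)$colour)  # a single default palette colour, not "blue"
unique(layer_data(p_set)$colour)     # "blue"
```

In both cases every point gets the same colour; only the second plot uses the colour we actually asked for.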
cty is a continuous variable, so when mapped to color we get a continuous color gradient instead of distinct colors:
ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy,color=cty))
Generally, continuous scales get chosen for numerical data and discrete scales for categorical data. If your data is numeric but falls into discrete categories, you may have to use as.factor() in order to get proper levels.
If we try to map cyl to shape we get an error, because shape only works for discrete variables, even though cyl only takes the values 4, 5, 6, or 8:
ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy,shape=cyl))
We can transform cyl into a categorical variable with levels using the as.factor function:
as.factor(mpg$cyl)
Now we can try plotting again:
ggplot(data=mpg) + geom_point(mapping=aes(x=displ,y=hwy,shape=as.factor(cyl)))
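To see why the conversion matters, it can help to look at how R stores cyl before and after as.factor(). This is a minimal sketch; cyl_factor is an illustrative name.

```r
library(ggplot2)  # provides the mpg data set

class(mpg$cyl)                    # "integer": ggplot2 picks a continuous scale

cyl_factor <- as.factor(mpg$cyl)
class(cyl_factor)                 # "factor": ggplot2 picks a discrete scale
levels(cyl_factor)                # "4" "5" "6" "8"
```

Each factor level then receives its own shape in the plot.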
Note that this means x and y are aesthetic mappings as well. In fact without them you will get an error.
ggplot(data=mpg) + geom_point()
<GEOM_FUNCTION>
ggplot(data=<DATA>) +
<GEOM_FUNCTION>(mapping=aes(<MAPPINGS>))
A geom is the geometrical object that a plot uses to represent data. Bar charts use bar geoms, line charts use line geoms, scatterplots use point geoms, etc. The full list of geoms provided with ggplot2 can be seen in the ggplot2 reference. Other packages provide additional geoms.
Every geom function in ggplot2 takes a mapping argument, and each geom understands a specific set of aesthetic mappings. Not every aesthetic will work with every geom. For example, you can set the shape of a point but not the shape of a line. However, you can set the linetype of a line.
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
We can also map linetype to drv and see that the data has been separated into three lines based on drivetrain: 4 (4wd), f (front), r (rear):
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
You can display multiple geoms on the same plot just by adding them. Let's add geom_smooth to geom_point:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color=drv)) +
geom_smooth(mapping = aes(x = displ, y = hwy, color=drv, linetype=drv))
Geoms like geom_smooth() use a single geometric object to display multiple rows of data. If you don't want to add other distinguishing features to the geom, like color, you can use the group aesthetic (for a categorical variable) to draw multiple objects.
ggplot(data=mpg) +
geom_smooth(mapping=aes(x=displ,y=hwy,group=drv))
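As a rough check of what the group aesthetic does, the computed layer data can be inspected: one smooth should be fitted per value of drv. The sketch below states method and formula explicitly, which is an assumption made here to silence the default-method message; p_group is an illustrative name.

```r
library(ggplot2)

p_group <- ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, group = drv),
    method = "loess", formula = y ~ x  # the defaults for small data, written out explicitly
  )

# One smooth is fitted per group, so the layer data holds one group id per drv value
length(unique(layer_data(p_group)$group))
length(unique(mpg$drv))
```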
You can use ?geom_smooth to see a full list of which aesthetics geom_smooth will understand.
The ggplot() function contains the global mapping, while each geom has a local mapping.
The global mapping of displ and hwy creates the x and y axes:
ggplot(data=mpg, mapping=aes(x=displ,y=hwy))
Mapping color to class for the point geom while using the global x and y mappings:
ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) + geom_point(mapping=aes(color=class))
geom_smooth doesn't need any mapping arguments if it uses the global mapping:
ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) +
geom_point(mapping=aes(color=class))+
geom_smooth()
The second geom_smooth uses the same x and y mappings, but its data comes from no_2seaters (from the Tidyverse section of the workshop) instead:
no_2seaters <- filter(mpg, class != "2seater")
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth() +
geom_smooth(data = no_2seaters)
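The way a layer's own data argument overrides the global data can be sketched by counting the rows each layer ends up drawing. Here a second geom_point stands in for the second smooth so that each row of layer data corresponds to one observation; p_layers is an illustrative name.

```r
library(ggplot2)
library(dplyr)

no_2seaters <- filter(mpg, class != "2seater")

p_layers <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +           # layer 1: inherits the global data (mpg)
  geom_point(data = no_2seaters, shape = 1, size = 3)  # layer 2: local data replaces the global data

nrow(layer_data(p_layers, i = 1))  # one row per observation in mpg
nrow(layer_data(p_layers, i = 2))  # one row per observation in no_2seaters
```

Both layers still share the global x and y mappings; only the data source differs.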
We have gone over the minimum required syntax for ggplot, but there are additional options that can be specified to further customize your plots, such as <FACET_FUNCTION>:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
Facets can be used to create subplots, each displaying one subset of the data.
facet_wrap(): for a single variable.
facet_grid(): for two variables.
ggplot(data=mpg) +
geom_point(mapping=aes(x=displ,y=hwy)) +
facet_wrap(~ class, nrow=2)
You can use the nrow (or ncol) argument to change the arrangement of the subplots:
ggplot(data=mpg) +
geom_point(mapping=aes(x=displ,y=hwy)) +
facet_wrap(~ class, nrow=3)
ggplot(data=mpg) +
geom_point(mapping=aes(x=displ,y=hwy)) +
facet_wrap(~ class, ncol=4)
When using facet_grid, some facets might be empty because no observations have those combinations:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
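To see which facets will be empty before plotting, one option is to count the combinations of drv and cyl that actually occur in the data. This is a small sketch using dplyr::count(); combos is an illustrative name.

```r
library(ggplot2)
library(dplyr)

# Combinations of drv and cyl that actually occur, with their frequencies
combos <- count(mpg, drv, cyl)
combos

# Observed combinations versus the number of panels facet_grid(drv ~ cyl) draws
nrow(combos)
length(unique(mpg$drv)) * length(unique(mpg$cyl))
```

Any combination missing from combos corresponds to an empty panel in the grid.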
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
The stat argument specifies the algorithm used to calculate new values for a graph. Each geom has a default stat, and each stat has a default geom. Geoms like geom_point() leave the data as is, known as stat_identity(). Bar charts count the observations at each x value, known as stat_count(); histograms bin the data first and count the observations per bin. You can see the full list of stats at the ggplot2 reference under both Layer: geoms and Layer: stats.
ggplot(data=mpg) +
geom_bar(mapping=aes(x=class))
Since each stat comes with a default geom, you can use a stat to create geoms on plots as well.
ggplot(data=mpg) +
stat_count(mapping=aes(x=class))
Because stat_count() computes count and prop, you can use those as variables for mapping as well:
# after_stat(prop) replaces the older ..prop.. notation (ggplot2 >= 3.3.0)
ggplot(data=mpg) + geom_bar(mapping=aes(x=class, y=after_stat(prop),group=1))
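The variables a stat computes can be inspected directly with layer_data(). The sketch below looks at the default bar chart and checks the counts against the raw data; p_bar and bar_data are illustrative names.

```r
library(ggplot2)

p_bar <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))

# Values computed by stat_count() for each bar
bar_data <- layer_data(p_bar)
bar_data[, c("x", "count", "prop")]

# The bar heights add up to the number of rows in the data
sum(bar_data$count) == nrow(mpg)

# The same counts computed by hand
table(mpg$class)
```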
stat_summary() is associated with geom_pointrange(); by default it computes the mean and standard error of y for each x:
ggplot(data = mpg) +
stat_summary(mapping = aes(x=class,y=hwy))
You can change stat_summary() to compute the median and min/max instead:
ggplot(data = mpg) +
stat_summary(
mapping = aes(x = class, y = hwy),
# fun.min, fun.max and fun replace the deprecated fun.ymin, fun.ymax and fun.y (ggplot2 >= 3.3.0)
fun.min = min,
fun.max = max,
fun = median
)
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
Each geom also comes with a default position adjustment, specified by the position argument. For geoms like geom_point() it is "identity", which leaves each position as is.
Bar charts additionally have a fill aesthetic. If fill is mapped to another variable, the bars are automatically stacked by the "stack" position. You can see the list of positions at the ggplot2 reference.
p1 <- ggplot(data = mpg, mapping=aes(x=class,fill=as.factor(cyl)))
p1 + geom_bar()
position = "identity" will place each object exactly where it falls in the context of the graph. This isn't very useful for bar charts, because the bars overlap; it is better suited to scatterplots.
p1 + geom_bar(position="identity", alpha=0.2)
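position = "fill" works like stacking but makes each set of stacked bars the same height, which makes it easier to compare proportions across groups.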
p1 + geom_bar(position="fill")
position = "dodge" places objects directly beside one another, which can make it easier to compare individual values.
p1 + geom_bar(position="dodge")
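The effect of a position adjustment also shows up in the computed bar coordinates. As a rough sketch, the tallest bar top (ymax) can be compared between the default "stack" position and "fill"; p_cyl, stack_data and fill_data are illustrative names.

```r
library(ggplot2)

p_cyl <- ggplot(data = mpg, mapping = aes(x = class, fill = as.factor(cyl)))

stack_data <- layer_data(p_cyl + geom_bar())                   # default position is "stack"
fill_data  <- layer_data(p_cyl + geom_bar(position = "fill"))

max(stack_data$ymax)  # height of the tallest stack, i.e. the largest class count
max(fill_data$ymax)   # every stack is rescaled to a total height of 1
```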
For geom_point one possible position is "jitter", which adds a small amount of random noise to each point. This spreads the points out, so that they are unlikely to overlap and get plotted over each other. For example, the majority of observations might share a single combination of hwy and displ, but because they are all plotted at the exact same point you can't tell. For very large datasets, jittering can help prevent overplotting so you can better see where the mass of the data lies and what trends it follows.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
This plot makes the data look quite uniform. Maybe there are multiple observations with the same values of cty and hwy creating overlapping points. Let's check:
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point(position="jitter")
position="jitter" has spread out the overlapping points for us.
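The amount of overplotting can also be estimated numerically by comparing the number of observations with the number of distinct (cty, hwy) positions. This is a minimal sketch; n_obs and n_positions are illustrative names.

```r
library(ggplot2)
library(dplyr)

n_obs <- nrow(mpg)                            # points we asked ggplot2 to draw
n_positions <- nrow(distinct(mpg, cty, hwy))  # distinct locations on the plot

n_obs
n_positions
```

When n_positions is much smaller than n_obs, many points sit exactly on top of each other in the unjittered plot.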
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
The default coordinate system is Cartesian.
coord_flip(): switches the x and y axes.
coord_quickmap(): sets the aspect ratio for maps.
coord_polar(): sets polar coordinates.
p <- ggplot(data = mpg, mapping = aes(x = class, y = hwy))
p + geom_boxplot()
You can use coord_flip() to flip the coordinates:
p + geom_boxplot() + coord_flip()
You can also reorder the x axis from lowest to highest median hwy mileage, which might allow easier comparisons:
ggplot(data = mpg, mapping = aes(x = reorder(class,hwy,FUN=median), y = hwy)) +
geom_boxplot() +
coord_flip()
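What reorder() does to the class variable can be checked outside the plot. The sketch below reorders the classes by median hwy and recomputes the medians in the new level order; class_by_median and medians are illustrative names.

```r
library(ggplot2)  # provides the mpg data set

# Factor whose levels are sorted by the median hwy of each class
class_by_median <- reorder(mpg$class, mpg$hwy, FUN = median)
levels(class_by_median)

# Medians listed in level order, so they come out in ascending order
medians <- tapply(mpg$hwy, class_by_median, median)
medians
```

ggplot2 draws the boxes in level order, which is why the boxplot appears sorted.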
You can also use geom_polygon to make some maps:
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
You can also set a suitable aspect ratio for the map with coord_quickmap():
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
You can also use polar coordinates:
bar <- ggplot(data = mpg) +
geom_bar(
mapping = aes(x = class, fill = as.factor(cyl)),
show.legend = FALSE,
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
p1 <- bar + coord_flip()
p2 <- bar + coord_polar()
grid.arrange(p1,p2, nrow=1)
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
labs(
title = "Fuel efficiency generally\n decreases with engine size",
subtitle = "Two seaters (sports cars) are an exception because of their light weight",
caption = "Data from fueleconomy.gov",
x = "Engine displacement (L)",
y = "Highway fuel economy (mpg)",
color = "Car type"
)
You can use geom_text() to add text labels on the plot.
best_in_class <- mpg %>%
group_by(class) %>%
filter(row_number(desc(hwy)) == 1)
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_text(aes(label = model), data = best_in_class)
You can also use ggrepel to add labels that avoid overlapping each other:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
ggrepel::geom_label_repel(aes(label = model), data = best_in_class) +
labs(
caption = "Data from fueleconomy.gov",
x = "Engine displacement (L)",
y = "Highway fuel economy (mpg)",
colour = "Car type"
) +
geom_point(size = 3, shape = 1, data = best_in_class)
breaks: for the position of the ticks.
labels: for the text label associated with each tick.
Specify the y-scale breaks:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(breaks = seq(15, 40, by = 5))
Remove axis tick labels:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous(labels = NULL) +
scale_y_continuous(labels = NULL)
Can also log-scale axes:
p1 <- ggplot(diamonds, aes(carat, price)) +
geom_bin2d()
ggplot(diamonds, aes(carat, price)) +
geom_bin2d() +
scale_x_log10() +
scale_y_log10()
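You could get a very similar plot by specifying log10(carat) and log10(price) in the aesthetic mapping, but the axes are then labelled with the log-transformed values rather than the original ones: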
ggplot(diamonds, aes(log10(carat), log10(price))) +
geom_bin2d()
You can also use different ggplot palettes to change the colors. Let's compare the default to the Set1 palette:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv))
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv)) +
scale_colour_brewer(palette = "Set1")
Use ?scale_colour_brewer() to see a list of palettes.
You can also manually specify which colors to use:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv)) +
scale_colour_manual(values=c(`4`="red",f="blue",r="blue"))
theme(legend.position): to control the legend position.
guides() with guide_legend() or guide_colourbar(): to control the legend display.
base <- ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))
p_left <- base + theme(legend.position = "left")
p_top <- base + theme(legend.position = "top")
p_bottom <- base + theme(legend.position = "bottom")
p_right <- base + theme(legend.position = "right")
Let's use grid.arrange to look at our plots:
grid.arrange(p_left, p_right, nrow = 2)
grid.arrange(p_top, p_bottom, nrow = 1)
Let's pull a few of these pieces together to start making our publication quality visualization:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_smooth(se = FALSE) +
theme(legend.position = "bottom") +
guides(colour = guide_legend(nrow = 1, override.aes = list(size = 4)))
There are three ways to control plot limits:
Setting xlim and ylim in coord_cartesian()
Adjusting what data are plotted
Setting the limits in each scale
You can set xlim and ylim in coord_cartesian:
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
You can also adjust what data are plotted, but note that geom_smooth will then fit its regression over the subsetted data only.
filter(mpg, displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) %>%
ggplot(aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth()
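The difference between zooming and subsetting can be made concrete by counting how many observations each plot actually holds. The sketch below uses only geom_point so that each row of layer data is one observation; p_zoom and p_subset are illustrative names.

```r
library(ggplot2)
library(dplyr)

# Zooming: all observations stay in the plot, only the visible window changes
p_zoom <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))

# Subsetting: observations outside the window are dropped before plotting
p_subset <- mpg %>%
  filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) %>%
  ggplot(aes(displ, hwy)) +
  geom_point()

nrow(layer_data(p_zoom))    # same as nrow(mpg)
nrow(layer_data(p_subset))  # only the rows that passed the filter
```

A smooth added to p_zoom would therefore be fitted to all of mpg, while one added to p_subset would only see the filtered rows.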
You can also end up with different scales along hwy and displ if you subset the data:
suv <- mpg %>% filter(class == "suv")
compact <- mpg %>% filter(class == "compact")
ggplot(suv, aes(displ, hwy, colour = drv)) +
geom_point()
ggplot(compact, aes(displ, hwy, colour = drv)) +
geom_point()
Note that the first plot is showing 4 and r for drv, while the second is showing 4 and f for drv.
You can set the limits in each scale so that both plots share the same axes and color legend:
x_scale <- scale_x_continuous(limits = range(mpg$displ))
y_scale <- scale_y_continuous(limits = range(mpg$hwy))
col_scale <- scale_colour_discrete(limits = unique(mpg$drv))
ggplot(suv, aes(displ, hwy, colour = drv)) +
geom_point() +
x_scale +
y_scale +
col_scale
ggplot(compact, aes(displ, hwy, colour = drv)) +
geom_point() +
x_scale +
y_scale +
col_scale
ggplot2 has 8 themes by default, and you can get more from other packages like ggthemes. We generally prefer theme_classic().
base <- ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE)
p1 <- base + theme_bw()
p2 <- base + theme_light()
p3 <- base + theme_classic()
p4 <- base + theme_linedraw()
p5 <- base + theme_dark()
p6 <- base + theme_minimal()
p7 <- base + theme_void()
grid.arrange(base,p1,p2,p3,p4,p5,p6,p7,nrow=3)
ggsave(): saves the most recent plot to disk (you can also specify which plot to save if you store the plot as an object first).
tiff(): saves the next plot to disk.
postscript(): for eps files, etc.
Arguments include width, height, fonts, pointsize, and res (resolution).
p1 <- ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
labs(x="Engine displacement (L)",y="Highway fuel economy (mpg)",
title = "Fuel efficiency generally decreases with engine size",
caption = "Data from fueleconomy.gov",
subtitle = "Two seaters (sports cars) are an exception because of their light weight",
colour = "Car type"
) + theme_classic()
p1
ggsave("my_plot.pdf")
tiff("my_plot.tiff",width=7,height=5,units="in",pointsize=8,res=350)
p1
dev.off()
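Rather than relying on the most recent plot, ggsave() can be given the plot object and output size explicitly. The sketch below writes to a temporary directory; the file name and the object names p_save and out_file are placeholders, and the size values are arbitrary.

```r
library(ggplot2)

p_save <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  theme_classic()

# Hypothetical output location; replace with a path in your project
out_file <- file.path(tempdir(), "my_plot.pdf")

# plot = ... names the object to save instead of using the last plot displayed
ggsave(out_file, plot = p_save, width = 7, height = 5, units = "in")

file.exists(out_file)
```

Passing the plot explicitly keeps the saved figure independent of whatever was displayed last in the notebook.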
We don't have time in this workshop to get in depth, but here are some more useful visualization packages that may be helpful for your research.
Resources and associated packages:
Meant to provide a publication-ready theme for ggplot2 that requires a minimum amount of fiddling with sizes of axis labels, plot backgrounds, etc. Automatically sets theme_classic() for all plots.
Can be installed from Bioconductor.
Website is very comprehensive.
Now that we've gone through tidying, transforming, and visualizing data, let's review all of the different functions we've used and, in some cases, learned the inner workings of:
gather()
spread()
separate()
unite()
%>% propagates the output from a function as input to another, e.g. x %>% f(y) becomes f(x,y), and x %>% f(y) %>% g(z) becomes g(f(x,y),z).
filter() to pick observations (rows) by their values
arrange() to reorder rows, default is by ascending value
select() to pick variables (columns) by their names
mutate() to create new variables with functions of existing variables
summarise() to collapse many values down to a single summary
group_by() to set up functions to operate on groups rather than the whole data set
ggplot - global data and mappings
geom_point - geom for scatterplots
geom_smooth - geom for regressions
geom_pointrange - geom for vertical intervals defined by x, y, ymin, and ymax
geom_bar - geom for barcharts
geom_boxplot - geom for boxplots
geom_polygon - geom for polygons
aes(color) - color mapping
aes(shape) - shape mapping
aes(size) - size mapping
aes(alpha) - transparency mapping
as.factor() - transforming numerical values to categorical values with levels
facet_grid
facet_wrap
stat_count - default stat for barcharts, counts observations at each x
stat_identity - default stat for scatterplots, leaves data as is
stat_summary - stat whose default geom is pointrange, by default computes mean and se of y by x
position="identity"
position="stack"
position="fill"
position="dodge"
position="jitter"
coord_flip
coord_quickmap
coord_polar
sessionInfo()