I recently coauthored a couple of papers on trends in environmental data (Curtis and Simpson; Monteith et al.), which we estimated using GAMs. Both papers included plots like the one shown below wherein we show the estimated trend and associated point-wise 95% confidence interval, plus some other markings. The coloured sections show where the estimated trend is changing in a statistically significant manner, i.e. where a 95% confidence interval on the first derivative (rate of change) of the trend does not include 0. That particular figure and the others in the papers were drawn using the lattice package (Sarkar, 2008), but I could just as easily have used ggplot2 (Wickham, 2009) instead. I was recently asked via email how I produced the figures in the paper. Rather than just reply to that email, I thought I’d knock up a quick post for my blog to show how it was done.
For the purposes of this post, I’m not going to show how we fitted the time series models. Instead I’m just going to show some dummy data (two random walks) that illustrate how the data need to be arranged for the plotting code I’m going to use. To start then, create the dummy data we’ll use to draw some plots.
This results in the following data frame
The data.frame() call created the first four columns of tdat, where we have
- Site, a factor variable indicating the two time series in the data,
- Date, a "Date" class vector which starts from today’s date and increases daily for the next 100 days, which we replicate twice, once per Site,
- Fitted, a numeric vector holding the trend estimates from the model. Here I just use two separate random walks, but for the papers we used the output from predict() applied to the "gamm" classed model objects,
- Signif, another numeric vector that will contain the same values as Fitted, but only for regions that are important or significant in some way. At first this is initialised with NA. In the papers we had two such variables, one of them named Decreasing, which contained the values of the estimated trend (i.e. duplicated Fitted) where the trend was either increasing or decreasing significantly. The general principle is the same, however; the non-NA locations will be indicated by a thicker line width and hence we duplicate the Fitted values only for the sections that are interesting.
The transform() line just adds some dummy confidence intervals to the data frame, creating variables Upper and Lower. In the papers these were approximate, point-wise 95% confidence intervals computed using the standard errors of the estimated trend, as returned by predict() with argument se.fit = TRUE.
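As a rough illustration of how such an interval can be computed, the sketch below fits a GAM to simulated data with gamm() from the mgcv package and builds approximate point-wise 95% limits from the standard errors returned by predict(). The simulated data, the model formula, and the use of a normal critical value are all assumptions made for illustration; this is not the model or the code used in the papers.

```r
library(mgcv)  # gamm() and the predict() method for the fitted GAM

## simulated data: a smooth signal plus noise (illustrative assumption)
set.seed(1)
d <- data.frame(x = seq(0, 1, length.out = 200))
d$y <- sin(2 * pi * d$x) + rnorm(200, sd = 0.3)

## fit the model; gamm() returns a list with $lme and $gam components
m <- gamm(y ~ s(x), data = d)

## fitted trend and its standard errors at the observed x values
p <- predict(m$gam, newdata = d, se.fit = TRUE)

## approximate point-wise 95% interval: fit +/- critical value * std. error
crit <- qnorm(0.975)
d <- transform(d,
               Fitted = as.numeric(p$fit),
               Upper  = as.numeric(p$fit + crit * p$se.fit),
               Lower  = as.numeric(p$fit - crit * p$se.fit))
```

The resulting Fitted, Upper and Lower columns play the same role as the dummy columns of the same names in tdat.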
The last section in the code block just selects two random points within the interior of each time series, which we then use to mark the start of the “interesting” period. This and the next 25 values in each time series are used as indices to copy into Signif the corresponding values from Fitted.
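The steps described above can be sketched as follows. This is a reconstruction from the description rather than a verbatim copy of the code behind the post; the seed, the site labels, the width of the dummy intervals, and the range from which the start points are sampled are all assumptions.

```r
set.seed(123)  # assumption: any seed will do, it just makes the walks repeatable
n <- 100       # observations per site

## first four columns: Site, Date, Fitted and an all-NA Signif
tdat <- data.frame(Site   = factor(rep(c("SiteA", "SiteB"), each = n)),
                   Date   = rep(seq(Sys.Date(), by = "1 day", length.out = n),
                                times = 2),
                   Fitted = c(cumsum(rnorm(n)), cumsum(rnorm(n))),
                   Signif = rep(NA_real_, 2 * n))

## dummy confidence limits at a fixed distance either side of the trend
tdat <- transform(tdat, Upper = Fitted + 1.5, Lower = Fitted - 1.5)

## one random start point in the interior of each series;
## the second is offset by n so that it indexes the rows of the second site
take <- sample(10:70, 2) + c(0, n)

## the start point plus the next 25 values form the "interesting" section
idx <- unlist(lapply(take, function(s) seq(s, length.out = 26)))
tdat$Signif[idx] <- tdat$Fitted[idx]
```

Calling head(tdat) or str(tdat) afterwards is a quick way to check that the columns are arranged as described.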
With that done, we can start plotting. I’ll show the lattice version first and then the ggplot one.
Start by loading lattice
The key to creating the sort of plot shown in Figure 1 is to recognise that each of the lines we want to draw can be viewed as a separate y-axis variable. lattice allows for this by specifying multiple values on the left-hand-side of the formula used to describe the plot. We also need to facet the plot on Site. To draw the figure we use
The formula used describes the plot: Fitted + Upper + Lower + Signif ~ Date | Site. The variables we want to plot are all passed to the left-hand-side of the formula, with Date used to the right of ~, indicating the x-axis variable to be used. The last part of the formula indicates conditioning on Site and is what instructs xyplot() to facet the resulting plot into separate panels for each Site. The graphical parameters, col.line among them, all control the aesthetics of the plot, and are specified in the order that the variables appear in the formula. Hence we use solid lines for Fitted and Signif and dashed (type 2) for the confidence intervals (Upper and Lower). In a departure from base graphics, it is the col.line argument that is used to specify the colours used for lines drawn in the panels.
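A call along the lines described above might look like the sketch below. The dummy data are rebuilt first so that the block runs on its own. The formula and the use of col.line come from the text; the particular line types, widths and colours, along with the seed and site labels, are assumptions.

```r
library(lattice)

## rebuild the dummy data (seed, labels and interval width are assumptions)
set.seed(123)
n <- 100
tdat <- data.frame(Site   = factor(rep(c("SiteA", "SiteB"), each = n)),
                   Date   = rep(seq(Sys.Date(), by = "1 day", length.out = n), 2),
                   Fitted = c(cumsum(rnorm(n)), cumsum(rnorm(n))),
                   Signif = rep(NA_real_, 2 * n))
tdat <- transform(tdat, Upper = Fitted + 1.5, Lower = Fitted - 1.5)
take <- sample(10:70, 2) + c(0, n)
idx  <- unlist(lapply(take, function(s) seq(s, length.out = 26)))
tdat$Signif[idx] <- tdat$Fitted[idx]

## one value per left-hand-side variable, in formula order:
## Fitted, Upper, Lower, Signif
p <- xyplot(Fitted + Upper + Lower + Signif ~ Date | Site,
            data = tdat,
            type = "l",
            lty = c(1, 2, 2, 1),                 # solid, dashed, dashed, solid
            lwd = c(1, 1, 1, 3),                 # thicker line for Signif
            col.line = c(rep("black", 3), "red"))
print(p)  # lattice objects need printing explicitly inside scripts
```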
The resulting figure is shown below
Now we move on to drawing the plot using ggplot2. Start by loading the package
With ggplot2 the key is to notice that each of the lines we want to draw on each panel can be drawn using different geom_line() layers, added sequentially to the plot. With each additional layer, we can override the default mapping by changing the y data in each layer using aes() within the geom_line() call. The code to create the plot is shown below.
The first line sets up the basic ggplot object with a mapping and a data object, to which we add a geom_line() layer (line 2). Note that here we don’t specify any arguments to geom_line(), so it picks up defaults from the base object created in line 1. In lines 3 to 5 we add additional geom_line() layers, but now we need to override the mapping of variables to axes on the plot, which we do by updating the mapping. We only need to change the y data used for each layer; the x data are taken from the base object created in line 1. Notice how we specify attributes for these lines outside the aes() calls? This controls how each line is drawn. The final line in the code chunk uses facet_wrap() to split the data up by Site and draw a separate panel for each site.
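Pieced together from the description above, the plotting code might look something like the sketch below. The dummy data are rebuilt first so that the block runs on its own, and the six lines of the plotting call itself are laid out to match the line numbers mentioned above. The line types, width and colour are assumptions, as are the seed and site labels; linewidth assumes ggplot2 3.4.0 or later (older versions use size for line widths).

```r
library(ggplot2)

## rebuild the dummy data (seed, labels and interval width are assumptions)
set.seed(123)
n <- 100
tdat <- data.frame(Site   = factor(rep(c("SiteA", "SiteB"), each = n)),
                   Date   = rep(seq(Sys.Date(), by = "1 day", length.out = n), 2),
                   Fitted = c(cumsum(rnorm(n)), cumsum(rnorm(n))),
                   Signif = rep(NA_real_, 2 * n))
tdat <- transform(tdat, Upper = Fitted + 1.5, Lower = Fitted - 1.5)
take <- sample(10:70, 2) + c(0, n)
idx  <- unlist(lapply(take, function(s) seq(s, length.out = 26)))
tdat$Signif[idx] <- tdat$Fitted[idx]

## base object, default layer, three overriding layers, then facets
p <- ggplot(tdat, aes(x = Date, y = Fitted)) +
    geom_line() +
    geom_line(mapping = aes(y = Upper), linetype = "dashed") +
    geom_line(mapping = aes(y = Lower), linetype = "dashed") +
    geom_line(mapping = aes(y = Signif), linewidth = 1.3, colour = "red", na.rm = TRUE) +
    facet_wrap(~ Site)
print(p)  # needed when the code is run from a script
```

Setting na.rm = TRUE on the Signif layer only silences the warning about the NA values outside the highlighted section; it does not change what is drawn.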
The resulting figure is shown below
I don’t think any of this is particularly revelatory, but, as someone did ask me how it was done, hopefully some readers will find this useful. Happy plotting!
Curtis, C. J., and Simpson, G. L. Trends in bulk deposition of acidity in the UK, 1988–2007, assessed using additive models. Ecological Indicators.
Monteith, D., Evans, C., Henrys, P., Simpson, G., and Malcolm, I. Trends in the hydrochemistry of acid-sensitive surface waters in the UK 1988–2008. Ecological Indicators.
Sarkar, D. (2008). Lattice: Multivariate data visualization with R. New York: Springer. Available at: http://lmdvr.r-forge.r-project.org.
Wickham, H. (2009). ggplot2: Elegant graphics for data analysis. New York: Springer. Available at: http://had.co.nz/ggplot2/book.