Using random effects in GAMs with mgcv
Gavin L. Simpson
2021-02-02T10:50:00-06:00
https://www.fromthebottomoftheheap.net/2021/02/02/random-effects-in-gams/
<p>
There are lots of choices for fitting generalized linear mixed effects models within R, but if you want to include smooth functions of covariates, the choices are limited. One option is to fit the model using <code>gamm()</code> from the <strong>mgcv</strong> 📦 or <code>gamm4()</code> from the <strong>gamm4</strong> 📦, which use <code>lme()</code> (<strong>nlme</strong> 📦) or one of <code>lmer()</code> or <code>glmer()</code> (<strong>lme4</strong> 📦) under the hood respectively. The problem with doing things that way is that you get PQL fitting for non-Gaussian models (😱) and the range of families for handling non-Gaussian responses is quite limited, especially compared with the extended families now available with <code>gam()</code>. <strong>brms</strong> 📦 is a good option if you don't want to do everything by hand, but the MCMC can be slow. Instead, we could use the equivalence between smooths and random effects and use <code>gam()</code> or <code>bam()</code> from <strong>mgcv</strong>. In this post I'll show you how to do just that.
</p>
<h2 id="smooths-as-random-effects">
Smooths as random effects
</h2>
<p>
The sorts of smooths we fit in <strong>mgcv</strong> are (typically) penalized smooths; we choose to use some number of basis functions <span class="math inline">\(k\)</span>, which sets an upper limit on the complexity – wiggliness – of the smooth, and then we estimate parameters for the model by maximizing a penalized log-likelihood. The log-likelihood of the model is a measure of the fit (or lack thereof), while the penalty helps us avoid fitting overly complex smooths.
</p>
<p>
In the sorts of models that can be fitted in <strong>mgcv</strong>, the penalty is a function of the model coefficients, <span class="math inline">\(\boldsymbol{\beta}\)</span>, and a penalty matrix<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>, which we write as <span class="math inline">\(\mathbf{S}\)</span>. The penalty then is <span class="math inline">\(\boldsymbol{\beta}^{\mathsf{T}} \mathbf{S} \boldsymbol{\beta}\)</span>. The penalty matrix measures the wiggliness of each basis function (on the diagonal), and how the wiggliness of one basis function affects the wiggliness of another (the off-diagonals). Just as the <span class="math inline">\(\boldsymbol{\beta}\)</span> scale the individual basis functions, they also scale the penalty values in the penalty matrix; if you were to choose large weights for the most wiggly basis functions, the overall penalty <span class="math inline">\(\boldsymbol{\beta}^{\mathsf{T}} \mathbf{S} \boldsymbol{\beta}\)</span> would increase by a lot more than if we used smaller weights for those really wiggly functions.
</p>
<p>
The penalty then acts to shrink the estimates of <span class="math inline">\(\boldsymbol{\beta}\)</span> away from the values they would take if we weren't doing a penalized fit and were instead fixing the wiggliness of the smooth at the maximum value dictated by <span class="math inline">\(k\)</span>. Put another way, the penalty shrinks the estimates of <span class="math inline">\(\boldsymbol{\beta}\)</span> towards zero.
</p>
<p>
Random effects also involve shrinkage. With a random effect we're trying to model subject-specific effects (subject-specific intercepts, or subject-specific 'slopes' of covariates) without having to explicitly estimate a fixed effect parameter for each subject's intercept or covariate effect. Instead, we think of the subject-specific intercepts or 'slopes' as coming from a distribution, typically a Gaussian distribution, with mean 0 and some variance that is to be estimated. The larger this random effect variance, the greater the variation among subject-specific intercepts, 'slopes', etc. The smaller the random effect variance, the closer to zero the estimated effects are pulled. As a result, random effects shrink, to varying degrees, the estimated subject-specific effects, and how much they do so is related to the random effect variance.
</p>
<p>
If I abuse all standards of notation and represent the estimated random effects with <span class="math inline">\(\boldsymbol{\beta}\)</span>, you might get the feeling that perhaps there is some link between what's happening when we estimate random effects – shrinking the <span class="math inline">\(\boldsymbol{\beta}\)</span> towards zero – and the penalty applied to smooths, which shrinks the <span class="math inline">\(\boldsymbol{\beta}\)</span> towards zero. If you did, you'd be right; there is. And if so, there must be a penalty matrix that we can write down for a random effect – if we assume that each random intercept or 'slope' is a basis function, the penalty matrix <span class="math inline">\(\mathbf{S}\)</span> is a simple diagonal matrix, one row and column per subject, with a constant value on the diagonal (and zeroes everywhere else):
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/random-effects-in-gams-penalty-example-1.png" alt="Penalty matrix corresponding to a random effect for a factor with 10 subjects (levels)." />
<figcaption>
Penalty matrix corresponding to a random effect for a factor with 10 subjects (levels).
</figcaption>
</figure>
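<p>
If you want to convince yourself of this, you can build a random effect smooth by hand with <code>smoothCon()</code> from <strong>mgcv</strong> and inspect its penalty matrix. This is just a sketch, using a hypothetical factor <code>f</code> with 10 levels:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("mgcv")
## a hypothetical factor with 10 subjects (levels), two observations each
df <- data.frame(f = factor(rep(1:10, each = 2)))
## construct the random effect smooth without fitting a model
sm <- smoothCon(s(f, bs = "re"), data = df)[[1]]
sm$S[[1]] # a 10 x 10 identity matrix; the ridge penalty on the random effects</code></pre>
</figure>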
<p>
To complete the picture, when we fit a GAM we're maximising the penalised log-likelihood over both the model parameters <span class="math inline">\(\boldsymbol{\beta}\)</span> and a smoothness parameter, <span class="math inline">\(\lambda\)</span>. It's <span class="math inline">\(\lambda\)</span> that controls the price we pay for wiggliness, as we subtract <span class="math inline">\(\lambda \boldsymbol{\beta}^{\mathsf{T}} \mathbf{S} \boldsymbol{\beta}\)</span> from the log-likelihood. It turns out that the variance of the random effect is equal to the scale parameter (the residual variance <span class="math inline">\(\sigma^2_{\varepsilon}\)</span> in a Gaussian model, for example) divided by <span class="math inline">\(\lambda\)</span>.
</p>
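<p>
Putting those pieces together in the notation used here (implementations differ in where they put the factors of two), we are maximising the penalised log-likelihood, and the implied random effect variance follows from the estimated scale and smoothness parameters:
</p>
<p>
<span class="math display">\[\ell_p(\boldsymbol{\beta}) = \ell(\boldsymbol{\beta}) - \frac{1}{2} \lambda \boldsymbol{\beta}^{\mathsf{T}} \mathbf{S} \boldsymbol{\beta}, \qquad \hat{\sigma}^2_b = \frac{\hat{\sigma}^2_{\varepsilon}}{\hat{\lambda}}\]</span>
</p>
<p>
This conversion from smoothness parameters to variances is essentially what <code>gam.vcomp()</code> (and hence <strong>gratia</strong>'s <code>variance_comp()</code>, used below) performs when reporting variance components.
</p>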
<p>
This link between smooths and random effects is really cool; not only are we able to estimate smooths and GAMs using the machinery of mixed effects models, we can also estimate random effects using all the penalized spline machinery available for GAMs in <strong>mgcv</strong>.
</p>
<p>
OK, so that was all really hand-wavy and skipped over a lot of math and theory<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>, but I hope it gives you the intuition you need to understand how random effects are represented as smooths, through the identity penalty matrix.
</p>
<h2 id="fitting-random-effects-with-mgcv">
Fitting random effects with mgcv
</h2>
<p>
So much for the theory, let's see how this all works in practice.
</p>
<p>
By way of an example, I'm going to use a data set from a study on the effects of testosterone on the growth of rats from <span class="citation" data-cites="Molenberghs2000-vk">Molenberghs and Verbeke (2000)</span>, which was analysed in <span class="citation" data-cites="Fahrmeir2013-xu">Fahrmeir et al. (2013)</span>, from where I also obtained the data. In the experiment, 50 rats were randomly assigned to one of three groups: a control group or a group receiving low or high doses of Decapeptyl, which inhibits testosterone production. The experiment started when the rats were 45 days old, and from day 50 onwards the size of each rat's head was measured via an X-ray image. You can download the data <a href="https://www.uni-goettingen.de/de/551625.html">here</a>.
</p>
<p>
For the example, we'll use the following packages
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pkgs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">,</span><span class="w"> </span><span class="s2">"lme4"</span><span class="p">,</span><span class="w"> </span><span class="s2">"ggplot2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"vroom"</span><span class="p">,</span><span class="w"> </span><span class="s2">"dplyr"</span><span class="p">,</span><span class="w"> </span><span class="s2">"forcats"</span><span class="p">,</span><span class="w"> </span><span class="s2">"tidyr"</span><span class="p">)</span><span class="w">
</span><span class="c1">## install.packages(pkgs, Ncpus = 4)</span><span class="w">
</span><span class="n">vapply</span><span class="p">(</span><span class="n">pkgs</span><span class="p">,</span><span class="w"> </span><span class="n">library</span><span class="p">,</span><span class="w"> </span><span class="n">logical</span><span class="p">(</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">character.only</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">logical.return</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">
</span><span class="n">quietly</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> mgcv lme4 ggplot2 vroom dplyr forcats tidyr
TRUE TRUE TRUE TRUE TRUE TRUE TRUE </code></pre>
</figure>
<p>
We'll also need the development version of the <strong>gratia</strong> 📦, which we can install with the <strong>remotes</strong> 📦 (if you don't have that installed, install it first)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## install.packages("remotes")</span><span class="w">
</span><span class="c1">## remotes::install_github('gavinsimpson/gratia')</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'gratia'</span><span class="p">)</span></code></pre>
</figure>
<p>
We load the data – ignore the warning about new names, as we skipped that column anyway
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">rats</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vroom</span><span class="p">(</span><span class="s1">'rats.txt'</span><span class="p">,</span><span class="w"> </span><span class="n">delim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">' '</span><span class="p">,</span><span class="w"> </span><span class="n">col_types</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'dddddddddddd-'</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">New names:
* `` -> ...13</code></pre>
</figure>
<p>
Next we need to prepare the data for modelling. The variable <code>transf_time</code> is the main covariate of interest. It relates to the age of the rats in days via the transformation
</p>
<p>
<span class="math display">[(1 + ( - 45) / 10)]</span>
</p>
<p>
where <span class="math inline">\(t\)</span> is the <code>time</code> variable in the data set. We also need to convert the <code>group</code> variable to a factor with useful levels to create a <code>treatment</code> variable, and we convert <code>subject</code> – an identifier for each individual rat – to a factor
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">rats</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rats</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">treatment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fct_recode</span><span class="p">(</span><span class="n">factor</span><span class="p">(</span><span class="n">group</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">3</span><span class="p">)),</span><span class="w">
</span><span class="n">Low</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'1'</span><span class="p">,</span><span class="w">
</span><span class="n">High</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'3'</span><span class="p">,</span><span class="w">
</span><span class="n">Control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'2'</span><span class="p">),</span><span class="w">
</span><span class="n">subject</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">subject</span><span class="p">))</span></code></pre>
</figure>
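<p>
As a quick check – a sketch, assuming the log transformation given above – <code>transf_time</code> can be recomputed from <code>time</code>:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">rats %>%
    mutate(check = log(1 + (time - 45) / 10)) %>%
    summarise(max_diff = max(abs(check - transf_time), na.rm = TRUE))
## max_diff should be ~0 if the transformation matches</code></pre>
</figure>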
<p>
The number of observations per rat is variable, with only 22 of the 50 rats having the complete seven measurements by day 110
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">rats</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">na.omit</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">subject</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n_rats"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 7 x 2
n n_rats
* <int> <int>
1 1 4
2 2 3
3 3 5
4 4 9
5 5 5
6 6 2
7 7 22</code></pre>
</figure>
<p>
so simply averaging the response within subjects and doing an ANOVA isn't an option.
</p>
<p>
Before we fit the models and explore how to work with random effects in <strong>mgcv</strong>, we'll plot the data
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plt_labs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Head height (distance in pixels)'</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Age in days'</span><span class="p">,</span><span class="w">
</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Treatment'</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">rats</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">response</span><span class="p">,</span><span class="w">
</span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">subject</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">treatment</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">treatment</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">plt_labs</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning: Removed 98 row(s) containing missing values (geom_path).</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/random-effects-in-gams-plot-rat-data-1.png" alt="Plot of the rat hormone therapy data" />
<figcaption>
Plot of the rat hormone therapy data
</figcaption>
</figure>
<p>
The model fitted in <span class="citation" data-cites="Fahrmeir2013-xu">Fahrmeir et al. (2013)</span> is
</p>
<p>
<span class="math display">[y_{ij} = <em>0 + </em>{0i} + <em>1 L_i t</em>{ij} + <em>2 H_i t</em>{ij} + <em>3 C_i t</em>{ij} + <em>{1i} t</em>{ij} + _{ij}]</span>
</p>
<p>
where
</p>
<ul>
<li>
<span class="math inline">(_0)</span> is the population mean of the response at the start of the treatment
</li>
<li>
<span class="math inline">(L_i)</span>, <span class="math inline">(H_i)</span>, <span class="math inline">(C_i)</span> are dummy variables encoding for each treatment group
</li>
<li>
<span class="math inline">(_{0i})</span> is the rat-specific mean (random intercept)
</li>
<li>
<span class="math inline">(<em>{qi} t</em>{ij})</span> is the rat-specific effect of <code>transf_time</code> (random slope)
</li>
</ul>
<p>
If this isn't very clear – it took me a little while to grok what this meant and translate it into R speak – note that each of <span class="math inline">\(\beta_1\)</span>, <span class="math inline">\(\beta_2\)</span>, and <span class="math inline">\(\beta_3\)</span> is associated with an interaction between the dummy variable coding for the treatment and the time variable. So we have a model with an intercept and three interaction terms, and no main effects.
</p>
<p>
In <code>lmer()</code> we can fit this model with (ignore the singular fit warning for now)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1_lmer</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lmer</span><span class="p">(</span><span class="n">response</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">treatment</span><span class="o">:</span><span class="n">transf_time</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">subject</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">transf_time</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">subject</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rats</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">boundary (singular) fit: see ?isSingular</code></pre>
</figure>
<p>
If you're not familiar with this model specification for the random effects, it specifies uncorrelated random effects for the subject-specific means (random intercept; <code>(1 | subject)</code>) and the subject-specific effects of <code>transf_time</code> (random slope; <code>(0 + transf_time | subject)</code>). The <code>0</code> in the formula for the latter suppresses the (random) intercept, as we already included that as a separate term.
</p>
<p>
The reason we're fitting uncorrelated random effects is that this is all <strong>mgcv</strong> can fit; there's no way to encode a covariance term between the two random effects.
</p>
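<p>
For comparison, the correlated version, which has no <code>gam()</code> equivalent, would be specified in <strong>lme4</strong> like this (shown for illustration only, not run):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## correlated random intercepts and slopes; gam() cannot represent
## the covariance between the two terms
## m_corr <- lmer(response ~ treatment:transf_time +
##                    (1 + transf_time | subject),
##                data = rats)</code></pre>
</figure>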
<p>
The equivalent model fitted using <code>gam()</code> is
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1_gam</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">response</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">treatment</span><span class="o">:</span><span class="n">transf_time</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">s</span><span class="p">(</span><span class="n">subject</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'re'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">s</span><span class="p">(</span><span class="n">subject</span><span class="p">,</span><span class="w"> </span><span class="n">transf_time</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'re'</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rats</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'REML'</span><span class="p">)</span></code></pre>
</figure>
<p>
Note:
</p>
<ol type="1">
<li>
we specify two separate <em>random effect</em> smooths, one per random term,
</li>
<li>
we indicate that the smooth should be a random effect with <code>bs = 're'</code>,
</li>
<li>
any grouping variables <strong>must</strong> be coded as factors – that's why we converted <code>subject</code> (which is an integer vector) to a factor right after importing the data.
</li>
</ol>
<p>
Let's compare the fixed effect terms; first for the <code>lmer()</code> version
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">fixef</span><span class="p">(</span><span class="n">m1_lmer</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> (Intercept) treatmentControl:transf_time
68.607386 6.871128
treatmentLow:transf_time treatmentHigh:transf_time
7.506897 7.313854 </code></pre>
</figure>
<p>
and for the <code>gam()</code> version
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">coef</span><span class="p">(</span><span class="n">m1_gam</span><span class="p">)[</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">]</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> (Intercept) treatmentControl:transf_time
68.607385 6.871130
treatmentLow:transf_time treatmentHigh:transf_time
7.506897 7.313859 </code></pre>
</figure>
<p>
which are close enough.
</p>
<p>
Next let's look at the estimated variances of the random effect terms. First for the <code>lmer()</code> model:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m1_lmer</span><span class="p">)</span><span class="o">$</span><span class="n">varcor</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> Groups Name Std.Dev.
subject (Intercept) 1.8881
subject.1 transf_time 0.0000
Residual 1.2020 </code></pre>
</figure>
<p>
and now for the <code>gam()</code> model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">variance_comp</span><span class="p">(</span><span class="n">m1_gam</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 3 x 5
component variance std_dev lower_ci upper_ci
<chr> <dbl> <dbl> <dbl> <dbl>
1 s(subject) 3.56 1.89 1.51e+ 0 2.36e+ 0
2 s(subject,transf_time) 0.0000257 0.00507 8.21e-42 3.14e+36
3 scale 1.44 1.20 1.09e+ 0 1.33e+ 0
</figure>
<p>
Apart from being close enough that the differences don't matter, we should also note that the variance for the rat-specific effect of <code>transf_time</code> is effectively 0. This is likely the cause of the singular fit warning from <code>lmer()</code>. The <code>lower_ci</code> and <code>upper_ci</code> variables contain the limits of a 95% confidence interval on the standard deviation of each variance component; the coverage can be controlled via the <code>coverage</code> argument to <code>variance_comp()</code>. The confidence interval for the rat-specific time effect variance is huge, again indicating that there really isn't much variation at all in this component.
</p>
<p>
Here we used the <code>variance_comp()</code> function from <strong>gratia</strong> to extract the variance components, which expresses the random effects as the equivalent variance components you'd see in a mixed model output. <code>variance_comp()</code> is a simple wrapper around <code>mgcv::gam.vcomp()</code>, which does all the hard work, but it suppresses the printed output produced by <code>gam.vcomp()</code> and returns the variance components as a tibble.
</p>
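<p>
If you prefer to stay within <strong>mgcv</strong> itself, the underlying call is
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## prints a summary and returns the standard deviations of the
## variance components with their confidence limits
vc <- gam.vcomp(m1_gam)
vc</code></pre>
</figure>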
<p>
You can see a nicer version of the variance components for <code>lmer()</code> by printing the whole <code>summary()</code> but it produces a lot of output; the bit we are interested in just now is in the section labelled <em>Random effects:</em>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m1_lmer</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Linear mixed model fit by REML ['lmerMod']
Formula: response ~ treatment:transf_time + (1 | subject) + (0 + transf_time |
subject)
Data: rats
REML criterion at convergence: 932.4
Scaled residuals:
Min 1Q Median 3Q Max
-2.25576 -0.65898 -0.01164 0.58358 2.88310
Random effects:
Groups Name Variance Std.Dev.
subject (Intercept) 3.565 1.888
subject.1 transf_time 0.000 0.000
Residual 1.445 1.202
Number of obs: 252, groups: subject, 50
Fixed effects:
Estimate Std. Error t value
(Intercept) 68.6074 0.3312 207.13
treatmentControl:transf_time 6.8711 0.2276 30.19
treatmentLow:transf_time 7.5069 0.2252 33.34
treatmentHigh:transf_time 7.3139 0.2808 26.05
Correlation of Fixed Effects:
(Intr) trtC:_ trtL:_
trtmntCnt:_ -0.340
trtmntLw:t_ -0.351 0.119
trtmntHgh:_ -0.327 0.111 0.115
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingular</code></pre>
</figure>
<p>
One of the nice things about the output from the <code>gam()</code> model is that the <code>summary()</code> contains a test for the random effects
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m1_gam</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
response ~ treatment:transf_time + s(subject, bs = "re") + s(subject,
transf_time, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.6074 0.3312 207.13 <2e-16 ***
treatmentControl:transf_time 6.8711 0.2276 30.19 <2e-16 ***
treatmentLow:transf_time 7.5069 0.2252 33.34 <2e-16 ***
treatmentHigh:transf_time 7.3139 0.2808 26.05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(subject) 43.723610 49 11.51 <2e-16 ***
s(subject,transf_time) 0.001387 47 0.00 0.744
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.926 Deviance explained = 94%
-REML = 466.2 Scale est. = 1.4448 n = 252</code></pre>
</figure>
<p>
This test is due to <span class="citation" data-cites="Wood2013-gz">Wood (2013)</span>. It is based on a likelihood ratio test and uses a reference distribution that is appropriate for testing a null hypothesis that lies on the boundary of the parameter space (the null, that the variance is 0, is on the lower boundary of possible values for the parameter – you can't have a negative variance!).
</p>
<p>
There is little evidence in support of the rat-specific time effects, reflecting what we saw when we looked at the variance components above.
</p>
<p>
If we look at the estimated degrees of freedom (EDF; the <code>edf</code> column) for each of the 'smooths', we see the shrinkage in action. The <code>Ref.df</code> column contains the maximum degrees of freedom for each term, used in the calculation of the <em>p</em> value. The rat-specific mean distances – the <code>s(subject)</code> term – have only been shrunk a little, to an EDF of ~43.7. In contrast, the EDF for the rat-specific effects of time has been shrunk to effectively zero.
</p>
<p>
The EDFs for smooths can be extracted from a fitted model with <code>edf()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">edf</span><span class="p">(</span><span class="n">m1_gam</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 2 x 2
smooth edf
<chr> <dbl>
1 s(subject) 43.7
2 s(subject,transf_time) 0.00139</code></pre>
</figure>
<p>
To plot the estimated time effects for each rat, we need to produce a new data frame containing the unique values of <code>transf_time</code> for each rat, along with the relevant treatment value for that rat. We do this with <code>expand()</code> and <code>nesting()</code> from the <strong>tidyr</strong> 📦.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">new_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tidyr</span><span class="o">::</span><span class="n">expand</span><span class="p">(</span><span class="n">rats</span><span class="p">,</span><span class="w"> </span><span class="n">nesting</span><span class="p">(</span><span class="n">subject</span><span class="p">,</span><span class="w"> </span><span class="n">treatment</span><span class="p">),</span><span class="w">
</span><span class="n">transf_time</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unique</span><span class="p">(</span><span class="n">transf_time</span><span class="p">))</span></code></pre>
</figure>
<p>
which we then use to predict from the model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1_pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bind_cols</span><span class="p">(</span><span class="n">new_data</span><span class="p">,</span><span class="w">
</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m1_gam</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w">
</span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)))</span></code></pre>
</figure>
<p>
which gives us something we can plot easily with <strong>ggplot</strong>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">m1_pred</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">transf_time</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">subject</span><span class="p">,</span><span class="w">
</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">treatment</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">treatment</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">plt_labs</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/random-effects-in-gams-plot-m1-gam-predicted-1.png" alt="Fitted growth curves from the mixed effect model fitted using gam()" />
<figcaption>
Fitted growth curves from the mixed effect model fitted using <code>gam()</code>
</figcaption>
</figure>
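<p>
If you also want approximate pointwise confidence intervals around these curves, a minimal sketch using the standard errors returned by <code>predict()</code> is
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## approximate 95% pointwise interval from the standard errors
m1_pred <- m1_pred %>%
    mutate(lower = fit - (1.96 * se.fit),
           upper = fit + (1.96 * se.fit))</code></pre>
</figure>
<p>
which could then be drawn with <code>geom_ribbon()</code>, mapping <code>lower</code> and <code>upper</code> to <code>ymin</code> and <code>ymax</code>.
</p>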
<p>
We can also compare the fitted curves with the observed data
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">m1_pred</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">transf_time</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">subject</span><span class="p">,</span><span class="w">
</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">treatment</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rats</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">response</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">subject</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">plt_labs</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning: Removed 98 rows containing missing values (geom_point).</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/random-effects-in-gams-plot-m1-gam-compare-1.png" alt="Observed values and fitted growth curves from the mixed effect model fitted using gam()" />
<figcaption>
Observed values and fitted growth curves from the mixed effect model fitted using <code>gam()</code>
</figcaption>
</figure>
<p>
A simpler model, which drops the rat-specific effects of <code>transf_time</code>, is
</p>
<p>
<span class="math display">[y_{ij} = <em>0 + </em>{0i} + <em>1 L_i t</em>{ij} + <em>2 H_i t</em>{ij} + <em>3 C_i t</em>{ij} + _{ij}]</span>
</p>
<p>
which excludes the <span class="math inline">\(\gamma_{1i} t_{ij}\)</span> term, removing the rat-specific time effects from the model.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m2_lmer</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lmer</span><span class="p">(</span><span class="n">response</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">treatment</span><span class="o">:</span><span class="n">transf_time</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">subject</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rats</span><span class="p">)</span><span class="w">
</span><span class="n">m2_gam</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">response</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">treatment</span><span class="o">:</span><span class="n">transf_time</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">s</span><span class="p">(</span><span class="n">subject</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'re'</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rats</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'REML'</span><span class="p">)</span></code></pre>
</figure>
<p>
As we should by now expect, the two models have estimated variance components that are essentially equivalent. First for the <code>lmer()</code> fit:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m2_lmer</span><span class="p">)</span><span class="o">$</span><span class="n">varcor</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> Groups Name Std.Dev.
subject (Intercept) 1.8881
Residual 1.2020 </code></pre>
</figure>
<p>
and now for the <code>gam()</code> version
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">variance_comp</span><span class="p">(</span><span class="n">m2_gam</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 2 x 5
component variance std_dev lower_ci upper_ci
<chr> <dbl> <dbl> <dbl> <dbl>
1 s(subject) 3.56 1.89 1.51 2.36
2 scale 1.44 1.20 1.09 1.33</code></pre>
</figure>
<p>
We could use the <code>anova()</code> method for <code>"gam"</code> fits, but for fully penalized terms like random effects the test isn't very good and <em>p</em> values can be badly biased. Wood (2017, p. 315) says of the test "As expected, the test is clearly useless for comparing models differing in [their] random effect structure." So, maybe give this one a miss.
</p>
<p>
Using <code>AIC()</code> to compare the models is also an option:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">AIC</span><span class="p">(</span><span class="n">m2_gam</span><span class="p">,</span><span class="w"> </span><span class="n">m1_gam</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> df AIC
m2_gam 48.98553 852.9313
m1_gam 48.98931 852.9371</code></pre>
</figure>
<p>
AIC favours the simpler model, if only just, as the fits of the two models are essentially the same. Note that because the EDF of the <code>s(subject, transf_time)</code> term was so close to zero, we don't pay much of a penalty for including it in the model, and hence the AICs of the two models are very similar (typically, where two models fit equally well, we'd expect the AIC of the more complex one to be larger).
</p>
<p>
Note that the AIC computed for the <code>gam()</code> model is a <em>conditional</em> AIC, where the likelihood is evaluated with all model coefficients set to their maximum penalized likelihood estimates. The AIC for an <code>lmer()</code> fit is a <em>marginal</em> AIC, where all the penalized coefficients are viewed as random effects and integrated out of the joint density of the response and random effects.
</p>
<p>
The conditional AIC for the <code>gam()</code> fit would be anti-conservative, especially so for models containing random effects, if no steps were taken to account for smoothness parameter selection in the EDF calculation. The upshot is that such a conditional AIC would typically choose a model with a random effects structure that isn't in the true model. The <code>AIC()</code> method for <code>gam()</code> fits applies a suitable correction to the model EDF to account for smoothness parameter selection, resulting in an information criterion with mostly good properties.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">draw</span><span class="p">(</span><span class="n">m2_gam</span><span class="p">,</span><span class="w"> </span><span class="n">parametric</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/random-effects-in-gams-draw-gam-model-1.png" alt="QQ-plot of the rat-specific mean distance effects" />
<figcaption>
QQ-plot of the rat-specific mean distance effects
</figcaption>
</figure>
<p>
We need <code>parametric = FALSE</code> here because at the time of writing there is a bug in the code that handles parametric fixed effects.
</p>
<h2 id="its-not-all-good-news">
It's not all good news
</h2>
<p>
It all seems a little too good to be true, doesn't it? We have a way to fit models with random effects that works well, allows for tests of random effect terms against a null of zero variance, and allows us to use all the extended families that <code>gam()</code> provides, including some complex distributional model families.
</p>
<p>
Well, as they say, there is no free lunch; the main issue with fitting random effects as smooths in <code>gam()</code> is efficiency. <code>lmer()</code> and <code>glmer()</code> use very efficient algorithms for fitting the model, including sparse matrices for the model terms. Because <code>gam()</code> needs the full penalty matrix for each random effect, and currently doesn't use sparse matrices for efficient computation, <code>gam()</code> fits will get very slow as the number of random effect levels increases: the larger the number of subjects (levels), the slower things will get. The same happens as you include more, and more complex, random effect terms in the model.
</p>
<p>
Basically, if you have random effects with many hundreds or thousands of levels (subjects), expect the time it takes to fit your <code>gam()</code> to increase dramatically, and expect the memory usage to increase markedly too.
</p>
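<p>
One partial remedy is <code>bam()</code>, mentioned at the start of the post, which is designed for large data sets. A sketch of the equivalent call, using the fast REML criterion and discretised covariates, is
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## same model as m1_gam, but using bam()'s more efficient fitting methods
m1_bam <- bam(response ~ treatment:transf_time +
                  s(subject, bs = 're') +
                  s(subject, transf_time, bs = 're'),
              data = rats, method = 'fREML', discrete = TRUE)</code></pre>
</figure>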
<p>
Also, running <code>summary()</code> on a model with many random effect levels, or with many random effect terms, is going to be slow: the test for the random effect terms is computationally expensive. If you are mostly interested in the other model terms, setting the <code>re.test</code> argument to <code>FALSE</code> will skip the tests for random effects (and other terms with a zero-dimension null space), allowing the summary for the other terms to be computed quickly.
</p>
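<p>
For example:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## skip the expensive tests of the random effect terms
summary(m1_gam, re.test = FALSE)</code></pre>
</figure>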
<h2 id="fin">
Fin
</h2>
<p>
In this post I showed how random effects can be represented as smooths and how to use them practically in <code>gam()</code> models. I hope you found it useful. If you have any comments or questions, let me know in the comments below.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Fahrmeir2013-xu">
<p>
Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. (2013). <em>Regression: Models, methods and applications</em>. Springer Berlin Heidelberg. doi:<a href="https://doi.org/10.1007/978-3-642-34333-9">10.1007/978-3-642-34333-9</a>.
</p>
</div>
<div id="ref-Molenberghs2000-vk">
<p>
Molenberghs, G., and Verbeke, G. (2000). <em>Linear mixed models for longitudinal data</em>. Springer, New York, NY. doi:<a href="https://doi.org/10.1007/978-1-4419-0300-6">10.1007/978-1-4419-0300-6</a>.
</p>
</div>
<div id="ref-Wood2013-gz">
<p>
Wood, S. N. (2013). A simple test for random effects in regression models. <em>Biometrika</em> 100, 1005ā1010. doi:<a href="https://doi.org/10.1093/biomet/ast038">10.1093/biomet/ast038</a>.
</p>
</div>
<div id="ref-Wood2017-qi">
<p>
Wood, S. N. (2017). <em>Generalized Additive Models: An introduction with R, second edition</em>. CRC Press.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
or matrices; smooths can have multiple penalty matrices, which are stacked block-diagonally in <span class="math inline">\(\mathbf{S}\)</span>. For simplicity's sake I'm just going to assume a single penalty matrix.<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
<li id="fn2">
<p>
You can read about this in brief in §5.8 of <span class="citation" data-cites="Wood2017-qi">Wood (2017)</span> and follow up via the references therein.<a href="#fnref2" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
Getting data from the Canada Covid-19 Tracker using R
Gavin L. Simpson
2021-01-31T10:49:00-06:00
https://www.fromthebottomoftheheap.net/2021/01/31/getting-data-from-canada-covid-19-tracker-using-r/
<p>
Last semester (Fall 2020) I taught a new course in healthcare data science for the <a href="https://www.schoolofpublicpolicy.sk.ca/">Johnson Shoyama Graduate School in Public Policy</a>. One of the final topics of the course was querying application programming interfaces (APIs) from within R. The example we used was querying data on the Covid-19 pandemic from the <a href="https://covid19tracker.ca">Covid-19 Tracker Canada</a>, which has a simple API that's easy to work with. In this post I'll show how we accessed the API from within R and converted the query responses into something we can work with easily.
</p>
<p>
There are many ways of querying APIs in R via a range of packages. Here, I'm going to use the <strong>httr</strong> 📦 to query the API and the <strong>jsonlite</strong> 📦 to convert the API's response to our query into something more useful. The packages we need are listed in the chunk below – if you don't have them, uncomment the <code>install.packages()</code> line and change <code>Ncpus</code> to something suitable for your computer.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pkgs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'httr'</span><span class="p">,</span><span class="w"> </span><span class="s1">'jsonlite'</span><span class="p">,</span><span class="w"> </span><span class="s1">'dplyr'</span><span class="p">,</span><span class="w"> </span><span class="s1">'ggplot2'</span><span class="p">,</span><span class="w"> </span><span class="s1">'purrr'</span><span class="p">)</span><span class="w">
</span><span class="c1">## install.packages(pkgs, Ncpus = 4)</span><span class="w">
</span><span class="n">vapply</span><span class="p">(</span><span class="n">pkgs</span><span class="p">,</span><span class="w"> </span><span class="n">library</span><span class="p">,</span><span class="w"> </span><span class="n">logical</span><span class="p">(</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">logical.return</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">character.only</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> httr jsonlite dplyr ggplot2 purrr
TRUE TRUE TRUE TRUE TRUE </code></pre>
</figure>
<p>
The kind of API we're going to query is a RESTful API – <strong>RE</strong>presentational <strong>S</strong>tate <strong>T</strong>ransfer. To make the query, we need to identify the resource we want and then send the query using HTTP, the <strong>H</strong>yper<strong>T</strong>ext <strong>T</strong>ransfer <strong>P</strong>rotocol. The resource identity is specified using a uniform resource identifier, or URI
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/ps4ds-figure-14-2.png" alt="Graphic showing the parts of a URI. Source: Programming Skills for Data Science (Ross & Freeman, 2018)" />
<figcaption>
Graphic showing the parts of a URI. Source: Programming Skills for Data Science (Ross & Freeman, 2018)
</figcaption>
</figure>
<p>
The URI comprises four parts
</p>
<ol type="1">
<li>
the protocol
</li>
<li>
the base URI
</li>
<li>
the endpoint
</li>
<li>
additional query parameters
</li>
</ol>
<p>
For the Covid-19 Tracker Canada, we'll use the HTTPS protocol for secure HTTP, and its base URI is <code>api.covid19tracker.ca</code>. The <em>endpoint</em> is the specific location of the data you want to access. For the API we're querying, endpoints include
</p>
<ul>
<li>
<code>/reports</code>
</li>
<li>
<code>/cases</code>
</li>
<li>
<code>/fatalities</code>
</li>
<li>
<code>/provinces</code>
</li>
</ul>
<p>
Endpoints can also allow multiple sub-resources; these are variables and take the form <code>:var_name</code>. For example, the <code>/reports/province</code> endpoint allows the province to be specified as a sub-resource. It is documented as <code>/reports/province/:code</code>, so we would specify endpoints such as <code>/reports/province/SK</code>, where we are setting <code>:code</code> to <code>SK</code>, as shown in the sketch below.
</p>
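<p>
For example, a small (hypothetical) helper that fills in the <code>:code</code> sub-resource might look like
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## hypothetical helper: substitute a province code into the endpoint
province_endpoint <- function(code) {
    paste0("/reports/province/", tolower(code))
}
province_endpoint("SK") # "/reports/province/sk"</code></pre>
</figure>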
<p>
The final part of the URI is the query parameters, which allow some fine control over what is requested from the endpoint. These are added as key-value pairs following a <code>?</code>, with pairs separated by <code>&</code>. The key is the name of the parameter, and the value is what you want to pass to that parameter. For example, when querying cases, we can specify the province and how many cases are returned per page using
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">/</span><span class="n">cases</span><span class="o">?</span><span class="n">province</span><span class="o">=</span><span class="n">ON</span><span class="o">&</span><span class="n">per_page</span><span class="o">=</span><span class="m">50</span></code></pre>
</figure>
<p>
Which endpoints and query parameters are supported is documented by the specific API you are trying to access, so always take some time to familiarise yourself with the API itself. For the Covid-19 Tracker Canada, the documentation is also at <a href="https://api.covid19tracker.ca">api.covid19tracker.ca</a>.
</p>
<p>
It's usually best to build the URI up from these parts, stored as separate objects within R
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">base</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"https://api.covid19tracker.ca"</span><span class="w">
</span><span class="n">ep</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"/reports/province/sk"</span><span class="w">
</span><span class="n">query</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"?date=2021-01-31"</span><span class="w">
</span><span class="n">req</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="n">base</span><span class="p">,</span><span class="w"> </span><span class="n">ep</span><span class="p">,</span><span class="w"> </span><span class="n">query</span><span class="p">)</span><span class="w">
</span><span class="n">req</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] "https://api.covid19tracker.ca/reports/province/sk?date=2021-01-31"</code></pre>
</figure>
<p>
The HTTP request involves using a <em>verb</em> and the URI – here we will use the <code>GET</code> verb. In the <strong>httr</strong> 📦 the <code>GET</code> verb is provided by the <code>GET()</code> function
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">response</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">GET</span><span class="p">(</span><span class="n">req</span><span class="p">)</span></code></pre>
</figure>
<p>
The response consists of two parts
</p>
<ol type="1">
<li>
the headers
</li>
<li>
the body
</li>
</ol>
<p>
The headers contain information about the request and response, while the body contains the result of the query. You can access these components of the response using <code>headers()</code> and <code>content()</code> respectively.
</p>
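<p>
For example, with the <code>response</code> object created above:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">headers(response)[["content-type"]] # "application/json"
## let httr parse the JSON body into an R list
str(content(response, as = "parsed"), max.level = 1)</code></pre>
</figure>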
<p>
When you print <code>response</code> you'll see a brief summary of the response metadata
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">response</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Response [https://api.covid19tracker.ca/reports/province/sk?date=2021-01-31]
Date: 2021-01-31 22:42
Status: 200
Content-Type: application/json
Size: 488 B</code></pre>
</figure>
<p>
The status code is important; 200 means <strong>success</strong> and anything else likely indicates some form of failure. Keep an eye on the status code of your queries. If you're wrapping these calls in a function, the <code>warn_for_status()</code> and <code>stop_for_status()</code> functions query the status and throw a warning or an error, respectively, if the request failed.
</p>
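<p>
A minimal sketch of how you might use these helpers when wrapping a request in a function of your own (the function name is just for illustration):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">get_tracker <- function(uri) {
    response <- GET(uri)
    ## throws an informative error unless the status indicates success
    stop_for_status(response)
    response
}</code></pre>
</figure>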
<p>
The body of the response can be accessed as a generic R list, as the raw bytes of the response, or as plain text. When viewed as text, we see that the text format is JSON
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">jsonlite</span><span class="o">::</span><span class="n">prettify</span><span class="p">(</span><span class="n">content</span><span class="p">(</span><span class="n">response</span><span class="p">,</span><span class="w"> </span><span class="s1">'text'</span><span class="p">,</span><span class="w"> </span><span class="n">encoding</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'UTF-8'</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">{
"province": "sk",
"data": [
{
"date": "2021-01-31",
"change_cases": 238,
"change_fatalities": 4,
"change_tests": 2459,
"change_hospitalizations": -3,
"change_criticals": 3,
"change_recoveries": 223,
"change_vaccinations": 120,
"change_vaccines_distributed": 0,
"change_vaccinated": 0,
"total_cases": 23863,
"total_fatalities": 304,
"total_tests": 508638,
"total_hospitalizations": 203,
"total_criticals": 31,
"total_recoveries": 21026,
"total_vaccinations": 35359,
"total_vaccines_distributed": 32725,
"total_vaccinated": 4637
}
]
}
</code></pre>
</figure>
<p>
Above, I used the <code>prettify()</code> function to display the JSON in a human-readable format. Note also that I'm specifying the encoding explicitly to be UTF-8 as that's what my Linux system uses. If you're not sure about the encoding for your system, just leave the <code>encoding</code> argument off and you'll see a message indicating what encoding was used.
</p>
<p>
To actually parse the JSON into a similar R object we use <code>jsonlite::fromJSON()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">parsed</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fromJSON</span><span class="p">(</span><span class="n">content</span><span class="p">(</span><span class="n">response</span><span class="p">,</span><span class="w"> </span><span class="s1">'text'</span><span class="p">,</span><span class="w"> </span><span class="n">encoding</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'UTF-8'</span><span class="p">))</span><span class="w">
</span><span class="n">str</span><span class="p">(</span><span class="n">parsed</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">List of 2
$ province: chr "sk"
$ data :'data.frame': 1 obs. of 19 variables:
..$ date : chr "2021-01-31"
..$ change_cases : int 238
..$ change_fatalities : int 4
..$ change_tests : int 2459
..$ change_hospitalizations : int -3
..$ change_criticals : int 3
..$ change_recoveries : int 223
..$ change_vaccinations : int 120
..$ change_vaccines_distributed: int 0
..$ change_vaccinated : int 0
..$ total_cases : int 23863
..$ total_fatalities : int 304
..$ total_tests : int 508638
..$ total_hospitalizations : int 203
..$ total_criticals : int 31
..$ total_recoveries : int 21026
..$ total_vaccinations : int 35359
..$ total_vaccines_distributed : int 32725
..$ total_vaccinated : int 4637</code></pre>
</figure>
<p>
What we're most interested in is the <code>$data</code> component, but you can see that <strong>jsonlite</strong> 📦 has converted the JSON to an R list and, where appropriate, has converted arrays to data frames, as for <code>$data</code> here. Exactly what is returned by the API will be specific to each API, so read the documentation for the API you want and look at the structure of what is returned to identify the names of relevant components etc.
</p>
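<p>
For example, to pull out just the data frame of results (<code>sk_report</code> is just an illustrative name):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## the data component is already a data frame thanks to jsonlite
sk_report <- parsed$data
sk_report$total_cases</code></pre>
</figure>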
<h2 id="covid-19-cases-per-day">
Covid-19 cases per day
</h2>
<p>
Now that we've had a crash course in querying an API, let's do something substantive and query the Covid-19 case data for my adopted home province of Saskatchewan. For this we want the <code>/reports</code> endpoint and we can specify the province as a sub-resource.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">base</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'https://api.covid19tracker.ca'</span><span class="w">
</span><span class="n">ep</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'/reports/province/sk'</span><span class="w">
</span><span class="n">req</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="n">base</span><span class="p">,</span><span class="w"> </span><span class="n">ep</span><span class="p">)</span><span class="w">
</span><span class="n">response</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">GET</span><span class="p">(</span><span class="n">req</span><span class="p">)</span><span class="w">
</span><span class="n">cases</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">response</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">content</span><span class="p">(</span><span class="n">as</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'text'</span><span class="p">,</span><span class="w"> </span><span class="n">encoding</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'UTF-8'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">fromJSON</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pluck</span><span class="p">(</span><span class="s1">'data'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">as_tibble</span><span class="p">()</span><span class="w">
</span><span class="n">cases</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 373 x 19
date change_cases change_fataliti… change_tests change_hospital…
<chr> <int> <int> <int> <int>
1 2020… NA NA 0 0
2 2020… NA NA 0 0
3 2020… NA NA 0 0
4 2020… NA NA 0 0
5 2020… NA NA 0 0
6 2020… NA NA 0 0
7 2020… NA NA 0 0
8 2020… NA NA 0 0
9 2020… NA NA 0 0
10 2020… NA NA 0 0
# … with 363 more rows, and 14 more variables: change_criticals <int>,
# change_recoveries <int>, change_vaccinations <int>,
# change_vaccines_distributed <int>, change_vaccinated <int>,
# total_cases <int>, total_fatalities <int>, total_tests <int>,
# total_hospitalizations <int>, total_criticals <int>,
# total_recoveries <int>, total_vaccinations <int>,
# total_vaccines_distributed <int>, total_vaccinated <int></code></pre>
</figure>
<p>
At the moment the <code>date</code> variable is stored as a simple character vector. If we convert that to a <code>"Date"</code> object, <strong>ggplot2</strong> 📦 will draw a nicely formatted time axis for us
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">cases</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.Date</span><span class="p">(</span><span class="n">date</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">date</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">change_cases</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Cases'</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Daily Covid-19 cases in Saskatchewan'</span><span class="p">,</span><span class="w">
</span><span class="n">caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Source: N. Little. COVID-19 Tracker Canada (2021), COVID19tracker.ca'</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning: Removed 47 row(s) containing missing values (geom_path).</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/getting-data-from-canada-covid-19-tracker-using-r-plot-cases-1.png" alt="Daily Covid-19 Cases in Saskatchewan" />
<figcaption>
Daily Covid-19 Cases in Saskatchewan
</figcaption>
</figure>
<p>
Yeah, we're not doing very well in this province 😞🤬
</p>
<p>
Hope you enjoyed the post; if you have comments or questions, ask them in the comments section below.
</p>
<h3 id="references">
References
</h3>
Two new versions of gratia released
Gavin L. Simpson
2021-01-30T15:00:00-06:00
2021-01-30T15:00:00-06:00
https://www.fromthebottomoftheheap.net/2021/01/30/two-new-versions-of-gratia-on-cran/
<p>
While the Covid-19 pandemic and teaching a new course in the fall put paid to most of my development time last year, some time off work this January allowed me time to work on <strong>gratia</strong> 📦 again. I released 0.5.0 to CRAN in part to fix an issue with tests not running on the new M1 chips from Apple because I wasn't using <strong>vdiffr</strong> 📦 conditionally. Version 0.5.1 followed shortly thereafter as I'd messed up an argument name in <code>smooth_estimates()</code>, a new function that I hope will allow development to proceed more quickly and make it easier to maintain code and extend functionality to cover a greater range of model types. Read on to find out more about <code>smooth_estimates()</code> and what else was in these two releases.
</p>
<h2 id="evaluating-smooths-with-smooth_estimates">
Evaluating smooths with <code>smooth_estimates()</code>
</h2>
<p>
For a while now I've realised that the way I'd implemented <code>evaluate_smooth()</code> wasn't great. Some design decisions I took early on added a lot of unnecessary complexity to the function through the handling of factor <code>by</code> smooths, which didn't really work properly in the context of a GAM where the same variable could appear in multiple smooth terms.
</p>
<p>
My original plan was to use a facetted plot for factor <code>by</code> variable smooths, and so when you selected a model term (more on that later), if that term was a factor <code>by</code> smooth, instead of just pulling in a single smooth, I would pull in all of the smooths associated with the factor <code>by</code>. Handling this got complicated and resulted in some kludgy, messy code that was prone to failure when used with a more specialised smooth or a more complex model.
</p>
<p>
Additionally, how I initially implemented selection of model terms was a bit silly; a user could pass a string for a variable that would be matched against the labels that <strong>mgcv</strong> 📦 uses for smooths. Any instance of the term in any smooth would then get selected, which is not usually what is wanted when working with complex models with multiple smooths, some of which might contain the same variable.
</p>
<p>
Because of this, in the summer I decided to completely rewrite <code>evaluate_smooth()</code>. Then I realised this would not be a good idea, as I was going to break a lot of existing code, including code we'd written in support of papers that had been published and which used <code>evaluate_smooth()</code>. Instead, I decided to start from a clean slate with a new function that didn't repeat any of the silly things I'd messed up <code>evaluate_smooth()</code> with, and which would be much simpler to maintain and develop for a wider range of complex distributional models.
</p>
<p>
In writing <code>smooth_estimates()</code> I also came up with a standard way to represent all evaluations of a smooth, regardless of type. The nice thing about this is that it's easy to return a tibble containing all the values of the evaluated smooth for many smooths at once, something you couldn't do with <code>evaluate_smooth()</code>.
</p>
<p>
The idea behind <code>evaluate_smooth()</code> and <code>smooth_estimates()</code> is to return a tibble of values of the smooth evaluated at a grid of <code>n</code> points over each of the covariates involved in that smooth.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'mgcv'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'gratia'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'tidyr'</span><span class="p">)</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_sim</span><span class="p">(</span><span class="s2">"eg1"</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="n">gam_model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x0</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cr"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x2</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x3</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ps"</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span><span class="w">
</span><span class="n">smooth_estimates</span><span class="p">(</span><span class="n">gam_model</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 400 x 9
smooth type by est se x0 x1 x2 x3
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 s(x0) TPRS NA -1.34 0.392 0.000239 NA NA NA
2 s(x0) TPRS NA -1.26 0.366 0.0103 NA NA NA
3 s(x0) TPRS NA -1.19 0.342 0.0204 NA NA NA
4 s(x0) TPRS NA -1.11 0.319 0.0304 NA NA NA
5 s(x0) TPRS NA -1.03 0.298 0.0405 NA NA NA
6 s(x0) TPRS NA -0.956 0.280 0.0506 NA NA NA
7 s(x0) TPRS NA -0.881 0.264 0.0606 NA NA NA
8 s(x0) TPRS NA -0.806 0.250 0.0707 NA NA NA
9 s(x0) TPRS NA -0.733 0.238 0.0807 NA NA NA
10 s(x0) TPRS NA -0.661 0.229 0.0908 NA NA NA
# … with 390 more rows
</figure>
<p>
This seems a little wasteful (all those <code>NA</code> columns 😱) but the output is a consistent way to represent smooths, regardless of the number of covariates etc.
</p>
<p>
I'm toying with returning the tibble in a nested fashion with <code>nest()</code>, something like
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">sm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">smooth_estimates</span><span class="p">(</span><span class="n">gam_model</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">nest</span><span class="p">(</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">est</span><span class="p">,</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">starts_with</span><span class="p">(</span><span class="s1">'x'</span><span class="p">))</span><span class="w">
</span><span class="n">sm</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 4 x 5
smooth type by values data
<chr> <chr> <chr> <list> <list>
1 s(x0) TPRS NA <tibble [100 × 2]> <tibble [100 × 4]>
2 s(x1) CRS NA <tibble [100 × 2]> <tibble [100 × 4]>
3 s(x2) B spline NA <tibble [100 × 2]> <tibble [100 × 4]>
4 s(x3) P spline NA <tibble [100 × 2]> <tibble [100 × 4]>
</figure>
<p>
which I think is much neater, but does require extra steps from the user just to use the output
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">sm</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">unnest</span><span class="p">(</span><span class="n">cols</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">values</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="p">))</span><span class="w"> </span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 400 x 9
smooth type by est se x0 x1 x2 x3
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 s(x0) TPRS NA -1.34 0.392 0.000239 NA NA NA
2 s(x0) TPRS NA -1.26 0.366 0.0103 NA NA NA
3 s(x0) TPRS NA -1.19 0.342 0.0204 NA NA NA
4 s(x0) TPRS NA -1.11 0.319 0.0304 NA NA NA
5 s(x0) TPRS NA -1.03 0.298 0.0405 NA NA NA
6 s(x0) TPRS NA -0.956 0.280 0.0506 NA NA NA
7 s(x0) TPRS NA -0.881 0.264 0.0606 NA NA NA
8 s(x0) TPRS NA -0.806 0.250 0.0707 NA NA NA
9 s(x0) TPRS NA -0.733 0.238 0.0807 NA NA NA
10 s(x0) TPRS NA -0.661 0.229 0.0908 NA NA NA
# … with 390 more rows
</figure>
<p>
Internally, the individual smooths are nested by default, as that makes it easy to join the tibbles for multiple smooths together. As such, the <em>un</em>nested form of the current behaviour requires an explicit extra step within <code>smooth_estimates()</code>.
</p>
<p>
If you have thoughts about this, let me know in the comments below.
</p>
<p>
<code>smooth_estimates()</code> is going to supersede <code>evaluate_smooth()</code>, and currently it can handle pretty much everything that <code>evaluate_smooth()</code> can do. That doesn't mean <code>evaluate_smooth()</code> is going anywhere; as I mentioned above, I don't want to break old code, so as long as it doesn't take too much time to maintain, <code>evaluate_smooth()</code> isn't hurting anyone and there's no rush to put it out to pasture.
</p>
<p>
Version 0.5.0 introduced <code>smooth_estimates()</code>, which could only handle very simple univariate smooths; version 0.5.1 expanded those capabilities. There are a few special smooths that I haven't yet added support for, including Markov random field smooths and soap film smooths. Support for those will be added by the time version 0.6.0 hits CRAN later this year.
</p>
<h2 id="partial-residuals">
Partial residuals
</h2>
<p>
Version 0.4.0 introduced the ability to add partial residuals to plots of smooths. Version 0.5.0 exposes this functionality for computing partial residuals via the new function <code>partial_residuals()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">partial_residuals</span><span class="p">(</span><span class="n">gam_model</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 400 x 4
`s(x0)` `s(x1)` `s(x2)` `s(x3)`
<dbl> <dbl> <dbl> <dbl>
1 -0.236 -1.20 -2.19 0.730
2 0.00545 0.640 -1.79 1.10
3 1.58 1.66 5.59 1.13
4 -1.24 -1.83 -0.892 -0.783
5 -2.21 -0.100 -2.71 -3.10
6 1.27 -1.20 3.93 0.0835
7 -0.599 2.94 -0.793 -1.10
8 1.59 0.402 7.04 2.09
9 2.74 0.449 7.33 2.45
10 1.11 -0.263 0.730 0.703
# … with 390 more rows
</figure>
<p>
The names are currently non-syntactic (hence all the backticks) and I might change that if I can think of a shorthand way to refer to smooths that still allows referencing them uniquely when there are things like factor <code>by</code> smooths involved.
</p>
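<p>
In the meantime, the backticks let you refer to the columns as usual; for example:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">pr <- partial_residuals(gam_model)
## non-syntactic column names need backticks
pr$`s(x0)`
## or, using dplyr
pull(pr, `s(x0)`)</code></pre>
</figure>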
<p>
I also added <code>add_partial_residuals()</code>, which adds the partial residuals to an existing data frame
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">dat</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">add_partial_residuals</span><span class="p">(</span><span class="n">model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gam_model</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 400 x 14
y x0 x1 x2 x3 f f0 f1 f2 f3 `s(x0)`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2.99 0.915 0.0227 0.909 0.402 1.62 0.529 1.05 0.0397 0 -0.236
2 4.70 0.937 0.513 0.900 0.432 3.25 0.393 2.79 0.0630 0 0.00545
3 13.9 0.286 0.631 0.192 0.664 13.5 1.57 3.53 8.41 0 1.58
4 5.71 0.830 0.419 0.532 0.182 6.12 1.02 2.31 2.79 0 -1.24
5 7.63 0.642 0.879 0.522 0.838 10.4 1.80 5.80 2.76 0 -2.21
6 9.80 0.519 0.108 0.160 0.917 10.4 2.00 1.24 7.18 0 1.27
7 10.4 0.737 0.980 0.520 0.798 11.3 1.47 7.10 2.75 0 -0.599
8 12.8 0.135 0.265 0.225 0.503 11.4 0.821 1.70 8.90 0 1.59
9 13.8 0.657 0.0843 0.282 0.254 11.1 1.76 1.18 8.20 0 2.74
10 7.51 0.705 0.386 0.504 0.667 6.50 1.60 2.16 2.74 0 1.11
# … with 390 more rows, and 3 more variables: `s(x1)` <dbl>, `s(x2)` <dbl>,
# `s(x3)` <dbl></code></pre>
</figure>
<p>
but since implementing this I have been questioning whether the implementation is a good thing; there's nothing in the code currently to ensure that the data you provide match the order of the data used to fit the model, so <em>caveat emptor</em>!
</p>
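<p>
Until that's addressed, a cheap sanity check is to compare your data with the model frame stored in the fitted model. This is only a sketch, and it only catches mismatched sizes, not reordered rows:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## the model frame used during fitting is stored in the fitted object
stopifnot(nrow(dat) == nrow(gam_model$model))</code></pre>
</figure>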
<h2 id="penalty-matrices">
Penalty matrices
</h2>
<p>
I've been adding functions to <strong>gratia</strong> that will be helpful when teaching GAMs; I added <code>basis()</code> a while back, and in the 0.5.1 release I added <code>penalty()</code>, for extracting and tidying the penalty matrices of smooths from fitted GAM models.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">penalty</span><span class="p">(</span><span class="n">gam_model</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 324 x 6
smooth type penalty row col value
<chr> <chr> <chr> <chr> <chr> <dbl>
1 s(x0) TPRS s(x0) f1 f1 9.81
2 s(x0) TPRS s(x0) f1 f2 -1.45
3 s(x0) TPRS s(x0) f1 f3 -5.00
4 s(x0) TPRS s(x0) f1 f4 -1.34
5 s(x0) TPRS s(x0) f1 f5 -6.24
6 s(x0) TPRS s(x0) f1 f6 3.90
7 s(x0) TPRS s(x0) f1 f7 -7.74
8 s(x0) TPRS s(x0) f1 f8 -1.79
9 s(x0) TPRS s(x0) f1 f9 0
10 s(x0) TPRS s(x0) f2 f1 -1.45
# … with 314 more rows
</figure>
<p>
There is also a <code>draw()</code> method, to plot the penalty matrix
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">gam_model</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">penalty</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">draw</span><span class="p">()</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/two-new-versions-of-gratia-draw-penalty-1.png" alt="Penalty matrices for smooths from the fitted GAM. Note that in the released version you need to visually flip the y-axis so that diagonal runs top-left to bottom-right to match with how the matrix is actually arranged; this is fixed in the GitHub version." />
<figcaption>
Penalty matrices for smooths from the fitted GAM. Note that in the released version you need to visually flip the y-axis so that the diagonal runs top-left to bottom-right to match how the matrix is actually arranged; this is fixed in the GitHub version.
</figcaption>
</figure>
<p>
It was pointed out that the way this is plotted is not very intuitive if you're trying to map the way the penalty matrix is written to what's shown in the plot; you have to flip the y-axis. This is due to how <code>geom_raster()</code> draws things. I have fixed this, but it's only fixed in the GitHub version of the package, not in a current release.
</p>
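<p>
If you're on the release version, one workaround is to plot the tidy output from <code>penalty()</code> yourself and reverse the y axis; a sketch, assuming the tidy format shown above:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library('ggplot2')
penalty(gam_model) %>%
    ggplot(aes(x = col, y = row, fill = value)) +
    geom_tile() +
    ## reversing the y axis puts the diagonal top-left to bottom-right
    scale_y_discrete(limits = rev) +
    facet_wrap(~ smooth)</code></pre>
</figure>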
<h2 id="colour-scales">
Colour scales
</h2>
<p>
<code>draw.gam()</code> and some related <code>draw()</code> methods now allow you to configure the colour scales used to plot GAMs. Available options include <code>discrete_colour</code>, <code>continuous_colour</code>, and <code>continuous_fill</code>, which take a suitable scale, allowing you to change the colour scheme used:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">dat2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_sim</span><span class="p">(</span><span class="s2">"eg2"</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="n">dist</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"normal"</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">gam_model2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">40</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat2</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span><span class="w">
</span><span class="n">draw</span><span class="p">(</span><span class="n">gam_model2</span><span class="p">,</span><span class="w"> </span><span class="n">n_contour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w">
</span><span class="n">continuous_fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ggplot2</span><span class="o">::</span><span class="n">scale_fill_distiller</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Spectral"</span><span class="p">,</span><span class="w">
</span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"div"</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/two-new-versions-of-gratia-colour-scales-1.png" alt="Changing the fill scale used by draw()" />
<figcaption>
Changing the fill scale used by <code>draw()</code>
</figcaption>
</figure>
<h2 id="constant-and-fun">
<code>constant</code> and <code>fun</code>
</h2>
<p>
<code>draw.gam()</code> can now plot smooths after addition of a constant and transformation via a function. This can be used to put smooths (sort of) on the response scale. For example, in the code below, I add the model intercept to each smooth when plotting
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">b0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">gam_model</span><span class="p">)[</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">draw</span><span class="p">(</span><span class="n">gam_model</span><span class="p">,</span><span class="w"> </span><span class="n">constant</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b0</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/two-new-versions-of-gratia-constant-draw-gam-1.png" alt="Plotting smooths, rescaling the y-axis to include the model intercept term in the scale." />
<figcaption>
Plotting smooths, rescaling the y-axis to include the model intercept term in the scale.
</figcaption>
</figure>
<p>
I plan to add an argument <code>response</code>, which would take a logical to indicate if you want to plot on the response scale. If <code>response = TRUE</code>, it would override anything passed to <code>constant</code> and <code>fun</code>, such that <code>draw.gam()</code> would just do the right thing and figure out from the model what constant and inverse link function to use. Watch out for that in 0.6.0.
</p>
<h2 id="excluding-or-selecting-terms-to-include-in-model-predictions">
Excluding or selecting terms to include in model predictions
</h2>
<p>
<code>predict.gam()</code> allows the user to either exclude or specifically include only selected terms in model predictions. Version 0.5.0 added the same functionality to <code>simulate.gam()</code> and <code>predicted_samples()</code>, allowing you to pass along an <code>exclude</code> or <code>terms</code> argument to <code>predict.gam()</code>, which is used in both of these functions.
</p>
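<p>
For example, to simulate new response data while ignoring the contribution of one smooth, something like this should work (a sketch; the <code>exclude</code> argument is simply passed through to <code>predict.gam()</code>):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## five simulated draws from the model, excluding the effect of s(x3)
sims <- simulate(gam_model, nsim = 5, seed = 42, exclude = "s(x3)")</code></pre>
</figure>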
<h2 id="summary">
Summary
</h2>
<p>
All in all, these are not major changes to the functionality of <strong>gratia</strong>, but the ground work laid in <code>smooth_estimates()</code> should allow me to address lots of the outstanding bugs related to handling complex models and some complex smooth types, and I'm pretty excited about that.
</p>
Extrapolating with B splines and GAMs
Gavin L. Simpson
2020-06-03T13:00:00-06:00
2020-06-03T13:00:00-06:00
https://www.fromthebottomoftheheap.net/2020/06/03/extrapolating-with-gams/
<p>
An issue that often crops up when modelling with generalized additive models (GAMs), especially with time series or spatial data, is how to extrapolate beyond the range of the data used to train the model. The issue arises because GAMs use splines to learn from the data, and the splines themselves are built from basis functions that are typically set up in terms of the data used to fit the model. If there are no basis functions beyond the range of the input data, what exactly is being used if we want to extrapolate? A related issue is that of the wiggliness penalty; depending on the type of basis used, the penalty could extend over the entire real line (-∞, ∞) or only over the range of the input data. In this post I want to take a practical look at the extrapolation behaviour of splines in GAMs fitted with the <strong>mgcv</strong> package for R. In particular, I want to illustrate how flexible the B spline basis is.
</p>
<p>
A lot of what I discuss in this post draws heavily on the help page in <strong>mgcv</strong> for the B spline basis (<code>?mgcv::b.spline</code>) and on a recent email discussion with Alex Hayes, Dave Miller, and Eric Pedersen, though what I write here reflects my own input to that discussion.
</p>
<p>
I was initially minded to look into this again after reading a <a href="https://arxiv.org/abs/2004.11408">new preprint</a> on low-rank approximations to a Gaussian process <span class="citation" data-cites="Riutort-Mayol2020-ih">(GP; Riutort-Mayol et al., 2020)</span>, where, among other things, the authors compare the behaviour of the exact GP model with their low-rank version and with a thin plate regression spline (TPRS). The TPRS is the sort of thing you'd get by default with <strong>mgcv</strong> and <code>s()</code>, but as the other models were all fully Bayesian, the TPRS model was fitted using <code>brm()</code> from the <strong>brms</strong> package so that all the models were comparable, ultimately being fitted in <strong>Stan</strong>. The TPRS model didn't do a very good job of fitting the test observations when extrapolating beyond the limits of the data. I wondered if we could do any better with the B spline basis in <strong>mgcv</strong>, as I knew it had extra flexibility for short extrapolation beyond the data, but I'd never really looked into how it worked or what the respective behaviour was.
</p>
<p>
If you want to recreate elements of the rest of the post, you'll need the following packages installed:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## Packages</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'ggplot2'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'tibble'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'tidyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'mgcv'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'gratia'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'patchwork'</span><span class="p">)</span><span class="w">
</span><span class="c1">## remotes::install_github("clauswilke/colorblindr")</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'colorblindr'</span><span class="p">)</span><span class="w">
</span><span class="c1">## remotes::install_github("clauswilke/relayer")</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'relayer'</span><span class="p">)</span></code></pre>
</figure>
<p>
The last two are used for plotting; the <strong>relayer</strong> package in particular is needed as I'm going to be using two separate colour scales on the plots. If you don't have these installed, you can install them using the <strong>remotes</strong> package and the code in the commented lines above.
</p>
<p>
The example data set used in the comparison had been posted to the preprint's GitHub repo, so it was easy to grab it and start playing. To load the data into R we can use
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">load</span><span class="p">(</span><span class="n">url</span><span class="p">(</span><span class="s2">"https://bit.ly/gprocdata"</span><span class="p">))</span><span class="w">
</span><span class="n">ls</span><span class="p">()</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] "f_true"</code></pre>
</figure>
<p>
where the Bitly short link just links to the <code>.Rdata</code> file stored on GitHub. This creates an object, <code>f_true</code>, in the workspace. We'll look at the true function in a minute. Following the preprint, a data set of noisy observations is simulated from the true function by adding Gaussian noise (μ = 0, σ = 0.2)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">seed</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1234</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span><span class="w">
</span><span class="n">gp_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">truth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unname</span><span class="p">(</span><span class="n">f_true</span><span class="p">),</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.002</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">truth</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">truth</span><span class="p">),</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">))</span></code></pre>
</figure>
<p>
From that noisy set, we sample 250 observations at random, and indicate some of the observations as being in a test set that we won't use when fitting GAMs
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span><span class="w">
</span><span class="n">r_samp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample_n</span><span class="p">(</span><span class="n">gp_data</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">250</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">data_set</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">-0.8</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"test"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">0.8</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"test"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">-0.45</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">-0.36</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"test"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">-0.05</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">0.05</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"test"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">0.45</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">0.6</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"test"</span><span class="p">,</span><span class="w">
</span><span class="kc">TRUE</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"train"</span><span class="p">))</span></code></pre>
</figure>
<p>
Finally we visualize the true function and the noisy observations we sampled from it
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_set</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">truth</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">),</span><span class="w"> </span><span class="n">show.legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Data set"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/extrapolating-with-b-splines-and-gams-plot-noisy-data-1.svg" alt="The true function and noisy observations drawn from it. The blue dots are the training observations that weāll use to fit models, while the red dots are test observations used to investigate how the models interpolate and extrapolate." />
<figcaption>
The true function and noisy observations drawn from it. The blue dots are the training observations that we'll use to fit models, while the red dots are test observations used to investigate how the models interpolate and extrapolate.
</figcaption>
</figure>
<p>
The red points are the test observations and will be used to look at the behaviour of the splines under interpolating and extrapolating conditions.
</p>
<h2 id="thin-plate-splines">
Thin Plate splines
</h2>
<p>
Firstly, we'll look at how the thin plate splines behave under extrapolation, recreating the behaviour from the preprint. I start by fitting two GAMs where we use 50 basis functions (<code>k = 50</code>) from the TPRS basis (<code>bs = "tp"</code>). The argument <code>m</code> controls the order of the derivative penalty; the default is <code>m = 2</code>, for a second derivative penalty (penalising the curvature of the spline). For the second model, we use <code>m = 1</code>, indicating a penalty on the first derivative of the TPRS, which penalises deviations from a flat function. Note that we filter the sample of noisy data to include only the training observations.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m_tprs2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tp"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span><span class="w">
</span><span class="c1">## first order penalty</span><span class="w">
</span><span class="n">m_tprs1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tp"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span></code></pre>
</figure>
<p>
I won't worry about looking at model diagnostics in this post, and will instead skip to looking at how these two models behave when we predict beyond the limits of the training data.
</p>
<p>
Next I define some new observations to predict at from the two models
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">new_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-1.5</span><span class="p">,</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.002</span><span class="p">))</span></code></pre>
</figure>
<p>
Remember, the training data covered the interval -0.8 to 0.8, so we're extrapolating quite far, proportionally, from the support of the training data. Now we can predict from the two models
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p_tprs2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_tprs2</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_tprs_2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_tprs_2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">p_tprs1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_tprs1</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_tprs_1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_tprs_1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span></code></pre>
</figure>
<p>
Note that we have named the two columns of data with some information that we'll need for plotting, so the underscores are important.
</p>
<p>
Next we do some data wrangling to get the predictions into a tidy format suitable for plotting
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">crit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">qnorm</span><span class="p">((</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">0.89</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">new_data_tprs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bind_cols</span><span class="p">(</span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">p_tprs2</span><span class="p">,</span><span class="w"> </span><span class="n">p_tprs1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_longer</span><span class="p">(</span><span class="n">fit_tprs_2</span><span class="o">:</span><span class="n">se_tprs_1</span><span class="p">,</span><span class="w"> </span><span class="n">names_sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'_'</span><span class="p">,</span><span class="w">
</span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'variable'</span><span class="p">,</span><span class="w"> </span><span class="s1">'spline'</span><span class="p">,</span><span class="w"> </span><span class="s1">'order'</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_wider</span><span class="p">(</span><span class="n">names_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">values_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">upr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w"> </span><span class="n">lwr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">))</span></code></pre>
</figure>
<p>
The basic idea here is that we cast the data to a very general long-and-thin version and pull out variables indicating the type of value (<code>fit</code> = fitted and <code>se</code> = standard error), the type of spline, and the order of the penalty, by splitting on the underscores in each of the input column names. Then we cast the long-and-thin data frame to a slightly wider version where we have access to the <code>fit</code> and <code>se</code> variables, before calculating an 89% credible interval on the predicted values.
</p>
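<p>
As a quick sanity check on the interval width (this snippet is just illustrative), the critical value computed above is the 0.945 quantile of the standard normal, roughly 1.598, so these 89% intervals are a little narrower than the familiar ±1.96 standard error intervals
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"># half-width of an 89% interval in standard error units
qnorm((1 - 0.89) / 2, lower.tail = FALSE) # approximately 1.598</code></pre>
</figure>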
<p>
Now we can plot the data plus the predicted values
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_tprs</span><span class="p">,</span><span class="w">
</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w">
</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">order</span><span class="p">),</span><span class="w">
</span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_set</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_tprs</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">colour2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">order</span><span class="p">),</span><span class="w">
</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename_geom_aes</span><span class="p">(</span><span class="n">new_aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"colour"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">,</span><span class="w"> </span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour"</span><span class="p">,</span><span class="w">
</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Data set"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_OkabeIto</span><span class="p">(</span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Penalty"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_OkabeIto</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Penalty"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_cartesian</span><span class="p">(</span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Extrapolating with thin plate splines"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"How behaviour varies with derivative penalties of different order"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/extrapolating-with-b-splines-and-gams-plot-tprs-predictions-1.svg" alt="Posterior predictive means for the two thin plate regression spline models showing the interpolation and extrapolation behaviour with first and second derivative penalties." />
<figcaption>
Posterior predictive means for the two thin plate regression spline models showing the interpolation and extrapolation behaviour with first and second derivative penalties.
</figcaption>
</figure>
<p>
With the default, second derivative penalty we see that under extrapolation the spline exhibits linear behaviour. For the first derivative penalty model, the behaviour is to predict a constant value. The credible intervals are also unrealistically narrow in the case of the TPRS model with the first derivative penalty. Neither model does a particularly good job of estimating any of the test samples outside the range of <em>x</em> in the training data. The models do better when interpolating, except for the section around <em>x</em> = 0.5.
</p>
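<p>
As a reminder of what's being compared (the two TPRS models were fitted earlier in the post), a sketch of those fits would look something like the following, where I'm assuming <code>k = 50</code> to match the later models; for <code>bs = "tp"</code> the argument <code>m</code> directly sets the order of the derivative penalty
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"># second derivative penalty; the default for a 1-D thin plate spline
m_tprs2 <- gam(y ~ s(x, k = 50, bs = "tp", m = 2),
               data = filter(r_samp, data_set == "train"), method = "REML")
# first derivative penalty
m_tprs1 <- gam(y ~ s(x, k = 50, bs = "tp", m = 1),
               data = filter(r_samp, data_set == "train"), method = "REML")</code></pre>
</figure>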
<h2 id="b-splines">
B splines
</h2>
<p>
OK. What about B splines? With the B spline constructor in <strong>mgcv</strong> we have a lot of control over how we set up the basis and the wiggliness penalty. We'll explore more of these options later, but first let's look at the default behaviour, where the penalty only operates over the range of the training observations.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m_bs_default</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning in smooth.construct.bs.smooth.spec(object, dk$data, dk$knots): there is
*no* information about some basis coefficients</code></pre>
</figure>
<p>
Here we asked for a cubic B spline with a second order penalty; this is your common or garden cubic B spline, where the wiggliness penalty only covers the range of <em>x</em> in the training data. Ignore the warning; it arises because we have many basis functions and some aren't supported by any of the data, owing to the holes left by the test observations.
</p>
<p>
If we want the penalty to extend some way beyond the range of <em>x</em>, we need to pass in a set of end points over which knots will be defined: the two extreme end points that enclose the region we want to predict over, and two interior knots that cover the range of the data, plus a little. We specify these knots below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">knots</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">-0.9</span><span class="p">,</span><span class="w"> </span><span class="m">0.9</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span></code></pre>
</figure>
<p>
and then pass <code>knots</code> to the <code>knots</code> argument when fitting the model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m_bs_extrap</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning in smooth.construct.bs.smooth.spec(object, dk$data, dk$knots): there is
*no* information about some basis coefficients</code></pre>
</figure>
<p>
The only difference here is that we have specified that the penalty should extend away from the limits of the training observations. You'll get another warning here; this will always happen when you set outer knots beyond the range of the data, and it is harmless.
</p>
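<p>
If you want to check where the knots actually ended up (a quick, optional check; as far as I'm aware the <code>knots</code> slot is where <strong>mgcv</strong>'s B spline constructor stores the full knot sequence), you can inspect the smooth object directly
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"># the full knot sequence for the B spline basis, including the outer
# knots at -2 and 2 that extend the penalty beyond the data
m_bs_extrap$smooth[[1]]$knots</code></pre>
</figure>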
<p>
We can visualize the differences in the bases using <code>basis()</code> from the <strong>gratia</strong> package
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">bs_default</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">basis</span><span class="p">(</span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)),</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="m">-0.8</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="m">0.8</span><span class="p">))</span><span class="w">
</span><span class="n">bs_extrap</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">basis</span><span class="p">(</span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)),</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data</span><span class="p">)</span><span class="w">
</span><span class="n">lims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lims</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-1.5</span><span class="p">,</span><span class="w"> </span><span class="m">1.5</span><span class="p">))</span><span class="w">
</span><span class="n">vlines</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">geom_vline</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-0.8</span><span class="p">,</span><span class="w"> </span><span class="m">0.8</span><span class="p">)),</span><span class="w">
</span><span class="n">aes</span><span class="p">(</span><span class="n">xintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">),</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"dashed"</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="n">draw</span><span class="p">(</span><span class="n">bs_default</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">lims</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">vlines</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">draw</span><span class="p">(</span><span class="n">bs_extrap</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">lims</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">vlines</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">plot_annotation</span><span class="p">(</span><span class="n">tag_levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'A'</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/extrapolating-with-b-splines-and-gams-visualize-bases-1.svg" alt="Cubic B spline bases with knots covering the range of training observations (A) and with outer knots covering the range of the training data plus the region where we want to extrapolate (B). Using the outer knots has the effect of extending the wiggliness penalty over the region we want to predict for. The dashed lines are drawn at x = -0.8 and x = 0.8, the limits of the training observations." />
<figcaption>
Cubic B spline bases with knots covering the range of training observations (A) and with outer knots covering the range of the training data plus the region where we want to extrapolate (B). Using the outer knots has the effect of extending the wiggliness penalty over the region we want to predict for. The dashed lines are drawn at <em>x</em> = -0.8 and <em>x</em> = 0.8, the limits of the training observations.
</figcaption>
</figcaption>
</figure>
<p>
Technically, the basis functions in the top panel would extend a little into the prediction region, but <code>basis()</code> can't yet handle using one data set to set up the basis and another at which to evaluate it. Because we have basis functions extending over the interval for prediction, the wiggliness penalty can apply in this region too.
</p>
<p>
Now we predict from both the models as before and repeat the data wrangling
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p_bs_default</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_bs_default</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_bs_default</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_bs_default</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">p_bs_extrap</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_bs_extrap</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_bs_extrap</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_bs_extrap</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">new_data_bs_eg</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bind_cols</span><span class="p">(</span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs_default</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs_extrap</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_longer</span><span class="p">(</span><span class="n">fit_bs_default</span><span class="o">:</span><span class="n">se_bs_extrap</span><span class="p">,</span><span class="w"> </span><span class="n">names_sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'_'</span><span class="p">,</span><span class="w">
</span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'variable'</span><span class="p">,</span><span class="w"> </span><span class="s1">'spline'</span><span class="p">,</span><span class="w"> </span><span class="s1">'penalty'</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_wider</span><span class="p">(</span><span class="n">names_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">values_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">upr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w"> </span><span class="n">lwr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">))</span></code></pre>
</figure>
<p>
The only difference here is that I encoded in the variable names whether we used the default penalty or the one extended beyond the limits of the data. We plot the fits with
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_bs_eg</span><span class="p">,</span><span class="w">
</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">penalty</span><span class="p">),</span><span class="w">
</span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_set</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_bs_eg</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">colour2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">penalty</span><span class="p">),</span><span class="w">
</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename_geom_aes</span><span class="p">(</span><span class="n">new_aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"colour"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">,</span><span class="w"> </span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Data set"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_OkabeIto</span><span class="p">(</span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Penalty"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_OkabeIto</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Penalty"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_cartesian</span><span class="p">(</span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Extrapolating with B splines"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"How behaviour varies when the penalty extends beyond the data"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/extrapolating-with-b-splines-and-gams-plot-bs-eg-predictions-1.svg" alt="Posterior predictive means for the two B spline models showing the interpolation and extrapolation behaviour when the penalty only covers the range of the data and when it extends beyond that range." />
<figcaption>
Posterior predictive means for the two B spline models showing the interpolation and extrapolation behaviour when the penalty only covers the range of the data and when it extends beyond that range.
</figcaption>
</figcaption>
</figure>
<p>
As both these models used second derivative penalties, they both extrapolate linearly beyond the range of the training observations. Importantly, however, we get very different behaviour of the credible intervals, especially at the low end of <em>x</em>, where the wide interval is a better representation of the uncertainty we have in the extrapolated predictions. This is better behaviour; at least we're being honest about the uncertainty when extrapolating.
</p>
<h2 id="comparing-different-bases">
Comparing different bases
</h2>
<p>
So far, so uninteresting. Before we get to the good stuff and demonstrate other features of the B spline basis in <strong>mgcv</strong>, let's quickly compare the TPRS and B spline models with a Gaussian process smooth that is designed to closely match the data generating function. Note that this GP is fitted using <strong>mgcv</strong>, where we have to specify the length scale, and as such isn't meant to be directly comparable with either the exact or the low-rank GP models of <span class="citation" data-cites="Riutort-Mayol2020-ih">Riutort-Mayol et al. (2020)</span>.
</p>
<p>
In <strong>mgcv</strong> a GP can be fitted using <code>bs = "gp"</code>. When we do this, the meaning of the <code>m</code> argument changes. Here we are asking for a Matérn covariance function with ν = 3/2 and a length scale of 0.15. These values were chosen to match those of the true function.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m_gp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gp"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">0.15</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span></code></pre>
</figure>
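<p>
Other values of the first element of <code>m</code> select other correlation functions; see <code>?mgcv::smooth.construct.gp.smooth.spec</code> for the full list. As a hypothetical variant, not fitted in this post, a Matérn with ν = 5/2 and the same length scale would be
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"># m[1] = 4 selects a Matérn covariance with nu = 5/2;
# m[2] is again the length scale (range) parameter
m_gp_52 <- gam(y ~ s(x, k = 50, bs = "gp", m = c(4, 0.15)),
               data = filter(r_samp, data_set == "train"), method = "REML")</code></pre>
</figure>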
<p>
Again we have some wrangling to do to pull all these together into an object we can plot easily
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p_bs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_bs_extrap</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">p_tprs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_tprs2</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_tprs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_tprs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">p_gp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_gp</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_gp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_gp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">new_data_bases</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bind_cols</span><span class="p">(</span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">p_tprs</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs</span><span class="p">,</span><span class="w"> </span><span class="n">p_gp</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_longer</span><span class="p">(</span><span class="n">fit_tprs</span><span class="o">:</span><span class="n">se_gp</span><span class="p">,</span><span class="w"> </span><span class="n">names_sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'_'</span><span class="p">,</span><span class="w">
</span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'variable'</span><span class="p">,</span><span class="w"> </span><span class="s1">'spline'</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_wider</span><span class="p">(</span><span class="n">names_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">values_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">upr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w"> </span><span class="n">lwr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">))</span></code></pre>
</figure>
<p>
And finally we plot using
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_bases</span><span class="p">,</span><span class="w">
</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">spline</span><span class="p">),</span><span class="w">
</span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_set</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_bases</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">colour2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">spline</span><span class="p">),</span><span class="w">
</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename_geom_aes</span><span class="p">(</span><span class="n">new_aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"colour"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">,</span><span class="w"> </span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Data set"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_OkabeIto</span><span class="p">(</span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Basis"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_OkabeIto</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Basis"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_cartesian</span><span class="p">(</span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Extrapolating with splines"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"How behaviour varies with different basis types"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning: Ignoring unknown aesthetics: colour2</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/extrapolating-with-b-splines-and-gams-plot-mixed-spline-predictions-1.svg" alt="Posterior predictive means for three GAMs; a thin plate spline with 2nd derivative penalty, a B spline with 2nd derivative penalty extended over the interval for prediction, and a Gaussian process with a Matérn(ν = 3/2) covariance function with length scale = 0.15" />
<figcaption>
Posterior predictive means for three GAMs; a thin plate spline with 2nd derivative penalty, a B spline with 2nd derivative penalty extended over the interval for prediction, and a Gaussian process with a Matérn(ν = 3/2) covariance function with length scale = 0.15
</figcaption>
</figcaption>
</figure>
<p>
Clearly the GP gets closer to the test data when extrapolating, but that's not really a fair comparison as I told the model what the correct length scale was! We could try to estimate the length scale from the data by fitting models over a grid of likely values for the length scale parameter and using the model with the lowest REML score; a minimal sketch of the idea is shown below, and I have fuller example code in the supplements for <span class="citation" data-cites="Simpson2018-frontiers">Simpson (2018)</span> if you're keen.
</p>
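<p>
Here is that sketch (my illustration, not the code from those supplements; the grid of candidate length scales is arbitrary)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"># fit the GP smooth over a grid of candidate length scales and keep
# the fit with the smallest REML score
ls_grid <- seq(0.05, 0.5, by = 0.05)
fits <- lapply(ls_grid, function(ls) {
    gam(y ~ s(x, k = 50, bs = "gp", m = c(3, ls)),
        data = filter(r_samp, data_set == "train"), method = "REML")
})
# with method = "REML" the score is stored in the gcv.ubre component
reml <- vapply(fits, function(fit) as.numeric(fit$gcv.ubre), numeric(1))
m_gp_best <- fits[[which.min(reml)]]</code></pre>
</figure>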
<h2 id="more-with-b-splines">
More with B splines
</h2>
<p>
We're not restricted to using the second derivative penalty with B splines; we can use third, second, first, or even zeroth order penalties with cubic B splines. How does their behaviour vary when interpolating and extrapolating?
</p>
<p>
For convenience I'll just fit all three models with a common format, even though we've already seen and fitted the first model with the second derivative penalty. Notice how we specify the order of the derivative penalty via the second value passed to the argument <code>m</code>; <code>m = c(3, 1)</code> gives a first derivative penalty, <code>m = c(3, 0)</code> a zeroth derivative penalty, etc.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m_bs_2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span><span class="w">
</span><span class="n">m_bs_1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span><span class="w">
</span><span class="n">m_bs_0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span></code></pre>
</figure>
<p>
Again we repeat the data wrangling needed to get something we can plot
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p_bs_2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_bs_2</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_bs_2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_bs_2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">p_bs_1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_bs_1</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_bs_1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_bs_1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">p_bs_0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_bs_0</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_bs_0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_bs_0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">new_data_order</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bind_cols</span><span class="p">(</span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs_2</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs_1</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs_0</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_longer</span><span class="p">(</span><span class="n">fit_bs_2</span><span class="o">:</span><span class="n">se_bs_0</span><span class="p">,</span><span class="w"> </span><span class="n">names_sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'_'</span><span class="p">,</span><span class="w">
</span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'variable'</span><span class="p">,</span><span class="w"> </span><span class="s1">'spline'</span><span class="p">,</span><span class="w"> </span><span class="s1">'order'</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_wider</span><span class="p">(</span><span class="n">names_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">values_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">upr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w"> </span><span class="n">lwr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">))</span></code></pre>
</figure>
<p>
Note again how I'm defining the names of the columns containing fitted values and their standard errors to make it easy to pull out this data during the <code>pivot_longer()</code> step.
</p>
<p>
We plot the predicted values with
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_order</span><span class="p">,</span><span class="w">
</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">order</span><span class="p">),</span><span class="w">
</span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_set</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_order</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">colour2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">order</span><span class="p">),</span><span class="w">
</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename_geom_aes</span><span class="p">(</span><span class="n">new_aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"colour"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">,</span><span class="w"> </span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Data set"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_OkabeIto</span><span class="p">(</span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Penalty"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_OkabeIto</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Penalty"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_cartesian</span><span class="p">(</span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Extrapolating with B splines"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"How behaviour varies with penalties of different order"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/extrapolating-with-b-splines-and-gams-plot-b-spline-diff-penalties-1.svg" alt="Posterior predictive means for three GAMs using B splines with different orders of derivative penalty, all covering the region where we want to predict for the test samples; a B spline with 2nd derivative penalty, a B spline with 1st derivative penalty, and a B spline with zeroth derivative penalty." />
<figcaption>
Posterior predictive means for three GAMs using B splines with different orders of derivative penalty, all covering the region where we want to predict for the test samples; a B spline with 2nd derivative penalty, a B spline with 1st derivative penalty, and a B spline with zeroth derivative penalty.
</figcaption>
</figure>
<p>
The plot shows the different penalties leading to quite a wide range of behaviour. The spline with the zeroth order penalty interpolates poorly, seemingly heading towards the overall mean of the data during each of the test sections within the range of <em>x</em>. When extrapolating, we again see this 'mean reversion' behaviour, which means it does well when extrapolating for large values of <em>x</em>, but it does extremely poorly at the low end of <em>x</em>. The credible intervals for this model are also unrealistically narrow, like those of the TPRS model with 1st derivative penalty that we saw earlier on.
</p>
<p>
The model with the first derivative penalty has reasonable behaviour; it extrapolates as a largely flat function continuing from the minimum and maximum values of <em>x</em>, as with the TPRS fit with a first derivative penalty we saw above, but the credible intervals are much more realistic for the B spline than for the TPRS. Note also that the intervals for the B spline with the first derivative penalty don't explode as quickly as those for the B spline fit with the second derivative penalty.
</p>
<h2 id="multiple-penalties">
Multiple penalties
</h2>
<p>
One final trick that the B spline basis in <strong>mgcv</strong> has up its sleeve is that you can combine multiple penalties in a single spline. We could fit cubic B splines with one, two, three, or even four penalties. The additional penalties are specified by passing more values to <code>m</code>: <code>m = c(3, 2, 1)</code> would give a cubic B spline with both a second derivative and a first derivative penalty, while <code>m = c(3, 2, 1, 0)</code> would get you a cubic spline with all three penalties. You can mix and match as much as you like, with a couple of exceptions:
</p>
<ul>
<li>
you can only have one penalty for each order, so no, you can't penalise one of the derivatives more strongly by adding more than one penalty for it; <code>m = c(3, 2, 2, 1)</code>, for example, <em>isn't</em> allowed, and
</li>
<li>
you can only have values for <code>m[i]</code> (where <code>i</code> > 1) that exist for the given order of B spline, i.e. where <code>m[i] ≤ m[1]</code>.
</li>
</ul>
<p>
In the code below I fit two additional models with mixtures of penalties, and then compare these with the default second derivative penalty (fitted earlier). In each case, I'm again using the <code>knots</code> argument to extend the penalties over the range we might want to predict over.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m_bs_21</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning in smooth.construct.bs.smooth.spec(object, dk$data, dk$knots): there is
*no* information about some basis coefficients</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m_bs_210</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning in smooth.construct.bs.smooth.spec(object, dk$data, dk$knots): there is
*no* information about some basis coefficients</code></pre>
</figure>
<p>
Again, we do the same wrangling, this time encoding the mixtures of orders in the column names
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p_bs_21</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_bs_21</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_bs_21</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_bs_21</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">p_bs_210</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_bs_210</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_bs_210</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_bs_210</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">new_data_multi</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bind_cols</span><span class="p">(</span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs_2</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs_21</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs_210</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_longer</span><span class="p">(</span><span class="n">fit_bs_2</span><span class="o">:</span><span class="n">se_bs_210</span><span class="p">,</span><span class="w"> </span><span class="n">names_sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'_'</span><span class="p">,</span><span class="w">
</span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'variable'</span><span class="p">,</span><span class="w"> </span><span class="s1">'spline'</span><span class="p">,</span><span class="w"> </span><span class="s1">'order'</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_wider</span><span class="p">(</span><span class="n">names_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">values_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">upr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w"> </span><span class="n">lwr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w">
</span><span class="n">penalty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="n">order</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"2"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"2"</span><span class="p">,</span><span class="w">
</span><span class="n">order</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"21"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"2, 1"</span><span class="p">,</span><span class="w">
</span><span class="n">order</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"210"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"2, 1, 0"</span><span class="p">))</span></code></pre>
</figure>
<p>
The last step here uses <code>case_when()</code> to write out nicer labels for the penalties, giving a cleaner legend on the plot, which we produce with
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_multi</span><span class="p">,</span><span class="w">
</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">penalty</span><span class="p">),</span><span class="w">
</span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_set</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_multi</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">colour2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">penalty</span><span class="p">),</span><span class="w">
</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename_geom_aes</span><span class="p">(</span><span class="n">new_aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"colour"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">,</span><span class="w"> </span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Data set"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_OkabeIto</span><span class="p">(</span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Penalty"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_OkabeIto</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Penalty"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_cartesian</span><span class="p">(</span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Extrapolating with B splines"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"How behaviour changes when combining multiple penalties"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/extrapolating-with-b-splines-and-gams-plot-b-spline-with-mixed-penalties-1.svg" alt="Posterior predictive means for three GAMs using B splines with mixtures of derivative penalties, all covering the region where we want to predict for the test samples; a B spline with single 2nd derivative penalty, a B spline with a 2nd and 1st derivative penalties, and a B spline with 2nd, 1st and 0th derivative penalties." />
<figcaption>
Posterior predictive means for three GAMs using B splines with mixtures of derivative penalties, all covering the region where we want to predict for the test samples; a B spline with single 2nd derivative penalty, a B spline with a 2nd and 1st derivative penalties, and a B spline with 2<sup>nd</sup>, 1<sup>st</sup> and 0<sup>th</sup> derivative penalties.
</figcaption>
</figure>
<p>
By mixing the penalties, we blend some of the behaviours of the individual penalties. For example, the weird interpolation behaviour of the B spline with the zeroth derivative penalty is essentially removed when it is combined with the second and first derivative penalties.
</p>
<p>
Given the data, the fits that essentially predict constant functions beyond the range of the data, but with wide credible intervals, are probably the most realistic; in each case where we used a B spline that included a first derivative penalty, the fit has at least covered most of the test observations beyond the range of <em>x</em>.
</p>
<p>
However, in none of the fits do we get behaviour that gets close to fitting the test observations beyond the range of <em>x</em> in the training data, even when using a Gaussian process that supposedly matches at least the general form of the true function.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Riutort-Mayol2020-ih">
<p>
Riutort-Mayol, G., Bürkner, P.-C., Andersen, M. R., Solin, A., and Vehtari, A. (2020). Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming. Available at: <a href="http://arxiv.org/abs/2004.11408">http://arxiv.org/abs/2004.11408</a>.
</p>
</div>
<div id="ref-Simpson2018-frontiers">
<p>
Simpson, G. L. (2018). Modelling palaeoecological time series using generalised additive models. <em>Frontiers in Ecology and Evolution</em> 6, 149. doi:<a href="https://doi.org/10.3389/fevo.2018.00149">10.3389/fevo.2018.00149</a>.
</p>
</div>
</div>
gratia 0.4.1 released
Gavin L. Simpson
2020-05-31T07:00:00-06:00
2020-05-31T07:00:00-06:00
https://www.fromthebottomoftheheap.net/2020/05/31/new-gratia-release/
<p>
After a slight snafu related to the 1.0.0 release of <strong>dplyr</strong>, a new version of <strong>gratia</strong> is out and available on CRAN. This release brings a number of new features, including differences of smooths, partial residuals on partial plots of univariate smooths, and a number of utility functions, while under the hood <strong>gratia</strong> works for a wider range of models that can be fitted by <strong>mgcv</strong>.
</p>
<h3 id="partial-residuals">
Partial residuals
</h3>
<p>
The <code>draw()</code> method for <code>gam()</code> and related models produces partial effects plots. <code>plot.gam()</code> has long had the ability to add partial residuals to partial plots of univariate smooths, and with the latest release <code>draw()</code> can now do so too.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">df1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_sim</span><span class="p">(</span><span class="s2">"eg1"</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">400</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="n">m1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x0</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x3</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df1</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span><span class="w">
</span><span class="n">draw</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">residuals</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/gratia-0-4-1-release-partial-residuals-1.png" alt="Partial plots of estimated smooth functions with partial residuals" />
<figcaption>
Partial plots of estimated smooth functions with partial residuals
</figcaption>
</figure>
<p>
If the estimated functions have the correct degree of wiggliness, the partial residuals should be approximately uniformly distributed about the estimated smooth.
</p>
<h3 id="simulating-data">
Simulating data
</h3>
<p>
The previous example demonstrated another new feature of the latest release: <code>data_sim()</code>. This is a reimplementation of <code>mgcv::gamSim()</code>, which is used to simulate data for testing GAMs. Data can be simulated from several widely-used functions that illustrate the power and capabilities of estimating smooth functions using penalised splines.
</p>
<p>
<code>data_sim()</code> returns simulated data in a tidy fashion and all the various example test data sets are returned in a consistent format. Also, data from the example functions can be simulated from a number of probability distributions. Currently the Gaussian, Poisson, and Bernoulli distributions are supported, but future versions will offer a wider range to simulate from.
</p>
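<p>
To simulate Poisson rather than Gaussian responses, say, we change the distribution via the <code>dist</code> argument. A minimal sketch (check <code>?data_sim</code> for the distributions your installed version supports):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("mgcv")
library("gratia")
## simulate Poisson responses from the Gu and Wahba example functions
df_pois <- data_sim("eg1", n = 400, dist = "poisson", seed = 42)
## fit the same four-smooth model with the matching family
m_pois <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = df_pois,
    family = poisson(), method = "REML")</code></pre>
</figure>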
<p>
For example, the response data modelled above came from the following four functions used by Gu and Wahba
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">df1</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">df1</span><span class="p">)))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">x0</span><span class="o">:</span><span class="n">x3</span><span class="p">,</span><span class="w"> </span><span class="n">f0</span><span class="o">:</span><span class="n">f3</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_longer</span><span class="p">(</span><span class="n">x0</span><span class="o">:</span><span class="n">f3</span><span class="p">,</span><span class="w"> </span><span class="n">names_sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"var"</span><span class="p">,</span><span class="w"> </span><span class="s2">"fun"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_wider</span><span class="p">(</span><span class="n">names_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">var</span><span class="p">,</span><span class="w"> </span><span class="n">values_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">f</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">fun</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/gratia-0-4-1-release-data-sim-1.png" alt="Gu and Wahba four term additive example functions" />
<figcaption>
Gu and Wahba four term additive example functions
</figcaption>
</figure>
<h3 id="difference-smooths">
Difference smooths
</h3>
<p>
When GAMs contain smooth-factor interactions, we often want to compare smooths between levels of the factor to determine how the smooth effects vary between groups. The new release contains a function <code>difference_smooths()</code> that implements this idea.
</p>
<p>
The <strong>mgcv</strong> example for factor-smooth interactions using the <code>by</code> mechanism can be simulated from using <code>data_sim()</code>. The model fitted to the data contains a smooth of covariate <code>x0</code> and a smooth of <code>x2</code> for each level of the factor <code>fac</code>. Note that we need the parametric effect for <code>fac</code> as the <code>by</code> smooths are all centred about 0; the parametric term models the different group means.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_sim</span><span class="p">(</span><span class="s2">"eg4"</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="n">m2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">fac</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x2</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fac</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x0</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span></code></pre>
</figure>
<p>
<code>difference_smooths()</code> returns differences between the smooth functions for all pairs of the levels of <code>fac</code>, plus a credible interval for the difference.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">sm_diffs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">difference_smooths</span><span class="p">(</span><span class="n">m2</span><span class="p">,</span><span class="w"> </span><span class="n">smooth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"s(x2)"</span><span class="p">)</span><span class="w">
</span><span class="n">sm_diffs</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 300 x 9
smooth by level_1 level_2 diff se lower upper x2
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 s(x2) fac 1 2 0.797 0.536 -0.253 1.85 0.00170
2 s(x2) fac 1 2 0.846 0.500 -0.135 1.83 0.0118
3 s(x2) fac 1 2 0.896 0.467 -0.0190 1.81 0.0219
4 s(x2) fac 1 2 0.945 0.435 0.0929 1.80 0.0319
5 s(x2) fac 1 2 0.994 0.405 0.200 1.79 0.0420
6 s(x2) fac 1 2 1.04 0.378 0.302 1.78 0.0521
7 s(x2) fac 1 2 1.09 0.354 0.397 1.78 0.0622
8 s(x2) fac 1 2 1.14 0.332 0.485 1.79 0.0722
9 s(x2) fac 1 2 1.18 0.314 0.566 1.80 0.0823
10 s(x2) fac 1 2 1.22 0.298 0.641 1.81 0.0924
# … with 290 more rows</code></pre>
</figure>
<p>
There is a <code>draw()</code> method for objects returned by <code>difference_smooths()</code>, which will plot the pairwise differences
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">draw</span><span class="p">(</span><span class="n">sm_diffs</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/gratia-0-4-1-release-plot-difference-smooths-1.png" alt="Differences between estimated smooth functions" />
<figcaption>
Differences between estimated smooth functions
</figcaption>
</figure>
<p>
Note that these differences exclude differences in the group means and the differences between smooths are computed on the scale of the link function. A future version will allow for differences that include the group means.
</p>
<h3 id="fitted-values-and-residuals-utility-functions">
Fitted values and residuals utility functions
</h3>
<p>
Two new utility functions are in the current release: <code>add_fitted()</code> and <code>add_residuals()</code>, which add fitted values and residuals, respectively, to a data frame of observations used to fit a model.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">df1</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">add_fitted</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">".fitted"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">add_residuals</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">".resid"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 400 x 12
y x0 x1 x2 x3 f f0 f1 f2 f3 .fitted .resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2.99 0.915 0.0227 0.909 0.402 1.62 0.529 1.05 0.0397 0 2.57 0.419
2 4.70 0.937 0.513 0.900 0.432 3.25 0.393 2.79 0.0630 0 3.91 0.788
3 13.9 0.286 0.631 0.192 0.664 13.5 1.57 3.53 8.41 0 12.9 1.03
4 5.71 0.830 0.419 0.532 0.182 6.12 1.02 2.31 2.79 0 6.57 -0.859
5 7.63 0.642 0.879 0.522 0.838 10.4 1.80 5.80 2.76 0 10.3 -2.67
6 9.80 0.519 0.108 0.160 0.917 10.4 2.00 1.24 7.18 0 9.23 0.571
7 10.4 0.737 0.980 0.520 0.798 11.3 1.47 7.10 2.75 0 11.2 -0.754
8 12.8 0.135 0.265 0.225 0.503 11.4 0.821 1.70 8.90 0 11.0 1.77
9 13.8 0.657 0.0843 0.282 0.254 11.1 1.76 1.18 8.20 0 11.5 2.28
10 7.51 0.705 0.386 0.504 0.667 6.50 1.60 2.16 2.74 0 6.71 0.792
# … with 390 more rows</code></pre>
</figure>
<h3 id="other-changes">
Other changes
</h3>
<p>
This release contains a number of other less-visible changes. <strong>gratia</strong> now handles models fitted by <code>gamm4::gamm4()</code> in more functions than before, while the utility functions <code>link()</code> and <code>inv_link()</code> now work for all families in <strong>mgcv</strong>, including the general family functions and those used for fitting location scale models.
</p>
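<p>
As a quick illustration of the two extractors, here is a minimal sketch using the Gaussian model <code>m1</code> from earlier, for which both functions are the identity
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">lf  <- link(m1)     # extract the link function from the model's family
ilf <- inv_link(m1) # ...and its inverse
ilf(lf(10))         # identity link for a Gaussian model, so returns 10</code></pre>
</figure>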
Rendering your README with GitHub Actions
Gavin L. Simpson
2020-04-30T14:30:00-06:00
2020-04-30T14:30:00-06:00
https://www.fromthebottomoftheheap.net/2020/04/30/rendering-your-readme-with-github-actions/
<p>
There's one thing that has bugged me for a while about developing R packages. We have all these nice, modern tools for tracking our code, producing web sites from the <strong>roxygen</strong> documentation, and so on. Yet for every code commit I make to the master branch of a package repo, there's often two or more additional steps I need to take to keep the package <code>README.md</code> and <em>pkgdown</em> site in sync with the code. Don't get me wrong; it's amazing that we have these tools available to help users get to grips with our R packages. It's just that there's a lot of extra things to remember to do to keep everything up to date. The development of free-to-use services such as Travis CI or Appveyor has been very useful as they can automate many of these repetitive tasks. A more recent newcomer to the field is <a href="https://github.com/features/actions">GitHub Actions</a>. The other day I was grappling with getting a GitHub Actions workflow to render a <code>README.Rmd</code> file to <code>README.md</code> on GitHub, so that I didn't have to do it locally all the time. After a lot of trial and error, this is how I got it working.
</p>
<p>
The general use case I am imagining here is the package author who has a <code>README.Rmd</code> file that contains R code chunks, which they want to render to <code>README.md</code> so it will get displayed nicely on GitHub. You might want to do this to provide a simple overview of how to use some key functionality of your package or show off a plot or two that can be generated by the package. It's pretty easy to render this locally with a <code>Makefile</code> or by simply invoking the correct R incantation directly in the terminal. However, wouldn't it be great if we could automate this!
</p>
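<p>
For reference, the local incantation I have in mind is a one-liner like the following, a sketch that assumes <strong>rmarkdown</strong> is installed and that <code>README.Rmd</code> declares a GitHub-friendly output format (such as <code>github_document</code>) in its YAML header
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## render README.Rmd to README.md using the output format in its YAML header
rmarkdown::render("README.Rmd")</code></pre>
</figure>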
<p>
The first step in getting this working was to recognise that the R Infrastructure organisation has been working to make R-related GitHub Actions workflows available to users. This effort has been led by Jim Hester, and Jim has very helpfully provided a workflow example YAML file showing how one might go about rendering a <code>README.Rmd</code> file to <code>README.md</code> using the <strong>rmarkdown</strong> package.
</p>
<p>
Also, the <strong>usethis</strong> package has made it incredibly easy to get started using GitHub Actions; <strong>usethis</strong> provides <code>use_github_actions()</code> to set your package up to start using GitHub Actions to check your package builds without errors. There's also a <code>use_github_action()</code> function that can add individual workflows from the <code>r-lib/actions</code> repo to your package.
</p>
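<p>
For example, to copy one of the example workflows into your package by name (the workflow file name here is illustrative; check the examples directory of <code>r-lib/actions</code> for the current names)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## copies the named workflow file into .github/workflows/
usethis::use_github_action("render-rmarkdown.yaml")</code></pre>
</figure>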
<p>
If you don't have <strong>usethis</strong> installed, install it (<code>install.packages("usethis")</code>), then you can set your R package repo up to run <code>R CMD check</code> on your package on GitHub's servers by running
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">usethis</span><span class="o">::</span><span class="n">use_github_actions</span><span class="p">()</span></code></pre>
</figure>
<p>
in an R session in the package root folder. Running this will produce something like the following
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">usethis</span><span class="o">::</span><span class="n">use_github_actions</span><span class="p">()</span><span class="w">
</span><span class="err">✔</span><span class="w"> </span><span class="n">Setting</span><span class="w"> </span><span class="n">active</span><span class="w"> </span><span class="n">project</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="s1">'/home/gavin/work/git/gratia/gratia'</span><span class="w">
</span><span class="err">✔</span><span class="w"> </span><span class="n">Creating</span><span class="w"> </span><span class="s1">'.github/'</span><span class="w">
</span><span class="err">✔</span><span class="w"> </span><span class="n">Adding</span><span class="w"> </span><span class="s1">'*.html'</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="s1">'.github/.gitignore'</span><span class="w">
</span><span class="err">✔</span><span class="w"> </span><span class="n">Creating</span><span class="w"> </span><span class="s1">'.github/workflows/'</span><span class="w">
</span><span class="err">✔</span><span class="w"> </span><span class="n">Writing</span><span class="w"> </span><span class="s1">'.github/workflows/R-CMD-check.yaml'</span><span class="w">
</span><span class="err">✔</span><span class="w"> </span><span class="n">Copy</span><span class="w"> </span><span class="n">and</span><span class="w"> </span><span class="n">paste</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="n">lines</span><span class="w"> </span><span class="n">into</span><span class="w"> </span><span class="s1">'/home/gavin/work/git/gratia/gratia/README.md'</span><span class="o">:</span><span class="w">
</span><span class="o"><!--</span><span class="w"> </span><span class="n">badges</span><span class="o">:</span><span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="o">--></span><span class="w">
</span><span class="p">[</span><span class="o">!</span><span class="p">[</span><span class="n">R</span><span class="w"> </span><span class="n">build</span><span class="w"> </span><span class="n">status</span><span class="p">](</span><span class="n">https</span><span class="o">://</span><span class="n">github.com</span><span class="o">/</span><span class="n">gavinsimpson</span><span class="o">/</span><span class="n">gratia</span><span class="o">/</span><span class="n">workflows</span><span class="o">/</span><span class="n">R</span><span class="o">-</span><span class="n">CMD</span><span class="o">-</span><span class="n">check</span><span class="o">/</span><span class="n">badge.svg</span><span class="p">)](</span><span class="n">https</span><span class="o">://</span><span class="n">github.com</span><span class="o">/</span><span class="n">gavinsimpson</span><span class="o">/</span><span class="n">gratia</span><span class="o">/</span><span class="n">actions</span><span class="p">)</span><span class="w">
</span><span class="o"><!--</span><span class="w"> </span><span class="n">badges</span><span class="o">:</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="o">--></span></code></pre>
</figure>
<p>
which outlines the steps <strong>usethis</strong> has taken on your behalf. The final lines print out some text that you can paste into the <code>README.Rmd</code> to show a status badge for the GitHub Action; in this case it will show whether or not your package passed <code>R CMD check</code> without error.
</p>
<p>
This also nicely illustrates how you might set things up by hand, of course, especially if you don't want to run <code>R CMD check</code> on each push.
</p>
<p>
GitHub Actions workflows are described by YAML configuration files that list the steps in the workflow. These files should be located in a <code>.github/workflows</code> folder in the package root. If all you want to do is render a <code>README.Rmd</code> to <code>README.md</code> you could just as easily create this folder yourself. I'm not sure why <strong>usethis</strong> also creates a <code>.gitignore</code> containing <code>*.html</code> in the <code>.github</code> folder, but if this is needed for what you're doing, go ahead and create it too.
</p>
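<p>
If you do go the manual route, creating the folder from R is a one-liner (a trivial sketch of what <strong>usethis</strong> does for you):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## create the nested workflows folder in the package root
dir.create(".github/workflows", recursive = TRUE)</code></pre>
</figure>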
<p>
To get set up quickly to render <code>README.Rmd</code> to markdown, you can now use <code>use_github_action("render-readme.yaml")</code>. This will copy the <code>render-readme.yaml</code> file from <a href="https://github.com/r-lib/actions/tree/master/examples">r-lib/actions/examples</a> to <code>.github/workflows/render-readme.yaml</code>. Alternatively, you can <code>touch .github/workflows/render-readme.yaml</code> and add what you need by hand.
</p>
<p>
This is what the contents of <code>render-readme.yaml</code> look like, at the time of writing, if you used <strong>usethis</strong> to create it:
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"><span class="na">on</span><span class="pi">:</span>
<span class="na">push</span><span class="pi">:</span>
<span class="na">paths</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">README.Rmd</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">Render README</span>
<span class="na">jobs</span><span class="pi">:</span>
<span class="na">render</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">Render README</span>
<span class="na">runs-on</span><span class="pi">:</span> <span class="s">macOS-latest</span>
<span class="na">steps</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v2</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">r-lib/actions/setup-r@v1</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">r-lib/actions/setup-pandoc@v1</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install rmarkdown</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e 'install.packages("rmarkdown")'</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Render README</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e 'rmarkdown::render("README.Rmd")'</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Commit results</span>
<span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">git commit README.md -m 'Re-build README.Rmd' || echo "No changes to commit"</span>
<span class="s">git push origin || echo "No changes to commit"</span></code></pre>
</figure>
<p>
The first bit under <code>on:</code> controls when the workflow is triggered. The way the example workflow is set up means it will only be triggered <em>if</em> a file matching the path <code>README.Rmd</code> is included in the commit when pushed to the repo. It's also worth noting that <em>until the workflow is actually triggered</em>, it won't show up in the <em>Actions</em> tab in your repo on GitHub; this caused me no end of grief until I figured out this GitHub Actions <em>feature</em>. To trigger this workflow, you need to edit <code>README.Rmd</code>, add and commit those changes using <code>git</code>, and then push the changes to GitHub.
</p>
<p>
That didn't suit my use case however; what if I change the package code in such a way that any output or plots produced by code in the <code>README.Rmd</code> would also change? In this case, I would have to needlessly tweak something in <code>README.Rmd</code> and push that change just to trigger rendering.
</p>
<p>
There's probably a better way to do this, such as setting <code>paths:</code> to a wildcard that would match <em>any</em> <code>.R</code> file in the <code>R</code> folder so the workflow would be triggered on any change to the package code (I sketch this below), but to just get something up and running I changed the <code>on:</code> part to read:
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"><span class="na">on</span><span class="pi">:</span>
<span class="na">push</span><span class="pi">:</span>
<span class="na">branches</span><span class="pi">:</span> <span class="s">master</span></code></pre>
</figure>
<p>
which indicates that the workflow should run for any push to the <em>master</em> branch of the repo.
</p>
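<p>
If you did want the wildcard approach I mentioned above instead, something like the following might work; this is a sketch I haven't tested, assuming GitHub's glob syntax where <code>R/**</code> matches any file under the <code>R</code> folder:
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"># untested sketch: trigger when package code or the README source changes
on:
  push:
    paths:
      - 'README.Rmd'
      - 'R/**'</code></pre>
</figure>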
<p>
The top-level <code>name:</code> element is how your workflow will be listed in the Actions tab in your repo. Set this to something short but descriptive so it is easy to filter the various outputs from workflows that are run on the GitHub Actions service.
</p>
<p>
All workflows contain one or more <em>jobs</em>, listed under the <code>jobs:</code> element. In the example YAML file, there is a single job listed as <code>render:</code>, which has a name, <code>Render README</code>.
</p>
<p>
The <code>runs-on</code> element indicates what system the job will be run on; here it is a macOS system. I'm not sure why the <em>r-lib/actions</em> example workflows all run on macOS systems. Anyway, they work, so no need to change that unless you need something specific.
</p>
<p>
The <code>steps:</code> section is where the stages of the job are defined.
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"> <span class="na">steps</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v2</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">r-lib/actions/setup-r@v1</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">r-lib/actions/setup-pandoc@v1</span></code></pre>
</figure>
<p>
Each of the <code>uses:</code> elements pulls in some pre-existing workflow steps that you can build upon to bootstrap the solution you need. For example, the <code>actions/checkout@v2</code> workflow contains everything you need to check out your repo and make it available to the current job. This is pretty fundamental; unless the GitHub Actions service can get at the code in your repo, it won't be able to do anything useful whatsoever.
</p>
<p>
The next two <code>uses:</code> are workflows provided by <em>r-lib/actions</em> that set up a working R installation (<code>r-lib/actions/setup-r@v1</code>) and the <strong>Pandoc</strong> library used by <strong>rmarkdown</strong> (<code>r-lib/actions/setup-pandoc@v1</code>).
</p>
<p>
After the <code>uses:</code> declarations, the YAML file includes a series of steps that describe commands that are run on the service. This is where the real action takes place.
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install rmarkdown</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e 'install.packages("rmarkdown")'</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Render README</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e 'rmarkdown::render("README.Rmd")'</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Commit results</span>
<span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">git commit README.md -m 'Re-build README.Rmd' || echo "No changes to commit"</span>
<span class="s">git push origin || echo "No changes to commit"</span></code></pre>
</figure>
<p>
Here we see three sets of commands that will be run
</p>
<ol type="1">
<li>
the first installs the <strong>rmarkdown</strong> package,
</li>
<li>
the second runs <code>rmarkdown::render()</code> on <code>README.Rmd</code> to render it, and
</li>
<li>
the third commits the rendered <code>README.md</code> file and pushes it to your repo, or echoes a comment if no changes are needed.
</li>
</ol>
<p>
Notice how the <code>run:</code> element for the last step has a <code>|</code> after <code>run:</code>. This indicates that this particular step involves multiple lines of commands to be executed one after another.
</p>
<p>
If you've not come across <code>Rscript</code> before, it's a way to use R like a scripting language, non-interactively. Here we're using the <code>-e</code> flag to tell Rscript what R code to run, rather than passing it a <code>.R</code> file to run.
</p>
<p>
Out of the box, these steps aren't going to be very useful for R package maintainers if the <code>README.Rmd</code> uses anything other than the base R installation and recommended packages. At the very least you are going to want to also install the R package you are documenting in the <code>README.Rmd</code>, plus any other packages you need for the <code>Rmd</code> that might not be dependencies of the package in the repo.
</p>
<p>
In my case, I just needed to install the <strong>gratia</strong> package alongside <strong>rmarkdown</strong>, so I changed that <code>run:</code> element to be
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install rmarkdown</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e 'install.packages(c("rmarkdown", "gratia"))'</span></code></pre>
</figure>
<p>
I also decided to change the <code>rmarkdown::render()</code> call; by default this will generate HTML output by rendering the <code>.Rmd</code> first to <code>.md</code> and thence to <code>.html</code>. As we don't need this latter step, I changed the <code>output_format</code> argument of <code>render()</code> to be <code>"md_document"</code>, so that element now looks like this
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Render README</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e 'rmarkdown::render("README.Rmd", output_format = "md_document")'</span></code></pre>
</figure>
<p>
Doing this means I don't also generate a <code>README.html</code> file (which might be why the <code>.gitignore</code> was created by <strong>usethis</strong> earlier?); keeping the <code>.gitignore</code> can't hurt given that it only excludes any <code>.html</code> files from a commit, so I left it alone.
</p>
<p>
I modified the <em>commit</em> step too. The default assumes you already have a <code>README.md</code> in the repo and that this is the only file you want to add to the commit. If you render any plots in the <code>.Rmd</code>, then you'll also want to add those to the commit. So, I added an explicit <code>git add</code> line prior to the commit, and also simplified the latter
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Commit results</span>
<span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">git add README.md man/figures/README-*</span>
<span class="s">git commit -m 'Re-build README.Rmd' || echo "No changes to commit"</span>
<span class="s">git push origin || echo "No changes to commit"</span></code></pre>
</figure>
<p>
As you can see, I used a wildcard to catch any figures created by the render. In the <code>README.Rmd</code> I used a setup chunk to set the <code>fig.path</code> <strong>knitr</strong> option so that any plots were generated in the <code>man/figures</code> folder and had the prefix <code>README-</code> prepended to the file name:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">knitr</span><span class="o">::</span><span class="n">opts_chunk</span><span class="o">$</span><span class="n">set</span><span class="p">(</span><span class="w">
</span><span class="n">fig.path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"man/figures/README-"</span><span class="w">
</span><span class="p">)</span></code></pre>
</figure>
<p>
The <code>man/figures</code> folder is a useful place to store figures generated like this as they'll be carried along with your R package and available on CRAN, where the <code>README.md</code> file is also displayed if present. This folder is also used if you generate and include figures in the package documentation using <strong>roxygen</strong>, for example.
</p>
<p>
I used the prefix <code>README-</code> so that I could limit what I was adding in the <code>git add</code> step of the workflow. I'm always a bit nervous when staging files for a commit and never use <code>git commit -a</code>, for example. This way I have a reasonable means of only adding plots that were created by rendering <code>README.Rmd</code>.
</p>
<p>
After these changes (and a few others as I was troubleshooting some issues) my workflow to render <code>README.Rmd</code> files looks like this
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"><span class="na">name</span><span class="pi">:</span> <span class="s">render readme</span>
<span class="c1"># Controls when the action will run</span>
<span class="na">on</span><span class="pi">:</span>
<span class="na">push</span><span class="pi">:</span>
<span class="na">branches</span><span class="pi">:</span> <span class="s">master</span>
<span class="na">jobs</span><span class="pi">:</span>
<span class="na">render</span><span class="pi">:</span>
<span class="c1"># The type of runner that the job will run on</span>
<span class="na">runs-on</span><span class="pi">:</span> <span class="s">macOS-latest</span>
<span class="na">steps</span><span class="pi">:</span>
<span class="c1"># Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v2</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">r-lib/actions/setup-r@v1</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">r-lib/actions/setup-pandoc@v1</span>
<span class="c1"># install packages needed</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install required packages</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e 'install.packages(c("rmarkdown","gratia"))'</span>
<span class="c1"># Render README.md using rmarkdown</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">render README</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e 'rmarkdown::render("README.Rmd", output_format = "md_document")'</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">commit rendered README</span>
<span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">git add README.md man/figures/README-*</span>
<span class="s">git commit -m "Re-build README.md" || echo "No changes to commit"</span>
<span class="s">git push origin master || echo "No changes to commit"</span></code></pre>
</figure>
<p>
This is a first pass at getting something working; it's just occurred to me that the <code>git add</code> line probably needs to be linked with the <code>git commit</code> line so it only tries to commit if files were staged with <code>git add</code>.
</p>
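<p>
A minimal sketch of that idea, chaining the two commands with <code>&&</code> so the commit is only attempted if the staging succeeded (untested; the fallback <code>echo</code> still catches the nothing-to-commit case):
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml">    - name: commit rendered README
      run: |
        git add README.md man/figures/README-* && git commit -m "Re-build README.md" || echo "No changes to commit"
        git push origin master || echo "No changes to commit"</code></pre>
</figure>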
<p>
It would also be good to try to cache the installed packages so the workflow doesn't need to install everything for <strong>rmarkdown</strong> and <strong>gratia</strong> every time it is run. There's an example of caching packages in the <strong>pkgdown</strong> action <a href="https://github.com/r-lib/actions/blob/master/examples/pkgdown.yaml">r-lib/actions/examples/pkgdown.yaml</a>. However, I was running into issues related to the R 4.0.0 release and packages in the cache not getting refreshed even though they were out of date. So I removed that step from my <code>pkgdown.yaml</code> workflow, and as a result didn't try to implement it for rendering <code>README.Rmd</code> files. Yet anyway…
</p>
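<p>
For completeness, a hedged sketch of what such a caching step might look like, based on the <code>actions/cache</code> action. This assumes the workflow also sets the <code>R_LIBS_USER</code> environment variable to point at the package library (as the <em>r-lib/actions</em> examples do), and given the staleness issues I mention above, treat it as a starting point only:
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"># sketch only: cache the R package library between workflow runs;
# assumes R_LIBS_USER is set in the workflow's env: section
- name: Cache R packages
  uses: actions/cache@v2
  with:
    path: ${{ env.R_LIBS_USER }}
    key: macOS-r-${{ hashFiles('DESCRIPTION') }}
    restore-keys: macOS-r-</code></pre>
</figure>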
<p>
For reference, the workflow takes between two and three minutes to run on GitHub, even without package caching, which isn't too bad, but rendering the <code>README.Rmd</code> locally takes only a few seconds, so there's lots to be gained here by figuring out a reliable caching mechanism.
</p>
<p>
If you have implemented something similar for a GitHub Actions workflow, let me know in the comments below; this is all quite new to me and I'm interested in how other people might have tackled this. Now that I have this working reliably I only need to remember to <code>git pull</code> from GitHub more often to get the changes to <code>README.md</code>. The next issue I want to look at is getting the right <code>paths:</code> settings so the <code>README.Rmd</code> is rendered only when relevant files are changed in the package, not on every push to the repo.
</p>
<p>
Lastly, a big <strong>thank you</strong> to Jim Hester and everyone else who's contributed to the R-related GitHub Actions workflows. This is an amazingly useful service for the R Community, and I for one am incredibly thankful that we have such helpful and knowledgeable people among us who are doing all this great work to make developing R packages that much easier.
</p>
What evaluating Discovery Grants for the last three years has taught me
Gavin L. Simpson
2020-02-26T00:00:00-06:00
2020-02-26T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2020/02/26/what-three-years-evaluating-discovery-grants-taught-me/
<p>
For the last three years I have been a member of NSERC's Discovery Grant Evaluation Group for Ecology and Evolution (that's 1503 in NSERC-speak). In that time I've evaluated over 130 Discovery Grant submissions, read the same number of Canadian CCVs, and even chaired a few evaluations. This is what I learned, through this process, about writing a successful Discovery Grant.
</p>
<p>
Discovery Grants (hereafter DGs) are an odd fish; they're programme grants, not project grants, intended to fund the next five years of an applicant's research programme in the natural sciences or engineering. They are framed around a few short-term objectives against which streams of activity are proposed to address the long-term goals of the research programme. They describe activities that will be completed by Highly Qualified Personnel (HQP), NSERC-speak for basically anyone that receives training from the applicant and isn't leading their own research programme, and the environment in which, and philosophy by which, that training will take place. Finally, DGs have relatively high success rates, around 60% depending on what group of applicants you fall into, but are typically of low monetary amounts<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>, and as such applicants will rarely get anywhere near the amount of money they request for the work that is budgeted for in the proposal.
</p>
<p>
DGs are evaluated on three key components:
</p>
<ol type="1">
<li>
Excellence of the Researcher (EoR) – how highly is the applicant rated in terms of research excellence, accomplishments, and service?
</li>
<li>
Merit of the Proposal (MoP) – how highly is the applicant's proposed programme of work rated? and
</li>
<li>
Training of HQP (confusingly just HQP) – how highly is the past training record and proposed training plan and philosophy rated?
</li>
</ol>
<p>
Each of these components is assigned a rating (from highest to lowest)
</p>
<ul>
<li>
Exceptional
</li>
<li>
Outstanding
</li>
<li>
Very Strong
</li>
<li>
Strong
</li>
<li>
Moderate
</li>
<li>
Insufficient
</li>
</ul>
<p>
Each rating is described by the Merit Indicators in what NSERC and Evaluation Group members all call “The Grid”. <a href="https://www.nserc-crsng.gc.ca/_doc/Professors-Professeurs/DG_Merit_Indicators_eng.pdf">The Grid</a> itself is a single sheet of paper with brief descriptions of what the Evaluation Group is looking for to assign a proposal to each rating for each of the three components described above. The Grid is supported by the <a href="https://www.nserc-crsng.gc.ca/NSERC-CRSNG/Reviewers-Examinateurs/IntroPRManual-IntroManuelEP_eng.asp">Peer Review Manual</a>, which has fuller descriptions of what Evaluation Group members are looking for when they assign ratings.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/my-trusty-copy-of-the-grid.jpg" alt="My copy of The Grid from the 2020 Discovery Grant Competition week in Ottawa, February 2020" />
<figcaption>
My copy of <em>The Grid</em> from the 2020 Discovery Grant Competition week in Ottawa, February 2020
</figcaption>
</figure>
<p>
The Grid and the merit indicators are exceptionally important in evaluating DGs. They ensure that all applicants are treated fairly and objectively, in the same way. They focus Evaluation Group members' assessments on the criteria that NSERC is interested in, not each member's individual criteria for what makes a good proposal.
</p>
<h2 id="how-we-practically-assess-discovery-grants">
How we practically assess Discovery Grants
</h2>
<p>
If you are familiar with the conference panel review system NSERC uses to evaluate DGs, you might want to skip this next section and <a href="#what-makes-a-good-discovery-grant">jump to the part</a> where I explain what we, as Evaluation Group members, are looking for in a good DG.
</p>
<p>
Typically, each DG is read by five members of the Evaluation Group, known as the <em>Readers</em>, each of whom will ultimately provide ratings for the three assessed components. The final rating for each component is the <em>median</em> of the ratings from the five Readers, and this final rating determines which funding bin the DG application ends up in. Evaluation Group members don't decide how much money each DG is awarded; the dollar amounts attached to each bin are ultimately determined by the NSERC staff in the weeks after the Evaluation Group has concluded its activities, and depend on the available budget for each Evaluation Group.
</p>
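<p>
As this is an R blog, a toy illustration of that median rule; the votes below are made up, and I'm simply treating the ratings as an ordered factor so the median of the five votes can be computed:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## the six possible ratings, lowest to highest
the_grid <- c("Insufficient", "Moderate", "Strong", "Very Strong",
              "Outstanding", "Exceptional")
## five hypothetical Reader votes for one component
votes <- factor(c("Strong", "Very Strong", "Strong", "Outstanding", "Strong"),
                levels = the_grid, ordered = TRUE)
## with five Readers, the median is always one of the six ratings
the_grid[median(as.integer(votes))]
#> [1] "Strong"</code></pre>
</figure>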
<p>
The actual evaluation of each DG takes place during a single week in the middle of February<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>. For 1503, we had three rooms running almost continually from 0830 to 1700 each day, in which the five Readers for each DG would spend 15 minutes discussing the merits of each application before voting their final ratings. Each room has a Chair (an Evaluation Group member who oversees each DG evaluation and facilitates the discussion) and an NSERC Programme Officer (who oversees the process, provides input on areas of procedure and policy, and keeps the whole process on track). The Chair and the Programme Officer are there to ensure that all DGs are treated the same way; 15 minutes for discussion, evaluated in terms of The Grid, etc. From time to time, other Programme Officers, Team Leads, and other NSERC staff that oversee the different Evaluation Groups, and the overall Chair for 1503, would sit in on evaluations for periods of time to ensure fairness across the three 1503 rooms and across the various Evaluation Groups.
</p>
<p>
The actual evaluation is a pretty frenetic affair. At the start of the 15 minute evaluation the name of the applicant is announced and the Chair asks if there are any Delays (valid delays to an applicant's activities, such as parental leave, illness, or caring for a dependent, are taken into account when assessing EoR) and any nomination for a DAS (Discovery Accelerator Supplement). I won't discuss DAS nominations here, but if there is a nomination the five Readers also need to discuss and vote on the DAS nomination within the 15 minute evaluation period; knowing that there is a nomination upfront ensures that the Chair leaves enough time for these additional deliberations.
</p>
<p>
Next each Reader, in turn, gives their preliminary ratings for the three components. Then the first Reader (R1) has 4–5 minutes to justify their ratings. R1 will typically hit upon the main evidence supporting their evaluation and hence has a little longer to make their case. Then R2 has a couple of minutes to explain their scores; R2 will typically focus on areas where they might differ from R1 in terms of their rating, or provide examples of additional factors justifying their own rating if they agree with R1. Usually, the Chair will then briefly intercede, identifying areas of disagreement in the preliminary ratings so that R3, R4, and R5 can focus their brief comments (typically just a minute or 90 seconds each) on any areas of disagreement.
</p>
<p>
Once each Reader has given their comments and justification, the remaining time is given over to discussing areas where the Readers might disagree on the ratings. The aim here is not to come to consensus across the five Readers, but to ensure that sufficient consideration is given to differences of opinion among the Readers. Throughout, the room Chair will be making notes and will facilitate the discussion by referring the Readers to The Grid, trying to focus attention on the specific merit indicators based on their interpretation of the language being used by the Readers. The room Chair also ensures that each Reader has a chance to speak or comment so that everyone's voice is heard. Once we hit 13 or 14 mins, the room Chair will bring the discussion to a close and ask the Readers to vote.
</p>
<p>
Voting is done on one of about eight laptops arranged around the room, and proceeds in private and anonymously; a Reader is not required to stick to their preliminary ratings, but is free to do so if they wish and nobody, not even the Programme Officer, knows the way an individual Reader ultimately votes<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a>. Once all the ratings have been entered, the final score (Outstanding-Strong-Strong for example for EoR-MoP-HQP) is announced by the Programme Officer. If needed, a few house-keeping activities are attended to (such as Messages to Applicants for anyone in receipt of a rating of Moderate or Insufficient on one or more components). Then the whole process starts again for the next applicant, often with one or more Readers changing up and often moving between rooms. If there is a DAS nomination, the whole evaluation described above takes place in about 11 or 12 minutes, leaving a couple of minutes for the DAS discussion and voting before the evaluation concludes.
</p>
<h2 id="what-makes-a-good-discovery-grant">
What makes a good Discovery Grant?
</h2>
<p>
Fifteen minutes doesn't sound like a lot of time to evaluate a proposal, even one as short as a DG. It isn't, but each Evaluation Group member will have spent the previous two months reading each of the DGs they were assigned (I had about 45 DGs to evaluate this year (2020), 4 of which were for other EGs), preparing notes to support their assessments, as well as taking part in calibration exercises. The aim of the 15 minute evaluation is to provide time for Readers to justify their ratings and consider the input of the other Readers before giving their final ratings. So, during the preceding two months, what were Readers looking for when evaluating a DG?
</p>
<p>
The advice below is just that; <em>my advice</em>. Nothing here is official NSERC policy or guidance. Treat what I write below accordingly…
</p>
<h3 id="excellence-of-the-researcher">
Excellence of the Researcher
</h3>
<p>
Here, Evaluation Group members are looking at the applicant and their research and service activities, plus their recognitions and accomplishments.
</p>
<p>
Readers are looking at, <em>inter alia</em>,
</p>
<ul>
<li>
the publication record of the applicant; what papers the applicant has published, where, and what impact they had,
</li>
<li>
where the applicant has presented their work, to whom? Was it an invited talk or a keynote?
</li>
<li>
whether the applicant is on the editorial boards of any journals, has served on committees or for scholarly societies, organized conferences, conference sessions, or workshops, or served as an expert witness,
</li>
<li>
whether the applicant has received any recognitions, awards, etc,
</li>
<li>
the funding record of the applicant,
</li>
<li>
etc.
</li>
</ul>
<p>
We're not bean-counting here; while a lot of this information is gleaned from the Canadian Common CV (CCCV), we're trying to evaluate the <em>quality</em> of research outputs, not the <em>quantity</em>, plus the <em>quality</em> of the service to the research community. In this regard, the applicant can help their Readers by highlighting important outcomes of their work and <em>providing evidence</em> for impact in the <em>Most Significant Contributions</em> section of the proposal. Most importantly, your Readers need <em>evidence</em> of the excellence or impact of your contributions; if you only quote bibliometric data at us, we aren't going to be able to weigh that properly as evidence. Citation rates vary from (sub-)field to (sub-)field and your Readers are not all going to be familiar with the field in which you work. Help them understand how great you are by giving specific examples of impact: if your paper has influenced researchers in broader fields, tell us; if your work led to a new paradigm, explain how; if your work resulted in actionable conservation management outcomes, point out where; if a contribution led to a new collaboration, an invitation to give a talk, or to join a committee or working group, point this out.
</p>
<p>
Readers are not just looking at research activities; service to the research community is equally important, so tell your Readers about the societies you serve, the committees you joined, and the activities you organized or contributed toward.
</p>
<p>
When completing your <em>Most Significant Contributions</em> section, bear in mind that you don't have to give five contributions; that's just the maximum. If you have three themes to your contributions, present this information as three groups of papers/contributions and use the space you're given accordingly.
</p>
<p>
As Evaluation Group members, we're conscious that author order norms are not consistent across disciplines, and that many applicants will have publication records that reflect a high degree of collaboration in their research programme. This is fine and we really do want to give you credit for your contributions, but you need to explain this to us; Evaluation Group members are not allowed to give people the benefit of the doubt about researcher contributions. If you are regularly in the middle of many authors on your papers, or routinely don't take the senior/first author position, then tell us why and explain your contributions to these papers, otherwise we have no evidence you're leading research or what your contribution was.
</p>
<p>
You give this extra background information in the <em>Additional Information on Contributions</em> section of the proposal. Use this section, giving specific examples, to provide additional information on where you publish papers and why, and what your contributions were where this is not clear from typical norms (first/last author, for example). You can reference your CCCV papers by number in this section, e.g. <code>[1]</code>, or <code>[J1]</code> and <code>[C2]</code> if you have both papers and book chapters, but include a note to say what your system is so your Readers know. You don't have a lot of space in this section so use it well; assume your Readers know nothing about you and what author order means in terms of your contributions.
</p>
<h3 id="merit-of-the-proposal">
Merit of the Proposal
</h3>
<p>
In my experience this is the area where many applicants do themselves few favours.
</p>
<p>
It is important to realize that some, if not most, of your Readers are not going to be subject-matter experts in the area you are writing your proposal on. All your Readers will, however, know what constitutes good research design, clear exposition, etc. Write your proposal section with this in mind; you're writing for researchers, but not necessarily someone in your specific sub-field of ecology or evolution.
</p>
<p>
Write clearly and concisely; use your space well.
</p>
<p>
Readers are looking for four main things. First, we're evaluating whether the research you propose is <em>original</em> and <em>innovative</em>, and what we anticipate the <em>impact</em> on the applicant's (sub-)field will be. There's even a section addressing <em>impact</em> that you're asked to add (usually at the end of the <em>Proposal</em> section). Don't oversell the impact of your work; not everything is going to be paradigm changing, but you can help yourself by clearly articulating what you anticipate the impact of this work will be and why.
</p>
<p>
Second, we're looking to see if you have described the long-term goal of your research programme; this is the thing you envision working toward over two or more DG rounds. Readers will also be looking to see whether your short-term objectives are <em>given</em>, whether they are <em>feasible</em>, and <em>how well they mesh with the long-term goal</em>.
</p>
<p>
Short-term objectives are the things you will work on in this DG proposal. As such, Readers need to understand how the objectives will help you make progress in achieving the long-term goal of your programme. We need to see that these objectives are not just clearly described but are <em>planned</em> and <em>well defined</em>. This is where good grant writing can help; the more clearly you articulate what the short-term objectives are and how you intend to achieve them, the more highly you can score on MoP. What theoretical framework are you working under or plan to develop? What are the specific hypotheses you will test? Tie this back into the <em>impact</em> section so we can understand how attaining your objectives will lead to impact and advances in your field/area.
</p>
<p>
The third thing we're looking for is how well the methods you propose to use will enable you to tackle the objectives. If you are doing experiments, tell me how many samples you'll collect, how many replicates (please don't just say <code>n=3</code> and be done with it), what treatment levels you'll use and why those levels. If you're doing observational work, tell me why you want to work where you propose to work, what the pressure gradient is and how you'll measure the pressure. If you're working with species, why those species and not others? Why this system? Why are you using this method over competing methods? How will you analyse your data? (Don't just rattle off a list of stats methods you'll apply!)
</p>
<p>
Think about the appropriateness of the techniques you plan to use because you will have Readers who are familiar with the methods and who will call you out if they are inappropriate or call into question whether you can achieve your objectives.
</p>
<p>
Detail helps, but it has to be balanced with the needs of other areas of the <em>Proposal</em> section. Use detail where needed to hit the Merit Indicators; methods should be <em>clearly described</em> (or <em>clearly defined</em> for Exceptional) and <em>appropriate</em> according to The Grid. Try to think about what a non-expert might need to read in order to assess this.
</p>
<p>
The fourth main area is easy to resolve and doesn't cost you any space in the <em>Proposal</em> section; you can't get money from two or more sources for doing the same thing. The emphasis is on you, the applicant, to explain how what you're asking for in the DG is distinct from other funding sources you hold or have applied for. There is a separate section, <em>Relationship to Other Research Support</em>, where you write to <em>each</em> of the grants <em>in progress</em> on your CCCV and explain how they differ from what you propose to do in the DG. If there is overlap, explain how, and demonstrate why you're not asking for those funds; if you have funding from elsewhere to collect some data that you'll use in support of an activity in the DG, then explain this. Perhaps you have funding for 50 samples but your DG requires 200; state you are asking for an additional <em>150</em> samples in the DG (and why you need these additional samples) and only budget for 150 in the <em>Budget Justification</em> section. All of this also applies to funding you have applied for but, at the time of submitting your DG application, don't have a decision on.
</p>
<p>
If you are holding or applying for CIHR or SSHRC grants you <em>must</em> declare this (there's now a box to tick to indicate that you have or have applied for such funding) <em>and</em> include the required budgetary details and descriptions of the grants. If you tick the box, the Research Portal shouldn't let you submit your DG without attaching the relevant information to your DG application. Check the instructions!
</p>
<p>
This is an incredibly important point. This is one of the few areas of the evaluation where Readers can instantly decide that the entire MoP rates Insufficient (and effectively scupper your grant) regardless of how groundbreaking your proposed research will be. If there's uncertainty, you can be sure Readers will spot it and question it, usually ahead of time so that other NSERC people can be in the room to advise the Readers in their discussions. You <em>really</em> don't want your Readers debating funding overlap instead of the cool science you propose to do; take the time to get this right and don't just say there's no overlap, explain why there isn't!
</p>
<p>
As we're evaluating the MoP, Readers will be looking for where the HQP you propose to train will fit in to the programme. Think carefully about the feasibility and appropriateness of the activities or projects you assign to particular HQP. If you propose to do something that requires a PhD student, don't allocate it to an Honours student!
</p>
<p>
Here are a few more tips for things to do or avoid when writing your <em>Proposal</em> section:
</p>
<ul>
<li>
Don't repeat verbatim things in the <em>Recent Progress</em> section that you've already covered in the <em>Most Significant Contributions</em>. Make reference to the other section as needed.
</li>
<li>
Don't spend too much space on the literature review; Readers and external reviewers will spot if you haven't included recent research or ideas, but we don't need page after page of review. In the proposal section we're evaluating what you plan to do, not what you or someone else already did.
</li>
<li>
Clearly identify which HQP will do which activities. Try your hardest to simplify the way you refer to projects and HQP. Readers are going to have a hard time if you have <em>Project 1a ii)</em> assigned to MSc4, PhD1, and BSc4–10; what was <em>Project 1a ii)</em> again? And what are those BSc people doing, and how are MSc4's and PhD1's contributions different?
</li>
<li>
Do use a figure or table if it helps articulate aspects of the proposed research.
</li>
<li>
Use a number citation system like that used in a Science or Nature paper; it will save you a lot of space given the 5-page limit to the proposal.
</li>
<li>
You can save space on references by referring to your CCCV publications by number and only including those extra references that aren't on the CCCV in the reference list you can supply. A common technique is to state early on that refs 1–33 refer to your CCCV and 34+ are listed on the references page, for example.
</li>
<li>
Don't think you need to have loads of objectives and many projects under each objective; successful proposals can have just a couple of objectives with a couple of well-described projects assigned to each. Sometimes less really is more.
</li>
</ul>
<h3 id="training-of-highly-qualified-personnel">
Training of Highly Qualified Personnel
</h3>
<p>
NSERC, like the other Tri-Agencies, is invested in training highly qualified people, and a successful DG application will have to hit a number of criteria to do well on this rating.
</p>
<p>
There are two areas that Readers consider here:
</p>
<ol type="1">
<li>
the applicant's past track record of training HQP, and
</li>
<li>
the applicant's training philosophy and training plan.
</li>
</ol>
<p>
The past track record speaks to previous HQP that you have trained and the extent to which those HQP have moved on to successful positions that use the skills they learned. Again, this is not a numbers game and quality trumps quantity, but you do need to demonstrate a track record. If you are early in your career and don't have much of a record, be honest and include what you have, including current trainees, on the CCCV. In the <em>Past Contributions to HQP Training</em> section you can explain your training record and point out if you have some past experience, perhaps informally as a post-doc; but remember the mantra and show us the <em>evidence</em>.
</p>
<p>
In your CCCV do indicate where your listed HQP are now and what they are doing. You can also discuss this in the <em>Past Contributions to HQP Training</em> section, highlighting particular past trainees, perhaps to indicate if those trainees got awards or prestigious scholarships. If a trainee withdraws from their programme, don't leave it up to the Reader to infer why; tell us. This section is also a good place to indicate if HQP are publishing and to highlight HQP contributions to those publications on your CCCV. Also give numbers of presentations given by HQP and perhaps highlight an important talk that they gave or a best talk or poster award they may have received.
</p>
<p>
Your past contributions are also assessed in terms of the training environment you provide to HQP; exactly where in the various sections on philosophy, training plans, and past contributions to HQP you put this is up to you, but do describe the environment in which your HQP training takes place and what facilities and opportunities are afforded to HQP that you train. If there is a particularly innovative course or workshop run at your institution, tell your Readers about it.
</p>
<p>
The other half of the rating for HQP is based upon your approach to training (your <em>Training Philosophy</em>) and the training plans for individual HQP. I've already mentioned that it is important to clearly indicate which trainees are doing which aspects of the proposed research, and that you need to assign HQP to appropriate tasks given their career stage. This is where a clear <em>Proposal</em> section that ties in nicely to your <em>HQP Training Plan</em> section can really help you. Don't duplicate extensive information in more than one section, but do refer between the <em>Proposal</em> and the <em>HQP Training Plan</em> sections.
</p>
<p>
The training plans should also include information about how you actively train HQP in the various lab, field, taxonomic, soft, and transferable skills appropriate to your lab or setting. Do you teach data analysis, or science communication? Do you have lab meetings, and how often? Here's where you describe these more generic items that cut across multiple HQP trainees. You need to have information on the individual training plans for specific HQP (this includes the projects they'll do in the <em>Proposal</em>) as well as on these more general skills.
</p>
<p>
Your <em>Training Philosophy</em> refers to your approach to HQP training. Are you hands-on or do you favour a looser working relationship with your HQP? Do you prefer a small lab or a larger lab of trainees? And how do you manage that? Do you have senior HQP (PDFs) helping to train more junior members, for example?
</p>
<p>
Everyone holds lab meetings, helps their HQP publish, and sends HQP to conferences so that they may present their research. What is it that you do that is unique or different?
</p>
<p>
The final component of the HQP section is the EDI (Equity, Diversity, Inclusion) statement, which is new this year as a requirement. It forms part of the <em>training philosophy and training plan</em> half of the HQP rating.
</p>
<p>
What are Readers looking for on EDI? First, we are asked to look for some indication that the applicant understands what the barriers to entry and challenges in recruitment are for underrepresented groups in the applicant's particular field of research <em>and</em> at the applicant's institution. Again, provide evidence to support your assertions; reach out to your Faculty, Research Office, or EDI person/office at your institution to get specific information on challenges at your institution, and consult the literature or relevant scholarly societies for evidence to support your statement regarding your field of research.
</p>
<p>
Secondly, Readers will be looking for specific actions or activities that you have done, and/or will do, to support recruitment of underrepresented groups to your lab and to provide all HQP that you supervise with an inclusive environment for their training. As always, give evidence and be specific, providing detail. Have you taken unconscious bias training? Are HQP positions advertised broadly, with specific attempts to advertise via outlets that specialize in or cater to particular groups? Do you have a Code of Conduct for your lab?
</p>
<p>
In 2019 NSERC asked for the EDI statement to be included in the proposal, though they didn't require it, and many people didn't include anything on EDI in their DG application. This year it is a requirement and there are specific sections on The Grid that Readers can use to evaluate it. It's a soft requirement though; you don't need to include it, but if you don't you'll get an Insufficient rating for that element of the <em>Training Philosophy and Research Training Plan</em> component of the overall HQP rating. That usually won't be enough to pull an applicant down one entire bin (i.e. if everything else had you at a solid Strong for HQP then, all else being equal, the missing EDI statement shouldn't pull you down to Moderate) and also isn't sufficient to render an overall rating of Insufficient for HQP either. Where it can make a difference is if you are borderline for a particular rating: a low Strong rating could be pulled down to a Moderate if all or parts of the EDI criteria are missing, while a high Very Strong could get pulled up to an Outstanding rating if the applicant does a good job with the EDI statement.
</p>
<h3 id="early-career-researchers-and-hqp">
Early career researchers and HQP
</h3>
<p>
A note on ECRs: as Readers, we aren't supposed to consider any element of the DG evaluation <em>in terms of the applicant's career stage</em>. This may seem unfair to ECRs; how could they possibly have any track record of training HQP if they are just starting out in their first academic position<a href="#fn4" class="footnote-ref" id="fnref4"><sup>4</sup></a>? Well, it is unfair, and NSERC recognizes this.
</p>
<p>
It is not uncommon for an ECR to warrant a rating of Insufficient for their record of past HQP training. However, as long as an ECR provides a good training plan and training philosophy section, this will be enough to pull them up to a Moderate rating overall for HQP. For ECRs <em>only</em>, NSERC will fund down to the Strong-Strong-Moderate bin; assuming they rated well on their EoR and MoP sections, an ECR will not be unfairly treated by a non-existent or relatively poor HQP track record.
</p>
<p>
Furthermore, currently NSERC gives ECRs that receive funding an extra $5,000 per year plus a one-time amount (the value of which I can't quite recall just now) to help kick-start their DG careers, plus the option of a sixth year of funding at their level if they wish.
</p>
<p>
Other tips for preparing a good HQP section:
</p>
<ul>
<li>
Do follow the instructions and indicate HQP co-authors with a <code>*</code> on the CCCV.
</li>
<li>
The only presentations you should list on the CCCV are the ones where you were the presenting author.
</li>
<li>
Do <strong>not</strong> list presentations given by HQP as presenting authors on your CCCV, but do indicate if they are co-authors on any of the talks you presented, again with an <code>*</code>.
</li>
<li>
Don't use the <em>Academic Advisor</em> role for HQP to pad your numbers. If you do have a number of trainees where your supervision was not a strict Primary- or Co-supervision role, then you can use this <em>Academic Advisor</em> role, but you must do a good job of explaining your role in the training of those HQP and what particular skills or training you contributed yourself. Don't use this as a way to add HQP to your CCCV where you were on a supervisory committee without justifying this and giving evidence of your contributions, as we all sit on graduate committees. If you went above and beyond as a committee member, then this might be a good reason to include that HQP on your CCCV, but you will need to clearly explain why your supervision was important.
</li>
<li>
It's OK to not have identified HQP by name in the <em>Proposal</em> or <em>HQP Training Plan</em> sections, but do be clear when you refer to particular HQP so Readers can clearly identify who is doing what; use PhD1, MSc2, etc. instead.
</li>
<li>
Training is valued at all levels; it doesn't matter if you haven't trained any PhDs or MScs, as some departments and programmes do not offer graduate degrees. NSERC is fair to all institutions and rewards training activities at all levels.
</li>
</ul>
<h2 id="random-stuff">
Random stuff
</h2>
<p>
I've tried to outline above some of the key areas where DG applicants succeeded or rated poorly over the 130-odd DGs that I evaluated over the past three years. Bear in mind that I'm writing this just after the 2020 competition evaluations; NSERC may change the requirements and instructions in future years, so do confirm details with the NSERC website if you're submitting in November 2020 or later.
</p>
<p>
Here are a few general points that apply broadly when preparing your DG application:
</p>
<ul>
<li>
<p>
Read the instructions! These are currently provided in a poor format on the NSERC website. Do print out the <a href="https://www.nserc-crsng.gc.ca/ResearchPortal-PortailDeRecherche/Instructions-Instructions/DG-SD_eng.asp">Instructions for Completing an Application</a> web page for the DG programme and highlight any specific instructions, as they're often buried in the narrative text. Then be sure to revisit your highlights to ensure that you are doing what NSERC has requested of you.
</p>
</li>
<li>
<p>
Print out <a href="https://www.nserc-crsng.gc.ca/_doc/Professors-Professeurs/DG_Merit_Indicators_eng.pdf">The Grid</a> and refer to it often when preparing your DG application. Write your proposal to The Grid; the terminology might be obtuse and the differences between ratings obscure, but if it asks for things to be <em>evident</em> to get a Strong or <em>clearly evident</em> to get a Very Strong, make sure a reasonable Reader will think you provided <em>clear</em> evidence for a given indicator.
</p>
</li>
<li>
<p>
Read the <a href="https://www.nserc-crsng.gc.ca/NSERC-CRSNG/Reviewers-Examinateurs/IntroPRManual-IntroManuelEP_eng.asp">Peer Review Manual</a>; it's tedious, but it will help you prepare a DG application that is ready for Reader scrutiny if you take into account what it is that your Readers are required to do to assess your application. In particular, read the sections on the Merit Indicators as they provide more detail and nuance to the statements on The Grid.
</p>
</li>
<li>
<p>
The CCCV software is appalling and it takes a long time to prepare a good CCCV for NSERC DGs. Start early and complete it fully, taking into account specific instructions NSERC provides to you.
</p>
</li>
<li>
<p>
There are things you might have included on your generic CCCV that come through to your NSERC one but aren't needed. Don't delete these from the generic CV; instead, print off the final version, go through it, and check whether everything shown needs to be there. If it doesn't, exclude it in the NSERC version (you can un-check any individual entry of the CCCV to stop it being included on the NSERC Researcher CCCV). An example is extensive Journal Reviewer information; all DG applicants review for journals, so a detailed list of reviewing activities might obscure more senior or important contributions, such as reviewing for funding bodies or being on the editorial board of a journal.
</p>
</li>
<li>
<p>
Ask other researchers at your institution and colleagues at other institutions to read your DG application and give you feedback. Also get someone in your Research Office who is responsible for NSERC grants to read through and give you advice.
</p>
</li>
<li>
<p>
Your DG application is primarily evaluated by your five Readers. Those Readers will take into account the external reviews of your application, but your ratings will be primarily based on the Readers' evaluations. Don't be surprised if your final ratings don't mesh with the comments of an overly enthusiastic reviewer, who may not be as familiar with The Grid and the Merit Indicators as your Readers are.
</p>
</li>
</ul>
<h2 id="final-thoughts">
Final thoughts
</h2>
<p>
What struck me most, besides the general excellence of the applicants that I evaluated, is just how much care has gone into ensuring that the process is fair to everyone. As an applicant, your grant is evaluated by five careful and knowledgeable Readers plus at least one external reviewer. The NSERC Programme Officers and other staff are exceptional and take pride in running a process that is fair to everyone given the policy restrictions in play. NSERC DGs value so much more than how many Nature or Science papers you have and how many HQP you've trained. We might disagree over the extent to which the quality of other people, something beyond the applicant's ability to affect, should contribute to the rating of an individual grant, but given the policies that NSERC has pursued, everything that I witnessed during my time on EG 1503 assures me that this is a fair and inclusive process, rewarding a great many excellent researchers in Canada.
</p>
<p>
If you have questions about anything I have written above, please ask in the comments below or drop me an email; I'll do my best to answer them. Also, nothing I wrote above is official NSERC policy; these comments are mine and mine alone, but they do reflect what I have observed and learned in evaluating many DGs these past few years.
</p>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
By way of example, my current DG is $29,000 a year for five years, which includes a top-up as I was an Early Career Researcher (ECR) when I applied, and the top amount possible in 1503 is in the region of $170,000 a year if you can attain the top bin of Exceptional-Exceptional-Exceptional.<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
<li id="fn2">
<p>
For 1503 and a number of Evaluation Groups; other Evaluation Groups meet at different times in February. 1506 (Geoscience) met in the first week of February, and 1507 (Computer Science) met the second week of February, for example.<a href="#fnref2" class="footnote-back">↩</a>
</p>
</li>
<li id="fn3">
<p>
The Programme Officer knows the breakdown of the individual ratings but not the identity of who voted what.<a href="#fnref3" class="footnote-back">↩</a>
</p>
</li>
<li id="fn4">
<p>
ECRs are currently defined as being within five years of their first NSERC eligible position.<a href="#fnref4" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
Pivoting tidily
Gavin L. Simpson
2019-10-25T00:00:00-06:00
2019-10-25T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2019/10/25/pivoting-tidily/
<p>
One of the fun bits of my job is that I have actual time dedicated to helping colleagues and grad students with statistical or computational problems. Recently I've been helping one of our Lab Instructors with some data from their Plant Physiology Lab course. Whilst I was writing some R code to import the raw data for the lab from an Excel sheet, it occurred to me that this would be a good excuse to look at the new <code>pivot_longer()</code> and <code>pivot_wider()</code> functions from the <em>tidyr</em> package. In this post I show how these new functions facilitate common data processing steps; I was personally surprised how little data wrangling was actually needed in the end to read in the data from the lab.
</p>
<p>
In the lab course the students conduct an experiment to study the effect of the plant hormone <em>gibberellin</em> on plant growth. Over a number of weeks the students apply gibberellic acid (in two concentrations) or daminozide, a gibberellic acid antagonist, to the tips of the leaves of pea plants that are grown in a growth chamber with a 16-hour photoperiod. The students work in groups, with some of the groups growing the wild-type cultivar, whilst others work with a mutant dwarf cultivar. Each group has six plants per treatment level, and every seven days the students measure the height of each plant and the number of internodes that each plant has. On the last day of the experiment the plants are harvested and their fresh weight measured.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/plant-phys-plants-in-growth-room.jpg" alt="The pea plants from the 2019 Plant Physiology Lab course, toward the end of the experimental period" />
<figcaption>
The pea plants from the 2019 Plant Physiology Lab course, toward the end of the experimental period
</figcaption>
</figure>
<p>
Originally the data were recorded in a less than satisfactory way; let's just say the original data sheets would have been good candidates for one of Jenny Bryan's talks on spreadsheets. After being cleaned up a bit, we have something that looks like this in Excel
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/pivoting-tidily-raw-data.png" alt="Raw data in the Excel Workbook" />
<figcaption>
Raw data in the Excel Workbook
</figcaption>
</figure>
<p>
This isn't perfect as we have data in the column names (the numbers after the colons are the day of observation) but it is a pretty simple layout for the students to complete, and this is how we decided to ask the students to record the data during the 2019 lab course, so this is what we have to work with going forward.
</p>
<p>
Ultimately we want to be able to refer to columns named <code>height</code>, <code>internodes</code>, etc., depending on the statistical analysis the students will do, and we're going to need a column with the observation days in it.
</p>
<h2 id="pivoting">
Pivoting
</h2>
<p>
If you're not familiar with pivoting, it is important to realize that we can store the same data in a wide rectangle or a long (or tall) rectangle
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/original-dfs-tidy.png" alt="Examples of wide and long representations of the same data. Source: Garrick Aden-Buieās (@grrrck) Tidy Animated Verbs" />
<figcaption>
Examples of <em>wide</em> and <em>long</em> representations of the same data. Source: Garrick Aden-Buie's (<a href="https://twitter.com/grrrck">@grrrck</a>) <a href="https://github.com/gadenbuie/tidyexplain">Tidy Animated Verbs</a>
</figcaption>
</figure>
<p>
The same information is stored in both the long and wide representations, but the two representations differ in how useful they are for certain types of operation or how easily they can be used in a statistical analysis. It's also worth noting that there are more than just long or wide representations of the data; as we'll see shortly, the long representation of the Plant Physiology Lab data is too general and we'll need to arrange the data in a slightly wider form.
</p>
<p>
Moving between long and wide representations is known as <em>pivoting</em>. The animation below shows the general idea of how the cells in one format are rearranged into the other format, with the relevant metadata that doesn't get rearranged being extended or reduced as needed so we don't lose any information.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/tidyr-longer-wider.gif" alt="Pivoting between wide and long representations of the same data. Source: Garrick Aden-Buieās (@grrrck) Tidy Animated Verbs modified by Mara Averick (@dataandme)" />
<figcaption>
Pivoting between <em>wide</em> and <em>long</em> representations of the same data. Source: Garrick Aden-Buie's (<a href="https://twitter.com/grrrck">@grrrck</a>) <a href="https://github.com/gadenbuie/tidyexplain">Tidy Animated Verbs</a> modified by Mara Averick (<a href="https://twitter.com/dataandme">@dataandme</a>)
</figcaption>
</figure>
<p>
With the lab data I showed earlier, we're going to need to pivot from the original wide format into a longer format, just as the animation above shows. As we want to output an object that is <em>longer</em> than the input we will use the <code>pivot_longer()</code> function.
</p>
<p>
To start we will need to import the data from the <code>.xls</code> sheet, which I'll do using the <em>readxl</em> package
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'curl'</span><span class="p">)</span><span class="w"> </span><span class="c1"># download files</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'readxl'</span><span class="p">)</span><span class="w"> </span><span class="c1"># read from Excel sheets</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'tidyr'</span><span class="p">)</span><span class="w"> </span><span class="c1"># data processing</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w"> </span><span class="c1"># mo data processing</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'forcats'</span><span class="p">)</span><span class="w"> </span><span class="c1"># mo mo data processing</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'ggplot2'</span><span class="p">)</span><span class="w"> </span><span class="c1"># plotting</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="c1">## Load Data</span><span class="w">
</span><span class="n">tmp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="n">curl_download</span><span class="p">(</span><span class="s2">"https://github.com/gavinsimpson/plant-phys/raw/master/f18ph.xls"</span><span class="p">,</span><span class="w"> </span><span class="n">tmp</span><span class="p">)</span><span class="w">
</span><span class="n">plant</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_excel</span><span class="p">(</span><span class="n">tmp</span><span class="p">,</span><span class="w"> </span><span class="n">sheet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<p>
We have to download the data first, which I do using <code>curl_download()</code> from the <em>curl</em> package, because <code>read_excel()</code> doesn't currently know how to read from URLs.
</p>
<p>
Now we have our plant data within R, stored in a data frame
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plant</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 24 x 12
treatment cultivar plantid `height:0` `internodes:0` `height:7`
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 control wt 1 235 4 525
2 control wt 2 182 3 391
3 control wt 3 253 3 452
4 control wt 4 151 3 350
5 control wt 5 195 3 335
6 control wt 6 187 4 190
7 ga10 wt 1 250 4 458
8 ga10 wt 2 220 4 345
9 ga10 wt 3 180 2 300
10 ga10 wt 4 230 4 510
# … with 14 more rows, and 6 more variables: `internodes:7` <dbl>,
# `height:14` <dbl>, `internodes:14` <dbl>, `height:21` <dbl>,
# `internodes:21` <dbl>, `freshwt:21` <dbl></code></pre>
</figure>
<p>
To go to the long representation we have to tell <code>pivot_longer()</code> a couple of bits of information
</p>
<ul>
<li>
the name of the object to pivot,
</li>
<li>
which columns contain the data we want to pivot (or alternatively which columns not to pivot if that is easier),
</li>
<li>
the <em>name</em> we want to call the new column that will contain the <em>variable name</em> information from the original data, and
</li>
<li>
optionally, the name of the new column that will contain the data values. The default is to name this column <code>value</code> so you don't need to change this if you're happy with that.
</li>
</ul>
<p>
So, to get our wide plant data into a longer format we would do this
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pivot_longer</span><span class="p">(</span><span class="n">plant</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">),</span><span class="w"> </span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"variable"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 216 x 5
treatment cultivar plantid variable value
<chr> <chr> <dbl> <chr> <dbl>
1 control wt 1 height:0 235
2 control wt 1 internodes:0 4
3 control wt 1 height:7 525
4 control wt 1 internodes:7 5
5 control wt 1 height:14 810
6 control wt 1 internodes:14 10
7 control wt 1 height:21 1090
8 control wt 1 internodes:21 14
9 control wt 1 freshwt:21 7.2
10 control wt 2 height:0 182
# … with 206 more rows</code></pre>
</figure>
<p>
The <code>-(1:3)</code> is short-hand for excluding the first three columns of <code>plant</code> from the pivot. Here, we're creating a new variable called (imaginatively!) <code>variable</code>. As you can see we now have our data in a much longer representation, with a single column containing all of the observations that this group of students made.
</p>
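<p>
If you find the positional short-hand hard to read, the same call can be written with the id columns named explicitly instead; this is just an equivalent restatement of the call above using <em>tidyselect</em>:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## equivalent to pivot_longer(plant, -(1:3), names_to = "variable"),
## naming the id columns we *don't* want to pivot rather than counting them
pivot_longer(plant, cols = -c(treatment, cultivar, plantid),
             names_to = "variable")</code></pre>
</figure>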
<p>
However, we have a bit of a problem: some of the column names contain actual data that we want to use. While we have a column containing this information (it is not lost), the observation day or variable name information is not directly accessible in this format. What we could do is split the strings in this new <code>variable</code> column on <code>":"</code> and form two new columns from there.
</p>
<p>
Thankfully, this is such a common operation that <code>pivot_longer()</code> (and its predecessor, <code>gather()</code>) can do this for you; all you have to do is tell <code>pivot_longer()</code> what character to split on, and what names you want for the columns that result from splitting the strings up.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pivot_longer</span><span class="p">(</span><span class="n">plant</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">),</span><span class="w"> </span><span class="n">names_sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">":"</span><span class="p">,</span><span class="w"> </span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"variable"</span><span class="p">,</span><span class="s2">"day"</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 216 x 6
treatment cultivar plantid variable day value
<chr> <chr> <dbl> <chr> <chr> <dbl>
1 control wt 1 height 0 235
2 control wt 1 internodes 0 4
3 control wt 1 height 7 525
4 control wt 1 internodes 7 5
5 control wt 1 height 14 810
6 control wt 1 internodes 14 10
7 control wt 1 height 21 1090
8 control wt 1 internodes 21 14
9 control wt 1 freshwt 21 7.2
10 control wt 2 height 0 182
# … with 206 more rows</code></pre>
</figure>
<p>
The changes we made above were to specify <code>names_sep</code> with the correct separator, and to pass a vector of new column names to <code>names_to</code> rather than the single name we provided previously.
</p>
<p>
Those of you with good eyes may have noticed another problem that we would encounter if we stopped here. The <code>day</code> variable that was just created is stored as a character vector. It is likely that we'll want this information stored as a number if we're going to analyze the data. We can do the required conversion within the <code>pivot_longer()</code> call by specifying what the developers have started calling a <em>prototype</em> across many of the <em>tidyverse</em> packages. A prototype is an object that has the same properties that you want objects built from that prototype to take. Here we want the <code>day</code> variable as a column of integer numbers, so we set the prototype for this vector to <code>integer()</code> using the <code>names_ptypes</code> argument
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plant</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pivot_longer</span><span class="p">(</span><span class="n">plant</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">),</span><span class="w"> </span><span class="n">names_sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">":"</span><span class="p">,</span><span class="w"> </span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"variable"</span><span class="p">,</span><span class="s2">"day"</span><span class="p">),</span><span class="w">
</span><span class="n">names_ptypes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">day</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">integer</span><span class="p">()))</span><span class="w">
</span><span class="n">plant</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 216 x 6
treatment cultivar plantid variable day value
<chr> <chr> <dbl> <chr> <int> <dbl>
1 control wt 1 height 0 235
2 control wt 1 internodes 0 4
3 control wt 1 height 7 525
4 control wt 1 internodes 7 5
5 control wt 1 height 14 810
6 control wt 1 internodes 14 10
7 control wt 1 height 21 1090
8 control wt 1 internodes 21 14
9 control wt 1 freshwt 21 7.2
10 control wt 2 height 0 182
# … with 206 more rows</code></pre>
</figure>
<p>
Notice that we pass <code>names_ptypes</code> a <em>named</em> list of prototypes, with the list name matching one or more of the variables listed in <code>names_to</code>.
</p>
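<p>
The prototype is only a convenience; a sketch of the equivalent two-step route, assuming we start again from the wide data, is to pivot with <code>day</code> left as character and convert it afterwards:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## pivot first, leaving day as a character column
plant <- pivot_longer(plant, -(1:3), names_sep = ":",
                      names_to = c("variable", "day"))
## then convert day to integer ourselves
plant <- mutate(plant, day = as.integer(day))</code></pre>
</figure>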
<p>
Now we have successfully wrangled the data into a long format and recovered the information hidden in the column names of the original data file. However, as it stands, we can't easily use the data in this format in a statistical model. We want the students on the course to analyze the data to estimate what effects the treatments have on the height of the plants over the course of the experiment. With the data in this long format we don't have a variable <code>height</code> containing just the height of the plants that we can refer to in, say, a linear model.
</p>
<p>
What we want is to create new columns for <code>height</code>, <code>internodes</code> and <code>freshwt</code> and pivot the <code>value</code> data out into those columns. As we're adding columns we're making the data wider, so we can use the <code>pivot_wider()</code> function to do what we want. Now we need to tell <code>pivot_wider()</code>
</p>
<ul>
<li>
where to take the <em>names</em> of the new variables that are going to be created <strong>from</strong>; here that's the <code>variable</code> column, and
</li>
<li>
where to take the <em>data</em> values <strong>from</strong> that are going to be put into these new columns; here, that's the <code>value</code> column
</li>
</ul>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plant</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pivot_wider</span><span class="p">(</span><span class="n">plant</span><span class="p">,</span><span class="w"> </span><span class="n">names_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">values_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">)</span><span class="w">
</span><span class="n">plant</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 96 x 7
treatment cultivar plantid day height internodes freshwt
<chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
1 control wt 1 0 235 4 NA
2 control wt 1 7 525 5 NA
3 control wt 1 14 810 10 NA
4 control wt 1 21 1090 14 7.2
5 control wt 2 0 182 3 NA
6 control wt 2 7 391 5 NA
7 control wt 2 14 615 9 NA
8 control wt 2 21 810 12 3.8
9 control wt 3 0 253 3 NA
10 control wt 3 7 452 6 NA
# … with 86 more rows</code></pre>
</figure>
<p>
As with other <em>tidyverse</em> packages, we don't have to quote the names of the columns we want to pull data from.
</p>
<p>
There are a couple of other things we need to do to make the data fully useful:
</p>
<ol type="1">
<li>
it would be helpful to have a unique identifier for each individual plant; currently the <code>plantid</code> is just the values <code>1:6</code> repeated for each treatment group,
</li>
<li>
it would also be good practice to convert <code>treatment</code> into a factor, and to set the control treatment as the reference level against which the other treatment levels will be compared; if we didn't do that, the <code>b9</code> level (daminozide treatment) would be the reference level
</li>
</ol>
<p>
We can do those data processing steps quite easily now that we have the data imported and arranged nicely the way we want them
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plant</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">plant</span><span class="p">,</span><span class="w">
</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="n">cultivar</span><span class="p">,</span><span class="w"> </span><span class="s2">"_"</span><span class="p">,</span><span class="w"> </span><span class="n">treatment</span><span class="p">,</span><span class="w"> </span><span class="s2">"_"</span><span class="p">,</span><span class="w"> </span><span class="n">plantid</span><span class="p">),</span><span class="w">
</span><span class="n">treatment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fct_relevel</span><span class="p">(</span><span class="n">treatment</span><span class="p">,</span><span class="w"> </span><span class="s1">'control'</span><span class="p">))</span><span class="w">
</span><span class="n">plant</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 96 x 8
treatment cultivar plantid day height internodes freshwt id
<fct> <chr> <dbl> <int> <dbl> <dbl> <dbl> <chr>
1 control wt 1 0 235 4 NA wt_control_1
2 control wt 1 7 525 5 NA wt_control_1
3 control wt 1 14 810 10 NA wt_control_1
4 control wt 1 21 1090 14 7.2 wt_control_1
5 control wt 2 0 182 3 NA wt_control_2
6 control wt 2 7 391 5 NA wt_control_2
7 control wt 2 14 615 9 NA wt_control_2
8 control wt 2 21 810 12 3.8 wt_control_2
9 control wt 3 0 253 3 NA wt_control_3
10 control wt 3 7 452 6 NA wt_control_3
# … with 86 more rows</code></pre>
</figure>
<p>
Here I just pasted together the <code>cultivar</code>, <code>treatment</code> and <code>plantid</code> information into a unique id for each individual plant. This won't be used directly by the students in any analysis they do, as this is a second-year course and they don't know about mixed models (yet), but it is handy to have this <code>id</code> available for plotting. The <code>treatment</code> variable is converted to a factor and the reference level set to <code>"control"</code> using the <code>fct_relevel()</code> function from the <em>forcats</em> package.
</p>
<p>
The students will do one other step before proceeding to look at the data: each sheet in the <code>.xls</code> file contains observations from a single group, and hence a single cultivar, and we want the students to compare cultivars. So they will repeat the steps above to import a second sheet of data containing observations on the cultivar they didn't work with, and then stick the two data sets together; a sketch of that step follows below.
</p>
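<p>
For completeness, here is that skipped step; I'm assuming the second group's data are in sheet 2 of the same workbook and are formatted identically, so adjust the sheet index as needed:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## import the second group's sheet and repeat the wrangling steps
plant2 <- read_excel(tmp, sheet = 2)
plant2 <- pivot_longer(plant2, -(1:3), names_sep = ":",
                       names_to = c("variable", "day"),
                       names_ptypes = list(day = integer()))
plant2 <- pivot_wider(plant2, names_from = variable, values_from = value)
plant2 <- mutate(plant2,
                 id = paste0(cultivar, "_", treatment, "_", plantid),
                 treatment = fct_relevel(treatment, 'control'))
## stick the two data sets together
plant_all <- bind_rows(plant, plant2)</code></pre>
</figure>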
<p>
If you're interested, this is what the data look like, for a single cultivar and single group
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">plant</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">day</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">height</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">treatment</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Height (mm)'</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Day'</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Treatment'</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/pivoting-tidily-plot-1.png" alt="Plot of the plant growth data" />
<figcaption>
Plot of the plant growth data
</figcaption>
</figure>
<p>
(and now you can see why I needed a unique plant identifier even though the students will essentially ignore this clustering in the data when they analyse it.)
</p>
<p>
The <code>.xls</code> file we downloaded at the start of the script contains multiple sheets all formatted the same way, so we could pull all the data into one big analysis if we wanted, but in the lab we're just giving the students one set of wild-type and mutant cultivars. I'm grateful to <a href="https://www.uregina.ca/science/biology/people/instructors/davis-maria.html">Dr. Maria Davis</a>, the lab instructor for the course, for making the data from the course available to anyone who wants to use it; if you do use it, be sure to give Maria and the 2018 cohort of BIOL266 Plant Physiology students at the University of Regina an acknowledgement.
</p>
<p>
If you're interested in the statistical analyses that we'll be getting the students to do in the lab, I have an (at the time of writing this, almost finished) <code>Rmd</code> file in the <a href="https://github.com/gavinsimpson/plant-phys">GitHub repo</a> for the lab course with all the instructions. It's pretty simple ANOVA and ANCOVA analyses, but we do get the students to do <em>post hoc</em> testing using the excellent <em>emmeans</em> package, if you're interested.
</p>
<p>
Finally, none of the data wrangling I did above is that complex, and I certainly didn't need to use <em>tidyr</em> and <em>dplyr</em> etc. to achieve the result I wanted. It is quite trivial to do this pivoting and wrangling in base R; we could just use the <code>reshape()</code> function, <code>strsplit()</code>, etc. However, if you've ever used <code>reshape()</code> you'll know that the argument names for that function make no sense to anyone except perhaps the person that wrote the function. The real advantage of doing the wrangling using <em>tidyr</em> and <em>dplyr</em> is that we end up with code that is much easier to read and understand, which is very important for students on these courses, who will have had little to no exposure to programming and related data science techniques. A rough base R version is sketched below.
</p>
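<p>
Here is roughly what that base R route looks like; a sketch only, assuming <code>plant</code> still holds the original wide data as read from the Excel sheet:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## stack the nine measurement columns into a single value column
cols <- names(plant)[-(1:3)]
long <- reshape(as.data.frame(plant), direction = "long",
                varying = list(cols), v.names = "value",
                timevar = "variable", times = cols,
                idvar = c("treatment", "cultivar", "plantid"))
## recover the variable name and observation day from the old column names
parts <- strsplit(long$variable, ":", fixed = TRUE)
long$day <- as.integer(vapply(parts, `[[`, character(1), 2L))
long$variable <- vapply(parts, `[[`, character(1), 1L)</code></pre>
</figure>
<p>
It works, but compare the argument names of <code>reshape()</code> with those of <code>pivot_longer()</code> and the readability point rather makes itself.
</p>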
<p>
Anyway, happy pivoting!
</p>
radian: a modern console for R
Gavin L. Simpson
2019-06-18T00:00:00-06:00
2019-06-18T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2019/06/18/radian-console-for-r/
<p>
Whenever I'm developing R code or writing data wrangling or analysis scripts for research projects that I work on I use <em>Emacs</em> and its add-on package <a href="https://ess.r-project.org/"><em>Emacs Speaks Statistics</em></a> (<em>ESS</em>). I've done so for nigh on a couple of decades now, ever since I switched full time to running Linux as my daily OS. For years this has served me well, though I wouldn't call myself an <em>Emacs</em> expert; not even close! With a bit of help from some R Core coding standards document I got indentation working how I like it, I learned to contort my fingers in weird and wonderful ways to execute a small set of useful shortcuts, and I even committed some of those shortcuts to memory. More recently, however, my go-to methods for configuring <em>Emacs+ESS</em> were failing; indentation was all over the shop, the smart <code>_</code> stopped working or didn't work as it had for over a decade, syntax highlighting of R-related files, like <code>.Rmd</code>, was hit and miss, and <em>polymode</em> was just a mystery to me. Configuring <em>Emacs+ESS</em> was becoming much more of a chore, and rather unhelpfully, my problems coincided with my having less and less time to devote to tinkering with my computer setups. Also, fiddling with this stuff just wasn't fun any more. So, in a fit of pique following one too many reconfiguration sessions of <em>Emacs+ESS</em>, I went in search of some greener grass. During that search I came across <a href="https://github.com/randy3k/radian">radian</a>, a neat, attractive, simple console for working with R.
</p>
<p>
Written by <a href="https://github.com/randy3k">Randy Lai</a>, <em>radian</em> is a cross-platform console for R that provides code completion, syntax highlighting, etc. in a neat little package that runs in a shell or terminal, such as Bash. I'm someone who fires up multiple terminals every day to run some bit of R code, to show a student how to do something, to quickly check on argument names or such like, or to prepare an answer to a question on <a href="https://stackoverflow.com">stackoverflow</a> or <a href="https://stats.stackexchange.com">crossvalidated</a>. Running R in a terminal after using an IDE/environment like <em>Emacs+ESS</em> or <em>RStudio</em> is an exercise in time travel; all those little helpful editing tools the IDE provides are missing and you're coding like it was the 1980s all over again. <em>radian</em> changes all that.
</p>
<p>
<em>radian</em> is a Python application, so to run it you'll need a Python stack installed. You'll also need a relatively recent version of R (≥ 3.4.0). Using <code>pip</code>, the Python package installer, installing <em>radian</em> is straightforward. Python v3 is recommended, and on Fedora this means I had to install it using
</p>
<pre><code>pip-3 install --user radian</code></pre>
<p>
The <code>--user</code> flag does a user install, which sets the installation location to be inside your home directory. Once installed, you can start <em>radian</em> by simply typing the application name and hitting enter
</p>
<pre><code>radian</code></pre>
<p>
A nice configuration tip included in the <em>radian</em> <code>README.md</code> is to alias the <code>radian</code> command to <code>r</code>, so that running <code>R</code> runs the standard <em>R</em> console, while running <code>r</code> starts <em>radian</em>. On Fedora, you configure this alias in your <code>~/.bashrc</code> file
</p>
<pre><code>alias r="radian"</code></pre>
<p>
Having started <em>radian</em> you'll see something like this
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/radian-startup.png" alt="radian at start-up running in a bash shell on Fedora" />
<figcaption>
<em>radian</em> at start-up running in a <em>bash</em> shell on Fedora
</figcaption>
</figure>
<p>
<em>radian</em> starts up with a simple statement of the R version running in <em>radian</em> and the platform (OS) it's running on; so is it just a less-verbose version of the standard R console? The <em>radian</em> prompt hints at greater capabilities, however.
</p>
<p>
Code completion is a nice addition; yes, you have some form of code completion in the standard R console, but in <em>radian</em> we have a more <em>RStudio</em>- or <em>Emacs+ESS</em>-like experience with a drop-down menu for object, function, argument, and filename completion. To activate this you start typing, hit <kbd>Tab</kbd>, and the relevant completions pop up. Hit <kbd>Tab</kbd> again or press the down cursor and you can scroll through the potential completions.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/radian-completion.png" alt="Code completion in radian" />
<figcaption>
Code completion in <em>radian</em>
</figcaption>
</figure>
<p>
We also get nice syntax highlighting of R code using the colour schemes from <a href="https://help.farbox.com/pygments.html"><em>pygments</em></a>:
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/radian-syntax-highlighting.png" alt="Syntax highlighting in radian using the monokai theme" />
<figcaption>
Syntax highlighting in <em>radian</em> using the <em>monokai</em> theme
</figcaption>
</figure>
<p>
And, if you're copying & pasting code into the terminal, or piping code in from an editor with an embedded terminal (that's running <em>radian</em>), then you also get rather handy multiline editing. Pressing the up cursor <kbd>↑</kbd> will retrieve the previous set of commands pasted or piped into <em>radian</em>, and repeatedly pressing <kbd>↑</kbd> will scroll back through the history. If you want to edit a set of R calls, instead of pressing <kbd>↑</kbd> again, press <kbd>↓</kbd> to enter the chunk of code; then you can move around among the lines using the cursor keys, editing as you see fit. Hitting enter will run the entire chunk of code for you, edits and all:
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/radian-multiline-editing.gif" alt="Multiline editing a ggplot call in radian" />
<figcaption>
Multiline editing a <em>ggplot</em> call in <em>radian</em>
</figcaption>
</figure>
<p>
You can configure aspects of the behaviour of <em>radian</em> via <code>options()</code> in your <code>.Rprofile</code>. The options I'm currently using on the computer used for the screenshots in this post are:
</p>
<pre><code>options(radian.auto_indentation = FALSE)
options(radian.color_scheme = "monokai")</code></pre>
<p>
but on my laptop I'm currently using
</p>
<pre><code># auto match brackets and quotes
options(radian.auto_match = TRUE)
# auto indentation for new line and curly braces
options(radian.auto_indentation = TRUE)
options(radian.tab_size = 4)
# timeout in seconds to cancel completion if it takes too long
# set it to 0 to disable it
options(radian.completion_timeout = 0.05)
# insert new line between prompts
options(radian.insert_new_line = FALSE)</code></pre>
<p>
The last option is something I'm not sure about yet; as you can see in the screenshots, there's a new line between the prompts, which makes it super easy to read the R code you've entered, but with the font I'm currently using (<a href="https://github.com/be5invis/Iosevka">Iosevka</a>) things look a bit too spread out. Setting <code>radian.insert_new_line = FALSE</code>, as I have it on the laptop, results in more standard behaviour but it can feel a little cramped. I'll probably play with both options and see which I like best after a few more weeks of use.
</p>
<p>
You can also define shortcuts. This is useful for entering the assignment operator <code><-</code>, which I have bound to <kbd>Alt</kbd> + <kbd>-</kbd> using
</p>
<pre><code>options(radian.escape_key_map = list(
list(key = "-", value = " <- ")
))</code></pre>
<p>
where I've added spaces around the operator to mimic how the smart underscore works in <em>Emacs+ESS</em>.
</p>
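<p>
The key map is a list, so in principle further bindings can be added in the same way. The second entry below is an assumption on my part about how far the option stretches, not something taken from the <em>radian</em> docs, but it shows the idea for the <em>magrittr</em> pipe:
</p>
<pre><code>options(radian.escape_key_map = list(
    list(key = "-", value = " <- "),  # Alt + - inserts the assignment operator
    list(key = "m", value = " %>% ")  # Alt + m inserts the magrittr pipe
))</code></pre>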
<p>
I'm really liking using <em>radian</em> for the throw-away R sessions that I typically do in a terminal. The only issue I've noticed is that it is a little slow to print tibbles, and clearly it's not going to replace my current IDE; that's not what it is designed for. That said, <em>radian</em> can be run inside any app that can run a terminal, and I've had it running inside <a href="https://code.visualstudio.com/">VS Code</a> for example, which was nice.
</p>
<p>
If you have any comments on <em>radian</em> or other R consoles, let me know what you think below; if you've used <em>radian</em> I'm especially interested in your experience with it.
</p>
Tibbles, checking examples, & character encodings
Gavin L. Simpson
2019-01-22T07:00:00-06:00
2019-01-22T07:00:00-06:00
https://www.fromthebottomoftheheap.net/2019/01/22/using-tibbles-and-example-checking/
<p>
Recently I've been preparing my <a href="https://gavinsimpson.github.io/gratia/"><strong>gratia</strong> package</a> for submission to CRAN. During my pre-flight testing I noticed an issue under Windows when checking the examples in the package against the reference output I generated on Linux. In the latest release of the <a href="https://tibble.tidyverse.org/"><strong>tibble</strong> package</a>, the way tibbles are printed has changed subtly and in a way that leads to cross-platform differences. As I write this, tibbles with more than a set number of rows are printed in a truncated form, showing only the first 10 rows of data. In such cases, a final line is printed with an ellipsis and a note as to how many more rows are in the tibble. It was this ellipsis that was causing the cross-platform issue, where differences between the output generated on Windows and the reference output were being identified during <code>R CMD check</code> on Windows. If this is causing you an issue, here's one way to solve the problem.
</p>
<p>
The problem is this:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'tibble'</span><span class="p">)</span><span class="w">
</span><span class="n">as_tibble</span><span class="p">(</span><span class="n">iris</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# … with 140 more rows</code></pre>
</figure>
<p>
Note that little ellipsis on the last line. Yes, those three little dots; that … was what was causing all the trouble. Don't get me wrong, I'm all on board when it comes to proper typography, but for something so small, that one … caused a good deal of hair-pulling as I prepared my package for a clean submission to CRAN!
</p>
<p>
On Windows you won't see that cute little …; instead you'll see this
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">as_tibble</span><span class="p">(</span><span class="n">iris</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 140 more rows</code></pre>
</figure>
<p>
Yes, a rather ugly, second-rate approximation of the …, I think you'll agree!
</p>
<p>
I have to thank Brodie Gaslam (<a href="https://twitter.com/BrodieGaslam">@BrodieGaslam</a> on Twitter) for identifying the source of the difference between output on Linux and Windows and for suggesting the solution I show below. What Brodie identified was that the <a href="https://github.com/r-lib/cli"><strong>cli</strong></a> package, which <strong>tibble</strong> uses to show this ellipsis, contains code to determine what system it is running on and to adjust its output accordingly. So, on Linux you see <code>…</code> and on Windows you see <code>...</code>, <em>because</em> (I assume) many Windows systems aren't set up to understand what <code>…</code> is. What I see on Linux is thanks to Unicode (specifically I have UTF-8 encoding in my Linux sessions), but this doesn't work (or not as easily) on Windows, which defaults to a different character set or encoding, and which has no idea what <code>…</code> is.
</p>
<p>
As it turns out, there doesn't appear to be a simple way to make Windows understand the Unicode ellipsis, certainly not on the CRAN Windows build system. But what we can do, which is what Brodie mentioned to me on Twitter, is to set a global option that the <strong>cli</strong> package looks for to control its behaviour on <strong>Linux</strong>. That's right, we're going to reduce the output generated under <code>R CMD check</code> to the lowest common denominator; the user will still get the benefit of the fancy typography that <strong>cli</strong> affords their R sessions, but we don't need that fanciness for checking the examples.
</p>
<p>
The option you need is <code>cli.unicode</code> and it needs to be set to <code>FALSE</code>. Here it is in action
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">as_tibble</span><span class="p">(</span><span class="n">iris</span><span class="p">)</span><span class="w">
</span><span class="n">op</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">options</span><span class="p">(</span><span class="n">cli.unicode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">as_tibble</span><span class="p">(</span><span class="n">iris</span><span class="p">)</span><span class="w">
</span><span class="n">options</span><span class="p">(</span><span class="n">op</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# … with 140 more rows
# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 140 more rows</code></pre>
</figure>
<p>
To make this work in an example, you will want to include it in a <code>\dontshow{}</code> block, which will not show up in the help for the function, nor be echoed when a user runs the example via <code>example()</code>; the option gets set during testing via <code>R CMD check</code> without the user ever seeing it.
</p>
<p>
In the <a href="https://github.com/klutometis/roxygen"><strong>roxygen2</strong></a> sources for <a href="https://github.com/gavinsimpson/gratia/blob/master/R/derivatives.R#L57-L69">my example</a> I now have
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">@</span><span class="n">examples</span><span class="w">
</span><span class="err">\</span><span class="n">dontshow</span><span class="p">{</span><span class="w">
</span><span class="n">op</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">options</span><span class="p">(</span><span class="n">cli.unicode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># do something here</span><span class="w">
</span><span class="err">\</span><span class="n">dontshow</span><span class="p">{</span><span class="n">options</span><span class="p">(</span><span class="n">op</span><span class="p">)}</span></code></pre>
</figure>
<p>
This idiom is required to handle more issues than just this character encoding problem. Most of my examples use simulated data, so I need a <code>set.seed()</code> call inside <code>\dontshow{}</code>. If you are showing the results of any statistical model, you'll already be reducing the number of digits shown in the output via <code>options(digits = 5)</code>, as CRAN gets annoyed if you are checking results to silly levels of precision. The user doesn't need to see any of this; a combined block is sketched below.
</p>
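<p>
Putting those pieces together, a hypothetical <code>@examples</code> block that hides all of this housekeeping from the user might look like the following; the visible example code is elided here:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">@examples
\dontshow{
## everything the user doesn't need to see goes in one hidden block
op <- options(cli.unicode = FALSE, digits = 5)
set.seed(1)
}
# ...the visible example code goes here...
\dontshow{options(op)}</code></pre>
</figure>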
<p>
I should note that I'm not using this checking of example output as a true unit test (there are loads of those in the <code>/tests</code> folder of the package, thank you very much), but I do still think that checking the output of examples against reference output is useful. At the very least it doesn't (usually) hurt to check the output when it's being generated as part of the checks anyway. I also want useful examples, so I tend to show snippets of output as part of the example. Having the comparison between expected and actual output is a handy check on what I'm presenting to the user.
</p>
<p>
Hopefully this is useful to people coming across the same or similar issues with their packages. And thanks again to Brodie for explaining what the problem was.
</p>
What's wrong with software paper preprints on EarthArXiv?
Gavin L. Simpson
2018-12-20T00:00:00-06:00
2018-12-20T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2018/12/20/what-is-wrong-with-software-paper-preprints-at-eartharxiv/
<p>
Via <a href="https://twitter.com/geschichtenpost/status/1075747221625339904">Twitter</a> I recently found out that <a href="https://eartharxiv.github.io/index.html">EarthArXiv</a>, a new preprint server for the geosciences, doesn't accept software paper submissions. Actually, EarthArXiv <a href="https://eartharxiv.github.io/moderation.html">doesn't accept quite a few types of publication</a>; some justifiably, like <em>ad hominem</em> attack pieces, others unjustifiably, like correspondence or opinion pieces. I find this general stance very odd indeed; commentary, editorial or opinion pieces and software papers are accepted in a large number of the general and specialized journals that serve the geoscience field, so why wouldn't EarthArXiv want to host these prior to publication of the version of record in one of those journals?
</p>
<p>
The commentary issue bothers me a lot; there is far too little commentary in the geoscience literature and, unless our field is unlike any other, there is a lot to critique. Yet typically this correspondence never sees the light of day, or is subject to such draconian restrictions on length that the discussants rarely have the opportunity to fully articulate their concerns or defend their positions. And that's assuming the commentary is submitted within the time window allowed by the journal, or that the editor decides to allow the commentary in the first place. Accepting commentary on published, peer-reviewed articles would be a good first step in promoting collegial academic discussion in the literature. Deity knows we need it!
</p>
<p>
Anyway, back to what really annoyed me this morning: not accepting software papers.
</p>
<p>
I was pleased to see that <a href="https://eartharxiv.github.io/moderation.html#software">"EarthArXiv supports scientific software development and citation"</a>. That's good to know, because the impression that I'm left with is that software papers aren't the right sort of thing for EarthArXiv. No reasons for this stance are given beyond the nebulous "Yet, software papers often follow citation standards that differ from research and data papers." Citation standards also differ for data, but data papers are acceptable (which is a good thing!) at EarthArXiv. So what's the problem with software papers?
</p>
<p>
I'd like to know because I'm biased; I write a lot of software that is freely available to the community under permissive open source licences. I'm far from being the only one. If researchers who use my or others' software to analyze their data or prepare their figures can submit preprints to EarthArXiv, why are we barred from submitting preprints about that software? It makes no sense to me.
</p>
<p>
EarthArXiv does give some <a href="https://eartharxiv.github.io/moderation.html#software">useful tips</a> on what you as a software author can do instead:
</p>
<ul>
<li>
Use GitHub – this really should be "Use version control"!!
</li>
<li>
Mint a DOI for the repo on Zenodo
</li>
<li>
Publish a paper in <a href="https://openresearchsoftware.metajnl.com/"><acronym title="Journal of Open Research Software">JORS</acronym></a> or <a href="https://joss.theoj.org/"><acronym title="Journal of Open Source Software">JOSS</acronym></a> – have you ever seen a paper from either of these? I have, and what they do is great, but JOSS, and to a lesser extent JORS, don't publish the kinds of detail one typically finds in a software paper at, say, Methods in Ecology and Evolution, where the reasons behind method choice or implementation details are regularly presented. They then say "You will now have a citable 'paper'" – why the scare-quotes? Do the moderators at EarthArXiv not think such papers are real papers?
</li>
</ul>
<p>
The final bit of advice is:
</p>
<blockquote>
<p>
If you really want a software paper on EarthArXiv such that Earth scientists can find it, then we recommend doing all the above plus writing up a short PDF with some Earth science examples showing off the utility. That EarthArXiv PDF would cite the Journal of Open Source Software report
</p>
</blockquote>
<p>
Isn't that the very definition of a software paper?
</p>
<p>
This leaves me with the impression that the EarthArXiv moderators have a very particular type of software paper in mind and haven't considered – or are not aware of – the broader forms of software papers. One of my software papers is <span class="citation" data-cites="Simpson2007-ya">Simpson (2007)</span>, which describes how to use my <strong>analogue</strong> R package. Another example is <span class="citation" data-cites="Goring2015-vf">Goring et al. (2015)</span>, in which we describe and illustrate how to use the <strong>neotoma</strong> R package to access the eponymous database, <a href="https://www.neotomadb.org/">Neotoma DB</a>. Those are more typical of the software papers that I am familiar with. Significant effort goes into preparing these papers, easily as much as any other type of research paper. Papers like this serve very different needs than those published by JOSS. Neither of those papers was freely available to colleagues (IIRC) during the review process. A preprint on EarthArXiv would have served the community well.
</p>
<p>
It is frustrating in the extreme that papers like the two personal examples above would not be welcome on EarthArXiv.
</p>
<p>
I do hope that the people at EarthArXiv reconsider their stance on software papers and other types of scholarly work, especially commentary pieces.
</p>
<h2 id="references" class="unnumbered">
References
</h2>
<div id="refs" class="references">
<div id="ref-Goring2015-vf">
<p>
Goring, S., Dawson, A., Simpson, G. L., Ram, K., Graham, R. W., Grimm, E. C., et al. (2015). Neotoma: A programmatic interface to the Neotoma paleoecological database. <em>Open Quaternary</em> 1, 1–17. doi:<a href="https://doi.org/10.5334/oq.ab">10.5334/oq.ab</a>.
</p>
</div>
<div id="ref-Simpson2007-ya">
<p>
Simpson, G. L. (2007). Analogue methods in palaeoecology: Using the analogue package. <em>Journal of Statistical Software</em> 22, 1–29.
</p>
</div>
</div>
Confidence intervals for GLMs
Gavin L. Simpson
2018-12-10T08:00:00-06:00
2018-12-10T08:00:00-06:00
https://www.fromthebottomoftheheap.net/2018/12/10/confidence-intervals-for-glms/
<p>
You've estimated a GLM or a related model (GLMM, GAM, etc.) for your latest paper and, like a good researcher, you want to visualise the model and show the uncertainty in it. In general this is done using confidence intervals, typically with 95% coverage. If you remember a little bit of theory from your stats classes, you may recall that such an interval can be produced by adding to and subtracting from the fitted values 2 times their standard error. Unfortunately this only really works like this for a linear model. If I had a dollar (even a Canadian one) for every time I've seen someone present graphs of estimated abundance of some species where the confidence interval includes negative abundances, I'd be rich! Here, following the rule of "if I'm asked more than once I should write a blog post about it!", I'm going to show a simple way to correctly compute a confidence interval for a GLM or a related model.
</p>
<h3 id="why-is-plusminus-two-standard-errors-wrong">
Why is plus/minus two standard errors wrong?
</h3>
<p>
Well, it's not! However, the main reason why people mess up computing confidence intervals for a GLM is that they do all the calculations on the <em>response</em> scale. This results in symmetric intervals on this scale and the very real possibility that the intervals will include values that are nonsensical, like negative abundances and concentrations, or probabilities that are outside the limits of 0 and 1.
</p>
<p>
Think about a Poisson GLM fitted to some species abundance data. In this model there is an implied mean-variance relationship; as the mean count increases so does the variance. In fact, in the Poisson GLM, the mean and variance are the same thing. The implication of this is that as the mean tends to zero, so must the variance. If we had an expected count of zero the variance would also be zero, and our uncertainty about this value would also be zero. However, our model won't ever return expected (fitted) values that are exactly equal to zero; it might yield values that are very close to zero, but never exactly zero. In that case we do have some uncertainty about this fitted value; the uncertainty on the lower end has to logically fit somewhere between the small estimated value and zero, but not exactly zero as we're not creating an interval with 100% coverage.
</p>
<p>
We might also logically expect greater uncertainty above the fitted value, for our upper limit on the confidence interval; we're saying that the true expected abundance is possibly somewhat larger than the fitted value and, due to the mean-variance relationship, a larger fitted value is a larger mean value, which implies a larger variance, and consequently a larger amount of uncertainty above the fitted value than below.
</p>
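<p>
As a quick illustrative sketch of this mean-variance relationship (nothing specific to the data we'll use below), we can simulate Poisson counts for a few means and check that the sample variance tracks the mean
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## illustrative sketch: the sample variance of Poisson draws tracks the mean
set.seed(42)
mu <- c(0.1, 1, 10, 100)
vapply(mu, function(m) var(rpois(100000, lambda = m)), numeric(1))
## each value should be approximately equal to the corresponding mean in `mu`</code></pre>
</figure>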
<p>
Similar arguments can be made for models where there are both upper and lower limits to the response, such as binomial models where the response is a probability bounded between 0 and 1. As the fitted value approaches either boundary the uncertainty about the fitted value in the direction of the boundary gets squished up and the asymmetry of the confidence interval increases.
</p>
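<p>
A small numerical sketch makes this concrete: take a symmetric interval of plus/minus two standard errors on the logit scale (the values here are arbitrary, chosen for illustration) and backtransform it with <code>plogis()</code>, the inverse logit; near the boundary the backtransformed interval is strongly asymmetric
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## illustrative values: a fitted value and standard error on the link scale
eta <- -3
se  <- 1
plogis(c(lwr = eta - 2 * se, fit = eta, upr = eta + 2 * se))
##    lwr    fit    upr
## 0.0067 0.0474 0.2689  -- squashed towards 0 below, stretched above</code></pre>
</figure>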
<p>
To illustrate, I'll use a simple data set on wasp visits to leaves of the Cobra Lily, <em>Darlingtonia californica</em>. The data are on my blog and I've created a short link using bitly.com. If you want to follow along, load the data and some packages as shown
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## packages</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'readr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'tibble'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'ggplot2'</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">wasp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'http://bit.ly/cobralily'</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">wasp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">wasp</span><span class="p">,</span><span class="w"> </span><span class="n">lvisited</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.logical</span><span class="p">(</span><span class="n">visited</span><span class="p">))</span></code></pre>
</figure>
<p>
The experiment used timed censuses of visitations by wasps to leaves of the Cobra Lily. These data come from Gotelli & Ellison's text book <a href="https://global.oup.com/academic/product/a-primer-of-ecological-statistics-9781605350646?cc=ca&lang=en&"><em>A Primer of Ecological Statistics</em></a>. Whether or not a wasp visited a leaf during the census was recorded, along with the height of the leaf from the ground. The aim is to test the hypothesis that the probability of leaf visitation increases with leaf height.
</p>
<p>
Let's jump right in and fit the GLM, a logistic regression model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">mod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="n">lvisited</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">leafHeight</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wasp</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">binomial</span><span class="p">())</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Call:
glm(formula = lvisited ~ leafHeight, family = binomial(), data = wasp)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.18274 -0.46820 -0.23897 -0.08519 1.90573
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.29295 2.16081 -3.375 0.000738 ***
leafHeight 0.11540 0.03655 3.158 0.001591 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 46.105 on 41 degrees of freedom
Residual deviance: 26.963 on 40 degrees of freedom
AIC: 30.963
Number of Fisher Scoring iterations: 6</code></pre>
</figure>
<p>
Now create a basic plot of the data and estimated model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## some data to predict at: 100 values over the range of leafHeight</span><span class="w">
</span><span class="n">ndata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">wasp</span><span class="p">,</span><span class="w"> </span><span class="n">data_frame</span><span class="p">(</span><span class="n">leafHeight</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">leafHeight</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">leafHeight</span><span class="p">),</span><span class="w">
</span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)))</span><span class="w">
</span><span class="c1">## add the fitted values by predicting from the model for the new data</span><span class="w">
</span><span class="n">ndata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">add_column</span><span class="p">(</span><span class="n">ndata</span><span class="p">,</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ndata</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'response'</span><span class="p">))</span><span class="w">
</span><span class="c1">## plot it</span><span class="w">
</span><span class="n">plt</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">ndata</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">leafHeight</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_rug</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">visited</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lvisited</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wasp</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Visited'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Leaf height (cm.)'</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Probability of visitation'</span><span class="p">)</span><span class="w">
</span><span class="n">plt</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/confidence-intervals-for-glms-darlingtonia-plot-fit-1.png" alt="Estimated probability of visitation as a function of leaf height." />
<figcaption>
Estimated probability of visitation as a function of leaf height.
</figcaption>
</figure>
<p>
Next, to illustrate the issue, I'll create the confidence interval the <em>wrong</em> way
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## add standard errors</span><span class="w">
</span><span class="n">ndata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">add_column</span><span class="p">(</span><span class="n">ndata</span><span class="p">,</span><span class="w"> </span><span class="n">wrong_se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ndata</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'response'</span><span class="p">,</span><span class="w">
</span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="o">$</span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="c1">## compute a 95% interval the wrong way</span><span class="w">
</span><span class="n">ndata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">ndata</span><span class="p">,</span><span class="w"> </span><span class="n">wrong_upr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">wrong_se</span><span class="p">),</span><span class="w"> </span><span class="n">wrong_lwr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">wrong_se</span><span class="p">))</span></code></pre>
</figure>
<p>
and plot the resulting interval
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plt</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ndata</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wrong_lwr</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wrong_upr</span><span class="p">),</span><span class="w">
</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/confidence-intervals-for-glms-add-wrong-interval-1.png" alt="Estimated probability of visitation as a function of leaf height with an incorrectly-computed 95% confidence interval superimposed. Notice the interval exceeds the probability limits, 0 and 1." />
<figcaption>
Estimated probability of visitation as a function of leaf height with an incorrectly-computed 95% confidence interval superimposed. Notice the interval exceeds the probability limits, 0 and 1.
</figcaption>
</figure>
<p>
That's problematic because for significant sections of <code>leafHeight</code> our uncertainty interval breaks the laws of probability.
</p>
<p>
So, when creating confidence intervals we should expect asymmetric confidence intervals that respect the physical limits of the values that the response variable can take. If they don't, then you've probably computed them the wrong way.
</p>
<p>
The previous paragraphs walked through a logical reason why confidence intervals are not symmetric on the response scale. There is a theoretical one too: the justification for adding/subtracting two times the standard error is derived for models where the response is conditionally Gaussian, and it doesn't really work properly when the response is not. You only need to realise that a confidence interval that includes impossible values can't possibly have the coverage properties claimed, because some part of it lies in a space of values that just won't ever be observed.
</p>
<h3 id="confidence-intervals-the-right-way">
Confidence intervals the right way
</h3>
<p>
How do we create correct confidence intervals?
</p>
<p>
A simple solution is to create the interval on the scale of the link function and not the response scale. On the link scale, we're essentially treating the model as a fancy linear one anyway; we assume that things are approximately Gaussian here, at least with very large sample sizes. Given that assumption, we can create a confidence interval as the fitted value plus or minus two times the standard error on the link scale, and then use the inverse of the link function to map the fitted values and the upper and lower limits of the interval back on to the response scale.
</p>
<p>
If you paid attention in your stats classes, you might know that the default link for the Poisson GLM is the log link. You might also know that the inverse of taking logs is exponentiation. You may even know that exponentiation is done in R using the <code>exp()</code> function. But what's the inverse of the logit function, which was the link used in our model for leaf visitation? Even if you knew what the correct mathematical function was, would you know what R function to use for this? And I defy most readers to know what the inverse of the complementary log-log link function is, which we could have used instead of the logit link in our model. This problem only gets worse when we start thinking about models that walk and quack like a GLM but aren't really GLMs in the strict sense, but which use families that are outside the usual suspects of the exponential family of distributions.
</p>
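<p>
For reference, base R does expose some of these inverses if you know where to look, though the names are far from obvious; <code>plogis()</code> is the inverse logit, and <code>make.link()</code> will report the inverse of any of the standard links
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## a couple of inverse link functions available in base R
plogis(0)                      # the inverse logit; maps 0 on the link scale to 0.5
make.link("cloglog")$linkinv   # the inverse of the complementary log-log link
## the trick described below generalises this without needing to know any names</code></pre>
</figure>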
<p>
All is not lost, however, as there is a little trick that you can use to always get the correct inverse of the link function used in a model. (Well, <em>always</em> is a bit strong; the model needs to follow standard R conventions and accept a <code>family</code> argument and return the <code>family</code> inside the fitted model object.)
</p>
<p>
Typically in R, functions that fit generalized models take a <code>family</code> argument and return a <code>family</code> object that we can extract from the model itself. That <code>family</code> object contains all the information we need to create proper confidence intervals for GLMs and related models.
</p>
<p>
For the logistic regression model we fitted earlier, the family object is the same as that returned by <code>binomial(link = 'logit')</code>, and we can extract it directly from the model using the extractor function <code>family()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">fam</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">family</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span><span class="w">
</span><span class="n">fam</span><span class="w">
</span><span class="n">str</span><span class="p">(</span><span class="n">fam</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: binomial
Link function: logit
List of 12
$ family : chr "binomial"
$ link : chr "logit"
$ linkfun :function (mu)
$ linkinv :function (eta)
$ variance :function (mu)
$ dev.resids:function (y, mu, wt)
$ aic :function (y, n, mu, wt, dev)
$ mu.eta :function (eta)
$ initialize: expression({ if (NCOL(y) == 1) { if (is.factor(y)) y <- y != levels(y)[1L] n <- rep.int(1, nobs) y[weights =| __truncated__
$ validmu :function (mu)
$ valideta :function (eta)
$ simulate :function (object, nsim)
- attr(*, "class")= chr "family"</code></pre>
</figure>
<p>
If you look closely you'll see a component named <code>linkinv</code> which is indicated to be a function. This is the <em>inverse</em> of the link function. The link function itself is in the <code>linkfun</code> component of the family. If we extract this function and look at it
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ilink</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fam</span><span class="o">$</span><span class="n">linkinv</span><span class="w">
</span><span class="n">ilink</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">function (eta)
.Call(C_logit_linkinv, eta)
<environment: namespace:stats></code></pre>
</figure>
<p>
we see something very simple involving an argument named <code>eta</code>, which stands for the linear predictor and means we need to provide values on the link scale as they would be computed directly from the linear predictor, <span class="math inline">\(\eta\)</span> (this is the Greek letter <em>eta</em>). In this instance the function calls out to compiled C code to compute the necessary values, but others are easier to understand and use simple R code, e.g. for the log link in the <code>poisson()</code> family we have
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">poisson</span><span class="p">()</span><span class="o">$</span><span class="n">linkinv</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">function (eta)
pmax(exp(eta), .Machine$double.eps)
<environment: namespace:stats></code></pre>
</figure>
<p>
This shows that we exponentiate <code>eta</code> (which we know is the correct inverse function), and this is wrapped in <code>pmax()</code> to ensure that the function doesn't return values smaller than <code>.Machine$double.eps</code>, the smallest (positive floating point) value <span class="math inline">\(x\)</span> such that <span class="math inline">\(1 + x \neq 1\)</span>.
</p>
<p>
Now that we have a (generally) reliable way of getting the inverse of the link function used when fitting a model, we can adapt the strategy we used earlier so that we get the right (approximate) confidence interval. For this we need to
</p>
<ul>
<li>
generate fitted values and standard errors on the <em>link</em> scale, using <code>predict(...., type = 'link')</code>, which happens to be the default in general, and
</li>
<li>
compute the confidence interval using these fitted values and standard errors, and then backtransform them to the response scale using the inverse of the link function we extracted from the model.
</li>
</ul>
<p>
For the wasp visitation logistic regression model then, we can do this using the following bit of code
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## grab the inverse link function
ilink <- family(mod)$linkinv
## add fit and se.fit on the link scale
ndata <- bind_cols(ndata, setNames(as_tibble(predict(mod, ndata, se.fit = TRUE)[1:2]),
                                   c('fit_link', 'se_link')))
## create the interval and backtransform
ndata <- mutate(ndata,
                fit_resp  = ilink(fit_link),
                right_upr = ilink(fit_link + (2 * se_link)),
                right_lwr = ilink(fit_link - (2 * se_link)))
## show
ndata</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 100 x 10
leafHeight fit wrong_se wrong_upr wrong_lwr fit_link se_link
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 14 0.00341 0.00567 0.0147 -0.00792 -5.68 1.67
2 14.7 0.00370 0.00605 0.0158 -0.00840 -5.60 1.64
3 15.4 0.00401 0.00646 0.0169 -0.00891 -5.51 1.62
4 16.1 0.00435 0.00690 0.0182 -0.00945 -5.43 1.59
5 16.8 0.00472 0.00737 0.0195 -0.0100 -5.35 1.57
6 17.5 0.00512 0.00786 0.0208 -0.0106 -5.27 1.54
7 18.2 0.00555 0.00839 0.0223 -0.0112 -5.19 1.52
8 18.9 0.00602 0.00895 0.0239 -0.0119 -5.11 1.49
9 19.7 0.00653 0.00954 0.0256 -0.0125 -5.02 1.47
10 20.4 0.00708 0.0102 0.0274 -0.0133 -4.94 1.45
# ... with 90 more rows, and 3 more variables: fit_resp <dbl>,
# right_upr <dbl>, right_lwr <dbl></code></pre>
</figure>
<p>
and now we can draw this interval on our plot from before
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plt</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ndata</span><span class="p">,</span><span class="w">
</span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">right_lwr</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">right_upr</span><span class="p">),</span><span class="w">
</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/confidence-intervals-for-glms-plot-right-confidence-interva-1.png" alt="Estimated probability of visitation as a function of leaf height with a correctly-computed 95% confidence interval superimposed. Notice the interval now doesn't exceed the probability limits, 0 and 1." />
<figcaption>
Estimated probability of visitation as a function of leaf height with a correctly-computed 95% confidence interval superimposed. Notice the interval now doesn't exceed the probability limits, 0 and 1.
</figcaption>
</figure>
<p>
And now we have confidence intervals that don't exceed the physical boundaries of the response scale.
</p>
<p>
If you want different coverage for the intervals, replace the <code>2</code> in the code with some other extreme quantile of the standard normal distribution, e.g.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">qnorm</span><span class="p">(</span><span class="m">0.005</span><span class="p">,</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="c1"># for a 99% interval (0.5% in each tail)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 2.575829</code></pre>
</figure>
<p>
and if we're being picky, if you have a small sample size and fitted a Gaussian GLM, then a critical value from the <em>t</em> distribution should be used
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">qt</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df.residual</span><span class="p">(</span><span class="n">mod</span><span class="p">),</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 2.021075</code></pre>
</figure>
<p>
where I'm using the <code>df.residual()</code> extractor function to get residual degrees of freedom for the <em>t</em> distribution. This makes little sense for a logistic regression, but let's just assume <code>mod</code> is a Gaussian GLM in this instance.
</p>
<p>
There we have it: a simple way to reliably compute confidence intervals for GLMs and related models fitted via well-behaved R model-fitting functions.
</p>
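<p>
If you find yourself doing this a lot, the steps are easily packaged up into a small helper. The function below is just a sketch (<code>link_ci()</code> is not from any package); it assumes only that the model's <code>predict()</code> method accepts <code>type = "link"</code> and <code>se.fit = TRUE</code>, and that <code>family()</code> works on the fitted model, as described above
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## hypothetical helper wrapping the recipe from this post; a sketch for
## models that follow the standard R family/predict conventions
link_ci <- function(model, newdata, level = 0.95) {
    ilink <- family(model)$linkinv                       # inverse link function
    crit  <- qnorm((1 - level) / 2, lower.tail = FALSE)  # e.g. ~1.96 for 95%
    p <- predict(model, newdata = newdata, type = "link", se.fit = TRUE)
    data.frame(newdata,
               fit = ilink(p$fit),
               lwr = ilink(p$fit - crit * p$se.fit),
               upr = ilink(p$fit + crit * p$se.fit))
}
## e.g. head(link_ci(mod, ndata["leafHeight"]))</code></pre>
</figure>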
Introducing gratia
Gavin L. Simpson
2018-10-23T06:00:00-06:00
2018-10-23T06:00:00-06:00
https://www.fromthebottomoftheheap.net/2018/10/23/introducing-gratia/
<p>
I use generalized additive models (GAMs) in my research work. I use them a lot! Simon Wood's <strong>mgcv</strong> package is an excellent set of software for specifying, fitting, and visualizing GAMs for very large data sets. Despite recently dabbling with <strong>brms</strong>, <strong>mgcv</strong> is still my go-to GAM package. The only down-side to <strong>mgcv</strong> is that it is not very tidy-aware and the <strong>ggplot</strong>-verse may as well not exist as far as it is concerned. This in itself is no bad thing, though as someone who uses <strong>mgcv</strong> a lot but also prefers to do my plotting with <strong>ggplot2</strong>, this lack of awareness was starting to hurt. So, I started working on something to help bridge the gap between these two separate worlds that I inhabit. The fruit of that labour is <strong>gratia</strong>, and development has progressed to the stage where I am ready to talk a bit more about it.
</p>
<p>
<strong>gratia</strong> is an R package for working with GAMs fitted with <code>gam()</code>, <code>bam()</code> or <code>gamm()</code> from <strong>mgcv</strong>, or <code>gamm4()</code> from the <strong>gamm4</strong> package, although functionality for handling the latter is not yet implemented. <strong>gratia</strong> provides functions to replace the base-graphics-based <code>plot.gam()</code> and <code>gam.check()</code> that <strong>mgcv</strong> provides with <strong>ggplot2</strong>-based versions. Recent changes have also resulted in <strong>gratia</strong> being much more <strong>tidyverse</strong> aware and it now (mostly) returns outputs as tibbles.
</p>
<p>
In this post I wanted to give a flavour of what is currently possible with <strong>gratia</strong> and outline what still needs to be implemented.
</p>
<p>
<strong>gratia</strong> currently lives on GitHub, so we need to install it from there using <code>devtools::install_github</code>:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">devtools</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s1">'gavinsimpson/gratia'</span><span class="p">)</span></code></pre>
</figure>
<p>
To do anything useful with <strong>gratia</strong> we need a GAM and for that we need <strong>mgcv</strong>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'mgcv'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'gratia'</span><span class="p">)</span></code></pre>
</figure>
<p>
and an old favourite example data set
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">20</span><span class="p">)</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamSim</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">400</span><span class="p">,</span><span class="w"> </span><span class="n">dist</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"normal"</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<p>
The simulated data in <code>dat</code> are well-studied in GAM-related research and contain a number of covariates – labelled <code>x0</code> through <code>x3</code> – which have, to varying degrees, non-linear relationships with the response. We want to try to recover these relationships by approximating the true relationships between covariate and response using splines. To fit a purely additive model, we use
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">mod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x0</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x3</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span></code></pre>
</figure>
<p>
<strong>mgcv</strong> provides a <code>summary()</code> method that is used to extract information about the fitted GAM
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
y ~ s(x0) + s(x1) + s(x2) + s(x3)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.7625 0.0959 80.95 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x0) 3.528 4.370 11.30 4.2e-09 ***
s(x1) 2.662 3.310 129.02 < 2e-16 ***
s(x2) 8.146 8.799 84.72 < 2e-16 ***
s(x3) 1.001 1.002 0.00 0.987
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.763 Deviance explained = 77.2%
-REML = 850.87 Scale est. = 3.6785 n = 400</code></pre>
</figure>
<p>
and the <code>k.check()</code> function for checking whether sufficient numbers of basis functions were used in each smooth in the model. (You may not have used <code>k.check()</code> directly – it is called by <code>gam.check()</code>, which prints out other diagnostics and also produces four model diagnostic plots, which is one thing that <strong>gratia</strong> provides a replacement for.)
</p>
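<p>
Using it is as simple as passing the fitted model; the printed output (not shown here) lists, for each smooth, the basis dimension used, the effective degrees of freedom, and a test statistic indicating whether <code>k</code> may have been set too low
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## check whether the basis dimension k was large enough for each smooth
k.check(mod)</code></pre>
</figure>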
<h3 id="plotting-smooths">
Plotting smooths
</h3>
<p>
To visualize estimated GAMs, <strong>mgcv</strong> provides the <code>plot.gam()</code> method and the <code>vis.gam()</code> function. <strong>gratia</strong> currently provides a <strong>ggplot2</strong>-based replacement for <code>plot.gam()</code>. Work is on-going to provide <code>vis.gam()</code>-like functionality within <strong>gratia</strong> – see <code>?gratia::data_slice</code> for early work in that direction. In <strong>gratia</strong>, we use the <code>draw()</code> generic to produce <strong>ggplot2</strong>-like plots from objects. To visualize the four estimated smooth functions in the GAM <code>mod</code>, we would use
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">draw</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/introducing-gratia-draw-mod-1.png" alt="The result of draw(mod) is a plot of each of the four smooth functions in the mod GAM." />
<figcaption>
The result of <code>draw(mod)</code> is a plot of each of the four smooth functions in the <code>mod</code> GAM.
</figcaption>
</figure>
<p>
Internally <code>draw()</code> uses the <code>plot_grid()</code> function from <strong>cowplot</strong> to draw multiple panels on the plot device, and to line up the individual plots.
</p>
<p>
There's not an awful lot more you can do with this right now, but at least the plot is reasonably pretty. <strong>gratia</strong> includes tools for working with the underlying smooths represented in <code>mod</code>, and if you wanted to extract most of the data used to build the plot you'd use the <code>evaluate_smooth()</code> function.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">evaluate_smooth</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="s2">"x1"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 100 x 5
smooth fs_variable x1 est se
<chr> <fct> <dbl> <dbl> <dbl>
1 s(x1) <NA> 0.000565 -2.75 0.294
2 s(x1) <NA> 0.0106 -2.72 0.277
3 s(x1) <NA> 0.0207 -2.68 0.261
4 s(x1) <NA> 0.0308 -2.64 0.245
5 s(x1) <NA> 0.0409 -2.60 0.230
6 s(x1) <NA> 0.0510 -2.56 0.217
7 s(x1) <NA> 0.0610 -2.52 0.204
8 s(x1) <NA> 0.0711 -2.48 0.193
9 s(x1) <NA> 0.0812 -2.44 0.183
10 s(x1) <NA> 0.0913 -2.40 0.173
# ... with 90 more rows</code></pre>
</figure>
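<p>
Because <code>evaluate_smooth()</code> returns a tibble with <code>est</code> and <code>se</code> columns, rolling your own plot is straightforward; a minimal sketch (assuming <strong>ggplot2</strong> is loaded) using an approximate interval of plus/minus two standard errors would be
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## evaluate the smooth of x1 and plot it by hand with ggplot2
sm <- evaluate_smooth(mod, "x1")
ggplot(sm, aes(x = x1, y = est)) +
    geom_ribbon(aes(ymin = est - 2 * se, ymax = est + 2 * se), alpha = 0.2) +
    geom_line() +
    labs(y = "Effect of x1")</code></pre>
</figure>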
<h3 id="producing-diagnostic-plots">
Producing diagnostic plots
</h3>
<p>
The diagnostic plots currently produced by <code>gam.check()</code> can also be produced using <strong>gratia</strong>, with the <code>appraise()</code> function
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">appraise</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/introducing-gratia-appraise-mod-1.png" alt="The result of appraise(mod) is an array of four diagnostics plots, including a Q-Q plot (top left) and histogram (bottom left) of model residuals, a plot of residuals vs the linear predictor (top right), and a plot of observed vs fitted values." />
<figcaption>
The result of <code>appraise(mod)</code> is an array of four diagnostics plots, including a Q-Q plot (top left) and histogram (bottom left) of model residuals, a plot of residuals vs the linear predictor (top right), and a plot of observed vs fitted values.
</figcaption>
</figure>
<p>
Each of the four plots is produced via a user-accessible function that implements a specific plot. For example, <code>qq_plot(mod)</code> produces the Q-Q plot in the upper left of the figure above, and the <code>qq_plot.gam()</code> method reproduces most of the functionality of <code>mgcv::qq.gam()</code>, including the direct randomization procedure (<code>method = 'direct'</code>, as shown above) and the data simulation procedure (<code>method = 'simulate'</code>) to generate reference quantiles, which typically have better performance for GLM-like models <span class="citation" data-cites="Augustin2012-sc">(Augustin et al., 2012)</span>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">qq_plot</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'simulate'</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/introducing-gratia-qq-plot-mod-1.png" alt="The result of qq_plot(mod, method = 'simulate', fig.width = 6, fig.height = 4) is a Q-Q plot of residuals, where the reference quantiles are derived by simulating data from the fitted model." />
<figcaption>
The result of <code>qq_plot(mod, method = 'simulate', fig.width = 6, fig.height = 4)</code> is a Q-Q plot of residuals, where the reference quantiles are derived by simulating data from the fitted model.
</figcaption>
</figure>
<p>
<code>draw()</code> can also handle many of the more specialized smoothers currently available in <strong>mgcv</strong>. For example, 2D smoothers are represented as <code>geom_raster()</code> surfaces with contours
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamSim</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4000</span><span class="p">,</span><span class="w"> </span><span class="n">dist</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"normal"</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">mod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span><span class="w">
</span><span class="n">draw</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/introducing-gratia-draw-2d-mod-1.png" alt="The default way a 2D smoother is plotted using draw()." />
<figcaption>
The default way a 2D smoother is plotted using <code>draw()</code>.
</figcaption>
</figure>
<p>
and factor-smooth-interaction terms, which are the equivalent of random slopes and intercepts for splines, are drawn on a single panel and colour is used to distinguish the different random smooths
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## simulate example... from ?mgcv::factor.smooth.interaction</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="c1">## simulate data...</span><span class="w">
</span><span class="n">f0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">sin</span><span class="p">(</span><span class="nb">pi</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="n">f1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="o">=</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="o">=</span><span class="m">-1</span><span class="p">)</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">a</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="o">+</span><span class="n">b</span><span class="w">
</span><span class="n">f2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="m">0.2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="m">11</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">10</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">x</span><span class="p">))</span><span class="o">^</span><span class="m">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">10</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">10</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="o">^</span><span class="m">3</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="o">^</span><span class="m">10</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">500</span><span class="w">
</span><span class="n">nf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="n">fac</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">nf</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">x0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">x1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">x2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">nf</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">.2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">2</span><span class="p">;</span><span class="w">
</span><span class="n">b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">nf</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">.5</span><span class="w">
</span><span class="n">f</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">f0</span><span class="p">(</span><span class="n">x0</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">f1</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="p">[</span><span class="n">fac</span><span class="p">],</span><span class="w"> </span><span class="n">b</span><span class="p">[</span><span class="n">fac</span><span class="p">])</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">f2</span><span class="p">(</span><span class="n">x2</span><span class="p">)</span><span class="w">
</span><span class="n">fac</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">fac</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">f</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x0</span><span class="p">,</span><span class="w"> </span><span class="n">x1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">x2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x2</span><span class="p">,</span><span class="w"> </span><span class="n">fac</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fac</span><span class="p">)</span><span class="w">
</span><span class="n">mod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="o">~</span><span class="n">s</span><span class="p">(</span><span class="n">x0</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">fac</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="o">=</span><span class="s2">"fs"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="o">=</span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x2</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="o">=</span><span class="m">20</span><span class="p">),</span><span class="w">
</span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ML"</span><span class="p">)</span><span class="w">
</span><span class="n">draw</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/introducing-gratia-draw-fs-mod-1.png" alt="The result of draw(mod) for a more complex GAM containing a factor-smooth-interaction term with bs = 'fs'." />
<figcaption>
The result of <code>draw(mod)</code> for a more complex GAM containing a factor-smooth-interaction term with <code>bs = 'fs'</code>.
</figcaption>
</figure>
<h3 id="what-else-can-gratia-do">
What else can gratia do?
</h3>
<p>
Although still quite early in the planned development cycle, <strong>gratia</strong> can handle most of the smooths that <strong>mgcv</strong> can estimate, including <code>by</code> variable smooths with factor and continuous <code>by</code> variables, random effect smooths (<code>bs = 're'</code>), 2D tensor product smooths, and models with parametric terms.
</p>
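<p>
As a small, hedged sketch of the random effect smooth support ā this example is not from the post itself, and simulates grouped data with <strong>mgcv</strong>ās <code>gamSim()</code>:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library('mgcv')
library('gratia')

## simulate grouped data; gamSim(4) returns a data frame containing a factor `fac`
set.seed(1)
dat <- gamSim(4, n = 400, verbose = FALSE)

## a smooth of x2 plus a random intercept for each level of fac
m_re <- gam(y ~ s(x2) + s(fac, bs = 're'), data = dat, method = 'REML')

## draw() renders the random effect term as a QQ plot of the estimated effects
draw(m_re)</code></pre>
</figure>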
<p>
Smoothers that <strong>gratia</strong> canāt do anything with as yet are Markov random fields (MRFs; <code>bs = 'mrf'</code>), splines on the sphere (SoSs; <code>bs = 'sos'</code>), soap film smoothers (<code>bs = 'so'</code>), and linear functional models with matrix terms.
</p>
<p>
The package also includes functions for
</p>
<ul>
<li>
calculating across-the-function and simultaneous confidence intervals for smooths via <code>confint()</code> methods, and
</li>
<li>
calculating first and second derivatives of (currently only univariate) smooths using finite differences. <code>fderiv()</code> is the old home for first derivatives of GAM smooths, whilst the new <code>derivatives()</code> function can calculate first and second derivatives using forward (as <code>fderiv()</code> does), backward, or central finite differences (a short sketch follows this list).
</li>
</ul>
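<p>
As a brief sketch of both of these, reusing the model <code>mod</code> fitted above ā the argument names here are those in <strong>gratia</strong> at the time of writing and may change:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## across-the-function 95% confidence interval for one smooth
ci <- confint(mod, parm = 's(x2)', type = 'confidence')

## simultaneous interval for the same smooth
si <- confint(mod, parm = 's(x2)', type = 'simultaneous')

## first derivative of the smooth via central finite differences
fd <- derivatives(mod, term = 's(x2)', type = 'central')</code></pre>
</figure>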
<p>
There are also a lot of exported functions that make it easier to work with GAMs fitted by <strong>mgcv</strong> and to extract aspects of the fitted model and its smooths. The exact functionality is still being worked on, so be prepared for some of the functions to come and go or change name as I work through ideas and implementations and settle on the interface for the tools that <strong>gratia</strong> will provide for this.
</p>
<h3 id="what-cant-gratia-do">
What canāt gratia do?
</h3>
<p>
Iāve already covered where <strong>gratia</strong> is currently lacking with respect to the types of smoother that <strong>mgcv</strong> can fit. It is also currently lacking in tools for exploring models in more detail, such as the plots of model predictions over slices of covariate space that <code>vis.gam()</code> can produce (though see <code>gratia::data_slice()</code> for functions to create the data needed for such plots). Nor can <strong>gratia</strong> currently handle smooths of more than two dimensions. Iād like to add this capability soon, as it will make visualizing GAMs fitted to spatio-temporal data much easier than it currently is.
</p>
<h3 id="the-future">
The future?
</h3>
<p>
Longer term, I plan to fill out the types of smoother that <strong>gratia</strong> can handle to cover all the types that <strong>mgcv</strong> can fit, and to add <code>vis.gam()</code>-like functionality and the ability to handle higher-dimensional smooths (<code>plot.gam()</code> can now handle 3- or 4-dimensional smooths).
</p>
<p>
The ultimate goal of course is to just have <code>draw()</code> work for whatever GAM model you throw at it, and at least have feature parity with <code>plot.gam()</code> and <code>vis.gam()</code>.
</p>
<p>
As is to be expected for such an early release, there is a lot of stabilization to function names and arguments that needs to happen in <strong>gratia</strong>, and a lot of documentation to be written, including some vignettes. For now, the best way to understand what <strong>gratia</strong> is doing or how it works is to look at the examples on the <strong>gratia</strong> <a href="https://gavinsimpson.github.io/gratia/">website</a> (built using <strong>pkgdown</strong>) and take a look at the <a href="https://github.com/gavinsimpson/gratia/tree/master/tests/testthat">package tests</a> which contain lots of examples of GAM fits and the code to work with them.
</p>
<p>
Iām very much interested in user feedback, so please do let me know if you have any suggestions for additions or improvements to <strong>gratia</strong>, and if you do use <strong>gratia</strong> and find bugs in the package or GAMs that <strong>gratia</strong> canāt handle I would love to hear from you. You can get in touch via the comments below, or via <a href="https://github.com/gavinsimpson/gratia/issues">GitHub Issues</a>.
</p>
<p>
I would also be remiss if I did not mention Matteo Fasioloās excellent <a href="https://mfasiolo.github.io/mgcViz/"><strong>mgcViz</strong> package</a>, which already has extensive capabilities for exploring GAM fits, including some very interesting approaches to handling models fitted to millions of data points or more, which pose real data-visualization challenges.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Augustin2012-sc">
<p>
Augustin, N. H., Sauleau, E.-A., and Wood, S. N. (2012). On quantile quantile plots for generalized linear models. <em>Computational Statistics & Data Analysis</em> 56, 2404ā2409. doi:<a href="https://doi.org/10.1016/j.csda.2012.01.026">10.1016/j.csda.2012.01.026</a>.
</p>
</div>
</div>
Controls on subannual variation in pCO<sub>2</sub> in productive hardwater lakes
Gavin L. Simpson
2018-10-15T11:00:00-06:00
2018-10-15T11:00:00-06:00
https://www.fromthebottomoftheheap.net/2018/10/15/wiik-jgr-co2-paper/
<p>
This year is looking like a bumper year for papers from the lab and collaborations, past and ongoing. Over the <a href="/2018/10/15/summer-hiatus/">summer hiatus</a> three papers came out online in their version-of-record form. The first of these was a paper on work that Emma Wiik, a former postdoc in my lab and Peter Leavittās lab, conducted to further our research on the controls on CO<sub>2</sub> exchange between lakes and the atmosphere.
</p>
<p>
Lakes play an important role in processing terrestrial carbon and influence carbon fluxes at the global scale. Unpacking the detail of the respective controls on CO<sub>2</sub> exchange with the atmosphere is an active and productive area of limnological research. In 2015, we published <span class="citation" data-cites="Finlay2015-bw">(Finlay et al., 2015)</span> an analysis of time series data of CO<sub>2</sub> flux from hardwater prairie lakes, which showed that as these lakes warmed due to climate change, the efflux of CO<sub>2</sub> from the lakes actually decreased. This result was contrary to those observed in northern Boreal lakes, and reflects the need to study a range of lake types when generalizing from individual research projects to global scale assessments of the role of lakes in the carbon cycle.
</p>
<p>
Emmaās paper <span class="citation" data-cites="Wiik2018-ve">(Wiik et al., 2018)</span>, which was published in <a href="https://doi.org/10.1029/2018JG004506">Journal of Geophysical Research: Biogeosciences</a> in May, took a closer look than the 2015 paper at the controls on CO<sub>2</sub> exchange. Across the six QuāAppelle lakes in the 2015 study, weād focused on trends in pH and CO<sub>2</sub> flux and the control of annual CO<sub>2</sub> flux by ice-cover duration, yielding results that spoke to the multi-annual to decadal scale relationships between CO<sub>2</sub> exchange and the important drivers. In the new paper, we used generalized additive models (GAMs) to model the full 18-year time series of limnological data.
</p>
<p>
Two GAMs were fitted and described in the paper. The first modelled CO<sub>2</sub> flux as a smooth function of lake pH over all six lakes, allowing for lake-specific effects of pH on CO<sub>2</sub> as well as accounting for change over time. Our CO<sub>2</sub> data were not directly measured, instead being calculated from geochemical equations, including pH. Hence this first model was simply to quantify how much of the variation in CO<sub>2</sub> we could explain using pH. As the latter was used to calculate the former, the explained variation was high, but never equal to 1.
</p>
<p>
Having established that pH was the primary control on CO<sub>2</sub> exchange in the six study lakes, we wanted to try to model the lake water pH observations using a series of selected climatic and metabolic variables, chosen to reflect the major factors thought to control CO<sub>2</sub> exchange. A second GAM was fitted with pH as the response variable and lake-specific smooth functions of the metabolic and climatic variables.
</p>
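<p>
Purely as an illustrative sketch of the kind of model this describes ā the data frame and covariate names below are hypothetical, not those used in the paper ā the second GAM had the general form
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## lake-specific smooths of metabolic and climatic covariates plus a
## parametric lake effect (hypothetical data frame and variable names)
m_ph <- gam(pH ~ lake +
                s(chla, by = lake) + s(o2, by = lake) +    # metabolic
                s(temp, by = lake) + s(wind, by = lake),   # climatic
            data = quappelle, method = 'REML')</code></pre>
</figure>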
<p>
Through the second GAM, we were able to show that, in the six QuāAppelle study lakes, metabolic drivers of CO<sub>2</sub> flux were more important at the dailyāmonthly scale than climatic drivers, while the latter were more important at the interannual scale.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/wiik-et-al-2018-figure-4.png" title="Figure 4 from Wiik et al 2018." alt="Figure 4 from the paper. (aāc) GAM partial effect splines for significant metabolic variables. Dotted lines: means of y and x; Shaded area: middle 90% of all observations. Rug: data points. (a) GAM splines for chlorophyll a, with lakes with significantly different splines to the global spline indicated by color/hue and linetype. (b) GAM spline of oxygen, with standard errors indicated by shading. (c) GAM spline of dissolved organic carbon, with standard errors indicated by shading." />
<figcaption>
Figure 4 from the paper. (aāc) GAM partial effect splines for significant metabolic variables. Dotted lines: means of <span class="math inline">(y)</span> and <span class="math inline">(x)</span>; Shaded area: middle 90% of all observations. Rug: data points. (a) GAM splines for chlorophyll a, with lakes with significantly different splines to the global spline indicated by color/hue and linetype. (b) GAM spline of oxygen, with standard errors indicated by shading. (c) GAM spline of dissolved organic carbon, with standard errors indicated by shading.
</figcaption>
</figure>
<p>
The paper is available from the <a href="https://doi.org/10.1029/2018JG004506">journal website</a> or via a <a href="/assets/reprints/wiik-2018-jgr-b-co2-preprint.pdf">preprint</a> if you do not have access to Journal of Geophysical Research: Biogeosciences.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Finlay2015-bw">
<p>
Finlay, K., Vogt, R. J., Bogard, M. J., Wissel, B., Tutolo, B. M., Simpson, G. L., et al. (2015). Decrease in CO<sub>2</sub> efflux from northern hardwater lakes with increasing atmospheric warming. <em>Nature</em> 519, 215ā218. doi:<a href="https://doi.org/10.1038/nature14172">10.1038/nature14172</a>.
</p>
</div>
<div id="ref-Wiik2018-ve">
<p>
Wiik, E., Haig, H. A., Hayes, N. M., Finlay, K., Simpson, G. L., Vogt, R. J., et al. (2018). Generalized additive models of climatic and metabolic controls of subannual variation in pCO<sub>2</sub> in productive hardwater lakes. <em>Journal of Geophysical Research: Biogeosciences</em> 123, 1940ā1959. doi:<a href="https://doi.org/10.1029/2018JG004506">10.1029/2018JG004506</a>.
</p>
</div>
</div>
Summer hiatus
Gavin L. Simpson
2018-10-15T07:00:00-06:00
2018-10-15T07:00:00-06:00
https://www.fromthebottomoftheheap.net/2018/10/15/summer-hiatus/
<p>
Itās been quite some time since I last posted anything here. Mostly this was due to a very busy schedule since May that included teaching an online stats course, attending & presenting at three conferences, giving workshops at two of those conferences, and taking some well-earned vacation in Europe. Summer was also a busy time for manuscripts moving through the pipeline to being accepted and published. One thing I had hoped to do with the blog this year was publicize some of the work I do a little more. So, as normal service resumes here I hope to post some short pieces highlighting new papers that came out over the summer, and a few of these will be coming out over the next week or two.
</p>
<p>
One of the reasons for having this blog in the first place was to get me back into āwriting modeā; I find it difficult at times, especially when the to-do list is long, to force myself to carve out time to both think <em>and</em> write. And as I get more and more out of practice writing, it takes more and more time to start or pick up work on manuscripts describing new results, and the words donāt flow easily at all. I find it much easier to write when I am towards the end of a writing period because Iāve literally forced myself to write. And, whilst blog posts arenāt the same kind of writing as for manuscripts, I hope that by just doing a little writing each week, itāll be that bit easier to pick up work on a languishing manuscript or start something new.
</p>
<p>
Letās see how I get onā¦
</p>
Fitting GAMs with brms: part 1
Gavin L. Simpson
2018-04-21T04:00:00-06:00
2018-04-21T04:00:00-06:00
https://www.fromthebottomoftheheap.net/2018/04/21/fitting-gams-with-brms/
<p>
Regular readers will know that I have a somewhat unhealthy relationship with GAMs and the <strong>mgcv</strong> package. I use these models all the time in my research but recently weāve been hitting the limits of the range of models that <strong>mgcv</strong> can fit. So Iāve been looking into alternative ways to fit the GAMs I want to fit but which can handle the kinds of data or distributions that have been cropping up in our work. The <strong>brms</strong> package <span class="citation" data-cites="brms-2017">(Bürkner, 2017)</span> is an excellent resource for modellers, providing a high-level R front end to a vast array of model types, all fitted using <a href="http://mc-stan.org">Stan</a>. <strong>brms</strong> is the perfect package to go beyond the limits of <strong>mgcv</strong> because <strong>brms</strong> even uses the smooth functions provided by <strong>mgcv</strong>, making the transition easier. In this post I take a look at how to fit a simple GAM in <strong>brms</strong> and compare it with the same model fitted using <strong>mgcv</strong>.
</p>
<p>
In this post weāll use the following packages. If you donāt know <strong>schoenberg</strong>, itās a package Iām writing to provide <code>ggplot</code> versions of plots that can be produced by <strong>mgcv</strong> from fitted GAM objects. <strong>schoenberg</strong> is in early development, but it currently works well enough to plot the models we fit here. If youāve never come across this package before, you can install it from GitHub using <code>devtools::install_github('gavinsimpson/schoenberg')</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## packages</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'mgcv'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'brms'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'ggplot2'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'schoenberg'</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span></code></pre>
</figure>
<p>
To illustrate <strong>brms</strong>ās GAM-fitting chops, weāll use the <code>mcycle</code> data set that comes with the <strong>MASS</strong> package. It contains a set of measurements of the acceleration force on a riderās head during a simulated motorcycle collision and the time, in milliseconds, post collision. The data are loaded using <code>data()</code> and we take a look at the first few rows
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## load the example data mcycle</span><span class="w">
</span><span class="n">data</span><span class="p">(</span><span class="n">mcycle</span><span class="p">,</span><span class="w"> </span><span class="n">package</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'MASS'</span><span class="p">)</span><span class="w">
</span><span class="c1">## show data</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">mcycle</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> times accel
1 2.4 0.0
2 2.6 -1.3
3 3.2 -2.7
4 3.6 0.0
5 4.0 -2.7
6 6.2 -2.7</code></pre>
</figure>
<p>
The aim is to model the acceleration force (<code>accel</code>) as a function of time post collision (<code>times</code>). The plot below shows the data.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mcycle</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">times</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">accel</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Miliseconds post impact"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Acceleration (g)"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Simulated Motorcycle Accident"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Measurements of head acceleration"</span><span class="p">)</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/fitting-gams-with-brms-plot-data-1.png" />
</p>
<p>
Weāll model acceleration as a <em>smooth</em> function of time using a GAM and the default thin plate regression spline basis. This can be done using the <code>gam()</code> function in <strong>mgcv</strong> and, for comparison with the fully Bayesian model weāll fit shortly, we use <code>method = "REML"</code> to estimate the smoothness parameter for the spline in mixed model form using REML
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">accel</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">times</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mcycle</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">m1</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
accel ~ s(times)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -25.546 1.951 -13.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(times) 8.625 8.958 53.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.783 Deviance explained = 79.7%
-REML = 616.14 Scale est. = 506.35 n = 133</code></pre>
</figure>
<p>
As we can see from the model summary, the estimated smooth uses about 8.6 effective degrees of freedom and, in the test of zero effect, the null hypothesis is strongly rejected. The fitted spline explains about 80% of the variance or deviance in the data.
</p>
<p>
To plot the fitted smooth we could use the <code>plot()</code> method provided by <strong>mgcv</strong>, but this uses base graphics. Instead we can use the <code>draw()</code> method from <strong>schoenberg</strong>, which can currently handle most of the univariate smooths in <strong>mgcv</strong> plus 2-d tensor product smooths
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">draw</span><span class="p">(</span><span class="n">m1</span><span class="p">)</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/fitting-gams-with-brms-plot-mgcv-model-1.png" />
</p>
<p>
The equivalent model can be estimated using a fully Bayesian approach via the <code>brm()</code> function in the <strong>brms</strong> package. In fact, <code>brm()</code> will use the smooth specification functions from <strong>mgcv</strong>, making our lives much easier. The major difference, though, is that you canāt use <code>te()</code> or <code>ti()</code> smooths in <code>brm()</code> models; you need to use <code>t2()</code> tensor product smooths instead. This is because the smooths in the model are going to be treated as random effects and the model estimated as a GLMM, which exploits the duality of splines as random effects. In this representation, the wiggly parts of the spline basis are treated as a random effect and their associated variance parameter controls the degree of wiggliness of the fitted spline. The perfectly smooth parts of the basis are treated as a fixed effect. In this form, the GAM can be estimated using standard GLMM software; itās what allows the <code>gamm4()</code> function to fit GAMMs using the <strong>lme4</strong> package, for example. This is also the reason why we canāt use <code>te()</code> or <code>ti()</code> smooths; those smooths do not have nicely separable penalties, which means they canāt be written in the form required for fitting with typical mixed model software.
</p>
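<p>
For example, where you might have used <code>te(x, z)</code> with <code>gam()</code>, a <code>brm()</code> model needs <code>t2()</code>; a sketch with a hypothetical data frame <code>dat</code> and covariates <code>x</code> and <code>z</code>:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## t2() tensor products have separable penalties, so they can be recast in
## the random effect form that brm() requires; te(x, z) would not work here
m_tp <- brm(bf(y ~ t2(x, z)), data = dat, family = gaussian(),
            cores = 4, seed = 17)</code></pre>
</figure>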
<p>
The <code>brm()</code> version of the GAM is fitted using the code below. Note that I have changed a few things from their default values:
</p>
<ol type="1">
<li>
the model required more than the default number of MCMC samples ā <code>iter = 4000</code>,
</li>
<li>
the samples needed thinning to deal with some strong autocorrelation in the Markov chains ā <code>thin = 10</code>,
</li>
<li>
the <code>adapt_delta</code> parameter, a tuning parameter in the NUTS sampler for Hamiltonian Monte Carlo, potentially needed raising ā there was a warning about a potential divergent transition, but rather than check whether it really was one, I just increased the tuning parameter to <code>0.99</code>,
</li>
<li>
four chains are fitted by default, but I wanted them run on 4 CPU <code>cores</code>,
</li>
<li>
<code>seed</code> sets the internal random number generator seed, which allows reproducibility of models, and
</li>
<li>
for this post I didnāt want to print out the progress of the sampler ā <code>refresh = 0</code> ā but typically you <em>will</em> want to see how sampling is progressing, so you wonāt normally set this.
</li>
</ol>
<p>
The rest of the model is pretty similar to the <code>gam()</code> version we fitted earlier. The main difference is that I use the <code>bf()</code> function to create a special <strong>brms</strong> formula specifying the model. You donāt actually need to do this for such a simple model, but in a later post weāll use this to fit distributional GAMs. Note that Iām leaving all the priors in the model at the default values. Iāll look at defining priors in a later post; for now Iām just going to use the default priors that <code>brm()</code> uses
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">brm</span><span class="p">(</span><span class="n">bf</span><span class="p">(</span><span class="n">accel</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">times</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mcycle</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gaussian</span><span class="p">(),</span><span class="w"> </span><span class="n">cores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">17</span><span class="p">,</span><span class="w">
</span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4000</span><span class="p">,</span><span class="w"> </span><span class="n">warmup</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="n">thin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w">
</span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">adapt_delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.99</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Compiling the C++ model</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Start sampling</code></pre>
</figure>
<p>
Once the model has finished compiling and sampling we can output the model summary
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m2</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> Family: gaussian
Links: mu = identity; sigma = identity
Formula: accel ~ s(times)
Data: mcycle (Number of observations: 133)
Samples: 4 chains, each with iter = 4000; warmup = 1000; thin = 10;
total post-warmup samples = 1200
ICs: LOO = NA; WAIC = NA; R2 = NA
Smooth Terms:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sds(stimes_1) 722.44 198.12 450.17 1150.27 1180 1.00
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept -25.54 2.02 -29.66 -21.50 1200 1.00
stimes_1 16.10 38.20 -61.46 90.91 1171 1.00
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma 22.78 1.47 19.94 25.68 1200 1.00
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
is a crude measure of effective sample size, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).</code></pre>
</figure>
<p>
This outputs details of the fitted model plus parameter estimates (as posterior means), estimation errors (posterior standard deviations), (by default) 95% credible intervals, and two other diagnostics:
</p>
<ol type="1">
<li>
<code>Eff.Sample</code> is the effective sample size of the posterior samples in the model, and
</li>
<li>
<code>Rhat</code> is the <em>potential scale reduction factor</em> or Gelman-Rubin diagnostic and is a measure of how well the chains have converged and ideally should be equal to <code>1</code>.
</li>
</ol>
<p>
The summary includes two entries for the smooth of <code>times</code>:
</p>
<ol type="1">
<li>
<code>sds(stimes_1)</code> is the standard deviation parameter, which has the effect of controlling the wiggliness of the smooth ā the larger this value, the more wiggly the smooth. We can see that the credible interval doesnāt include 0, so there is evidence that a smooth is required over and above a linear parametric effect of <code>times</code>, details of which are given next,
</li>
<li>
<code>stimes_1</code> is the fixed effect part of the spline, which is the linear function that is perfectly smooth.
</li>
</ol>
<p>
The final parameter table gives the estimate of <code>sigma</code>, the standard deviation of the data about the conditional mean of the response.
</p>
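<p>
If you want to work with the posterior for that smoothness-related standard deviation directly, something like the following should work; <code>posterior_samples()</code> was the extractor at the time of writing (newer versions of <strong>brms</strong> prefer <code>as_draws_df()</code>):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## posterior draws for the standard deviation of the wiggly part of s(times)
sds_post <- posterior_samples(m2, pars = 'sds_stimes_1')
summary(sds_post)</code></pre>
</figure>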
<p>
How does this model compare with the one fitted using <code>gam()</code>? We can use the <code>gam.vcomp()</code> function to compute the variance component representation of the smooth estimated via <code>gam()</code>. To make it comparable with the value shown for the <strong>brms</strong> model, we donāt undo the rescaling of the penalty matrix that <code>gam()</code> performs to help with numeric stability during model fitting.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">gam.vcomp</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">rescale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Standard deviations and 0.95 confidence intervals:
std.dev lower upper
s(times) 807.88726 480.66162 1357.88215
scale 22.50229 19.85734 25.49954
Rank: 2/2</code></pre>
</figure>
<p>
This gives an estimated standard deviation of 807.89 with a 95% confidence interval of 480.66ā1357.88, which compares well with the posterior mean and credible interval from the <code>brm()</code> version: 722.44 (450.17ā1150.27).
</p>
<p>
The <code>marginal_smooths()</code> function is used to extract the marginal effect of the spline.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">msms</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">marginal_smooths</span><span class="p">(</span><span class="n">m2</span><span class="p">)</span></code></pre>
</figure>
<p>
This function extracts enough information about the estimated spline to plot it using the <code>plot()</code> method
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">msms</span><span class="p">)</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/fitting-gams-with-brms-plot-marginal-smooths-1.png" />
</p>
<p>
Given the similarity in the variance components of the two models, it is not surprising that the two estimated smooths also look similar. The <code>marginal_smooths()</code> function is effectively the equivalent of the <code>plot()</code> method for <strong>mgcv</strong>-based GAMs.
</p>
<p>
Thereās a lot that we can and should do to check the model fit. For now, weāll look at two posterior predictive check plots that <strong>brms</strong>, via the <strong>bayesplot</strong> package <span class="citation" data-cites="bayesplot-2018">(Gabry and Mahr, 2018)</span>, makes very easy to produce using the <code>pp_check()</code> function.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pp_check</span><span class="p">(</span><span class="n">m2</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Using 10 posterior samples for ppc type 'dens_overlay' by default.</code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/fitting-gams-with-brms-pp-check-density-1.png" />
</p>
<p>
The default produces a density plot overlay of the original response values (the thick black line) with 10 draws from the posterior predictive distribution of the model. If the model is a good fit to the data, data simulated from it at the observed values of the covariate(s) should look similar to the observed data.
</p>
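<p>
The number of posterior draws used in the overlay is adjustable; at the time of writing the argument was <code>nsamples</code> (more recent <strong>brms</strong> releases call it <code>ndraws</code>):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## use 50 posterior draws for a denser picture of the predictive distribution
pp_check(m2, nsamples = 50)</code></pre>
</figure>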
<p>
Another type of posterior predictive check plot is the empirical cumulative distribution function of the observations and random draws from the model posterior, which we can produce with <code>type = "ecdf_overlay"</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pp_check</span><span class="p">(</span><span class="n">m2</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ecdf_overlay"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Using 10 posterior samples for ppc type 'ecdf_overlay' by default.</code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/fitting-gams-with-brms-pp-check-ecdf-1.png" />
</p>
<p>
Both plots show significant deviations between the posterior simulations and the observed data. The poor posterior predictive check results are in large part due to the non-constant variance of the acceleration data conditional upon the covariate. Both models assumed that the observations are distributed Gaussian, with means equal to the fitted values (the estimated expectation of the response) and a common variance <span class="math inline">(\sigma^2)</span>. The observations appear to have different variances, which we could model with a distributional model, in which all parameters of the distribution of the response can be modelled with their own linear predictors. Weāll take a look at these models in a future post.
</p>
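<p>
Although the details are for that future post, a minimal sketch of such a distributional model in <strong>brms</strong> just adds a linear predictor for <code>sigma</code> to the <code>bf()</code> formula:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## model both the mean and the standard deviation of accel as smooth
## functions of time; sampler settings as before
m3 <- brm(bf(accel ~ s(times), sigma ~ s(times)),
          data = mcycle, family = gaussian(), cores = 4, seed = 17,
          iter = 4000, warmup = 1000, thin = 10, refresh = 0,
          control = list(adapt_delta = 0.99))</code></pre>
</figure>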
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-brms-2017">
<p>
Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. <em>Journal of Statistical Software</em> 80, 1ā28. doi:<a href="https://doi.org/10.18637/jss.v080.i01">10.18637/jss.v080.i01</a>.
</p>
</div>
<div id="ref-bayesplot-2018">
<p>
Gabry, J., and Mahr, T. (2018). <em>Bayesplot: Plotting for bayesian models</em>. Available at: <a href="https://CRAN.R-project.org/package=bayesplot">https://CRAN.R-project.org/package=bayesplot</a>.
</p>
</div>
</div>
Comparing smooths in factor-smooth interactions II
Gavin L. Simpson
2017-12-14T10:00:00-06:00
2017-12-14T10:00:00-06:00
https://www.fromthebottomoftheheap.net/2017/12/14/difference-splines-ii/
<p>
In a <a href="https://www.fromthebottomoftheheap.net/2017/10/10/difference-splines-i/">previous post</a> I looked at an approach for computing the differences between smooths estimated as part of a factor-smooth interaction using <code>s()</code>ās <code>by</code> argument. When a common-or-garden factor variable is passed to <code>by</code>, <code>gam()</code> estimates a separate smooth for each <em>level</em> of the <code>by</code> factor. Using the <span class="math inline">(X_p)</span> matrix approach, we previously saw that we can post-process the model to generate estimates for pairwise differences of smooths. However, the <code>by</code> variable approach of estimating a separate smooth for each level of the factor may be quite inefficient in terms of degrees of freedom used by the model. This is especially so in situations where the estimated curves are quite similar but wiggly; why estimate many separate wiggly smooths when one, plus some simple difference smooths, will do the job just as well? In this post I look at an alternative to estimating separate smooths, using an <em>ordered</em> factor for the <code>by</code> variable.
</p>
<p>
When an <em>ordered</em> factor is passed to <code>by</code>, <strong>mgcv</strong> does something quite different to the model I described previously, although the end results should be similar. What <strong>mgcv</strong> does in the <em>ordered</em> factor case is to fit <span class="math inline">(L-1)</span> <em>difference smooths</em>, where <span class="math inline">(l = 1, \dots, L)</span> are the levels of the factor and <span class="math inline">(L)</span> the number of levels. These smooths model the difference between the smooth estimated for the reference level and the <span class="math inline">(l)</span>th level of the factor. Additionally, the <code>by</code> variable smooth doesnāt itself estimate the smoother for the reference level, so we are required to add a second smooth to the model that estimates that particular smooth.
</p>
<p>
In pseudo code our model would be something like, for ordered factor <code>of</code>,
</p>
<div id="cb1" class="sourceCode">
<pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" title="1">model <-<span class="st"> </span><span class="kw">gam</span>(y <span class="op">~</span><span class="st"> </span>of <span class="op">+</span><span class="st"> </span><span class="kw">s</span>(x) <span class="op">+</span><span class="st"> </span><span class="kw">s</span>(x, <span class="dt">by =</span> of), <span class="dt">data =</span> df)</a></code></pre>
</div>
<p>
As with any <code>by</code> factor smooth, we are required to include a parametric term for the factor because the individual smooths are centered for identifiability reasons. The first <code>s(x)</code> in the model is the smooth effect of <code>x</code> on the <em>reference</em> level of the ordered factor <code>of</code>. The second smoother, <code>s(x, by = of)</code>, is the set of <span class="math inline">(L-1)</span> <em>difference</em> smooths, which model the smooth differences between the reference level smoother and those of the individual levels (excluding the reference one).
</p>
<p>
Note that this model still estimates a separate smoother for each level of the ordered factor; it just does it in a different way. The smoother for the reference level is estimated via the contribution from <code>s(x)</code> <em>only</em>, whilst the smoothers for the other levels are formed from the additive combination of <code>s(x)</code> and the relevant difference smoother from the set created by <code>s(x, by = of)</code>. This is analogous to the situation we have when estimating an ANOVA using the default contrasts and <code>lm()</code>; the intercept is then an estimate of the mean response for the reference level of the factor, and the remaining model coefficients estimate the <em>differences</em> between the mean response of the reference level and that of the other factor levels.
</p>
<p>
This <em>ordered-factor-smooth interaction</em> is most directly applicable to situations where you have a reference category and you are interested in differences between that category and the other levels. If you are interested in pair-wise comparisons of smooths you could use the ordered factor approach ā it may be more parsimonious than estimating separate smoothers for each level ā but you will still need to post-process the results in a manner similar to that described in the previous post<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>.
</p>
<p>
To illustrate the ordered factor difference smooths, Iāll reuse the example from the <em>Geochimica</em> <a href="http://doi.org/10.1016/j.gca.2010.12.026">paper</a> I wrote with my colleagues at UCL, Neil Rose, Handong Yang, and Simon Turner <span class="citation" data-cites="Rose2012-pl">(Rose et al., 2012)</span>, and which formed the basis for the previous post.
</p>
<p>
Neil, Handong, and Simon had collected sediment cores from several Scottish lochs and measured metal concentrations, especially of lead (Pb) and mercury (Hg), in sediment slices covering the last 200 years. The aim of the study was to investigate sediment profiles of these metals in three regions of Scotland; north east, north west, and south west. A pair of lochs in each region was selected, one in a catchment with visibly eroding peat/soil, and the other in a catchment without erosion. The different regions represented variations in historical deposition levels, whilst the hypothesis was that cores from eroded and non-eroded catchments would show differential responses to reductions in emissions of Pb and Hg to the atmosphere. The difference, it was hypothesized, was that the eroding soil acts as a secondary source of pollutants to the lake. You can read more about it in the <a href="http://doi.org/10.1016/j.gca.2010.12.026">paper</a> ā if youāre interested but donāt have access to the journal, send me an email and Iāll pass on a pdf.
</p>
<p>
Below I make use of the following packages
</p>
<ul>
<li>
<strong>readr</strong>
</li>
<li>
<strong>dplyr</strong>
</li>
<li>
<strong>ggplot2</strong>, and
</li>
<li>
<strong>mgcv</strong>
</li>
</ul>
<p>
Youāll more than likely have these installed, but if you get errors about missing packages when you run the code chunk below, install any missing packages and run the chunk again
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'readr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'ggplot2'</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'mgcv'</span><span class="p">)</span></code></pre>
</figure>
<p>
Next, load the data set and convert the <code>SiteCode</code> variable to a factor
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">uri</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'https://gist.githubusercontent.com/gavinsimpson/eb4ff24fa9924a588e6ee60dfae8746f/raw/geochimica-metals.csv'</span><span class="w">
</span><span class="n">metals</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_csv</span><span class="p">(</span><span class="n">uri</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">col_types</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'ciccd'</span><span class="p">))</span><span class="w">
</span><span class="n">metals</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">metals</span><span class="p">,</span><span class="w"> </span><span class="n">SiteCode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">SiteCode</span><span class="p">))</span></code></pre>
</figure>
<p>
This is a subset of the data used in <span class="citation" data-cites="Rose2012-pl">Rose et al. (2012)</span> ā the Hg concentrations in the sediments for just three of the lochs are included here in the interests of simplicity. The data set contains 5 variables
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">metals</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 44 x 5
SiteCode Date SoilType Region Hg
<fctr> <int> <chr> <chr> <dbl>
1 CHNA 2000 thin NW 3.843399
2 CHNA 1990 thin NW 5.424618
3 CHNA 1980 thin NW 8.819730
4 CHNA 1970 thin NW 11.417457
5 CHNA 1960 thin NW 16.513540
6 CHNA 1950 thin NW 16.512047
7 CHNA 1940 thin NW 11.188840
8 CHNA 1930 thin NW 11.622222
9 CHNA 1920 thin NW 13.645853
10 CHNA 1910 thin NW 11.181711
# ... with 34 more rows</code></pre>
</figure>
<ul>
<li>
<code>SiteCode</code> is a factor indexing the three lochs, with levels <code>CHNA</code>, <code>FION</code>, and <code>NODH</code>,
</li>
<li>
<code>Date</code> is a numeric variable of sediment age per sample,
</li>
<li>
<code>SoilType</code> and <code>Region</code> are additional factors for the (natural) experimental design, and
</li>
<li>
<code>Hg</code> is the response variable of interest, and contains the Hg concentration of each sediment sample.
</li>
</ul>
<p>
Neil gave me permission to make these data available openly should you want to try this approach out for yourself. If you make use of the data for other purposes, please cite the source publication <span class="citation" data-cites="Rose2012-pl">(Rose et al., 2012)</span> and recognize the contribution of the data creators; Handong Yang, Simon Turner, and Neil Rose.
</p>
<p>
To proceed, we need to create an ordered factor. Here Iām going to use the <code>SoilType</code> variable, as it is easier to relate to the condition of the soil than the site codes I used in the previous post. I set the <code>non-eroded</code> level to be the reference, and as such the GAM will estimate a full smooth for that level plus smooth differences between the <code>non-eroded</code> level and each of the <code>eroded</code> and <code>thin</code> levels.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">metals</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">metals</span><span class="p">,</span><span class="w">
</span><span class="n">oSoilType</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ordered</span><span class="p">(</span><span class="n">SoilType</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'non-eroded'</span><span class="p">,</span><span class="s1">'eroded'</span><span class="p">,</span><span class="s1">'thin'</span><span class="p">)))</span></code></pre>
</figure>
<p>
The ordered-factor GAM is fitted to the three lochs using the following
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">Hg</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">oSoilType</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Date</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">oSoilType</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">metals</span><span class="p">,</span><span class="w">
</span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'REML'</span><span class="p">)</span></code></pre>
</figure>
<p>
and the resulting smooths can be drawn using the <code>plot()</code> method
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">shade</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">pages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">seWithMean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/difference-smooths-ii-plot-smooths-1.png" alt="Estimated smooth trend for the non-eroded site (top, left), and difference smooths reflecting estimated differences between the non-eroded site and the eroded site (top, right) and thin soil site (bottom, left), respectively." />
<figcaption>
Estimated smooth trend for the non-eroded site (top, left), and difference smooths reflecting estimated differences between the non-eroded site and the eroded site (top, right) and thin soil site (bottom, left), respectively.
</figcaption>
</figure>
<p>
The smooth in the top left is the reference smooth trend for the <code>non-eroded</code> site. The other two smooths are the difference smooths between the <code>non-eroded</code> and <code>eroded</code> sites (top right) and between the <code>non-eroded</code> and <code>thin</code> soil sites (bottom left).
</p>
<p>
It is immediately clear that the difference between the non-eroded and eroded sites is not significant under this model. The estimated difference is linear, which suggests the trend in the eroded site is stronger than the one estimated for the non-eroded site. However, this difference is not so large as to be an identifiably different trend.
</p>
<p>
The difference smooth for the thin soil site is considerably different to that estimated for the non-eroded site; the principal difference being the much reduced trend in the thin soil site, as indicated by the difference smooth acting in opposition to the estimated trend for the non-eroded site.
</p>
<p>
A nice feature of the ordered factor approach is that inference on these differences can be performed formally and directly using the <code>summary()</code> output of the estimated GAM
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
Hg ~ oSoilType + s(Date) + s(Date, by = oSoilType)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.2231 0.6789 19.478 < 2e-16 ***
oSoilType.L -1.6948 1.1608 -1.460 0.15399
oSoilType.Q -4.2847 1.1990 -3.573 0.00114 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Date) 4.843 5.914 10.862 2.67e-07 ***
s(Date):oSoilTypeeroded 1.000 1.000 0.471 0.498
s(Date):oSoilTypethin 3.047 3.779 10.091 1.84e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.76 Deviance explained = 82.1%
-REML = 126.5 Scale est. = 20.144 n = 44</code></pre>
</figure>
<p>
The impressions we formed about the differences in trends are reinforced with actual test statistics; this is a clear advantage of the ordered-factor approach <em>if</em> your problem suits this <em>different from reference</em> situation.
</p>
<p>
One feature to note: because we used an ordered factor, the parametric term for <code>oSoilType</code> uses polynomial contrasts; the <code>.L</code> and <code>.Q</code> coefficients refer to the linear and quadratic terms used to represent the factor, which makes it harder to read off differences in mean Hg concentration between sites. The small example below shows the contrasts in question.
</p>
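<p>
As a quick check (my addition, not part of the original post), we can print the contrast matrix R generates for the ordered factor; with three levels, the polynomial contrasts have exactly the <code>.L</code> and <code>.Q</code> columns seen in the summary.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## the polynomial contrasts underlying the oSoilType parametric term
contrasts(metals$oSoilType)   # columns .L and .Q</code></pre>
</figure>
<p>
If you want to retain a readily interpreted parameterisation, use the <code>SoilType</code> factor for the parametric part:
</p>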
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">Hg</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">SoilType</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Date</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">oSoilType</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">metals</span><span class="p">,</span><span class="w">
</span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'REML'</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
Hg ~ SoilType + s(Date) + s(Date, by = oSoilType)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.722 1.213 13.788 4.88e-15 ***
SoilTypenon-eroded -4.049 1.684 -2.405 0.022115 *
SoilTypethin -6.446 1.681 -3.835 0.000553 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Date) 4.843 5.914 10.862 2.67e-07 ***
s(Date):oSoilTypeeroded 1.000 1.000 0.471 0.498
s(Date):oSoilTypethin 3.047 3.779 10.091 1.84e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.76 Deviance explained = 82.1%
-REML = 125.95 Scale est. = 20.144 n = 44</code></pre>
</figure>
<p>
Now the output in the parametric terms section is easier to interpret, yet we retain the reference-smooth-plus-difference-smooths behaviour of the fitted GAM.
</p>
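<p>
To see what the model implies on the response scale, we can also generate predicted trends for each soil type. The following is a minimal sketch (my addition, not part of the original post) using <code>predict()</code>; the range of <code>Date</code> values is an assumption based on the data shown above, and the intervals are approximate 95% pointwise intervals.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## predicted Hg trends per soil type from the fitted GAM `m`
new_df <- expand.grid(Date = seq(1900, 2000, by = 2),
                      SoilType = c('non-eroded', 'eroded', 'thin'))
## the model formula uses both the plain and the ordered factor
new_df <- transform(new_df,
                    oSoilType = ordered(SoilType,
                                        levels = c('non-eroded', 'eroded', 'thin')))
pred <- predict(m, newdata = new_df, se.fit = TRUE)
new_df <- transform(new_df,
                    fitted = pred$fit,
                    lower  = pred$fit - (2 * pred$se.fit),
                    upper  = pred$fit + (2 * pred$se.fit))
head(new_df)</code></pre>
</figure>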
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Rose2012-pl">
<p>
Rose, N. L., Yang, H., Turner, S. D., and Simpson, G. L. (2012). An assessment of the mechanisms for the transfer of lead and mercury from atmospherically contaminated organic soils to lake sediments with particular reference to Scotland, UK. <em>Geochimica et Cosmochimica Acta</em> 82, 113–135. doi:<a href="https://doi.org/10.1016/j.gca.2010.12.026">10.1016/j.gca.2010.12.026</a>.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
Except now you need to be sure to include the right set of basis functions that correspond to the pair of levels you want to compare. <em>You can't do that with the function I included in that post; it requires something a bit more sophisticated, but the principles are the same</em>.<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
First steps with MRF smooths
Gavin L. Simpson
2017-10-19T12:00:00-06:00
2017-10-19T12:00:00-06:00
https://www.fromthebottomoftheheap.net/2017/10/19/first-steps-with-mrf-smooths/
<p>
One of the specialist smoother types in the <strong>mgcv</strong> package is the Markov Random Field (MRF) smooth. This smoother essentially allows you to model spatial data with an intrinsic Gaussian Markov random field (GMRF). GMRFs are often used for spatial data measured over discrete spatial regions. MRFs are quite flexible as you can think about them as representing an undirected graph whose nodes are your samples and the connections between the nodes are specified via a neighbourhood structure. I've become interested in using these MRF smooths to include information about relationships between species. However, these smooths are not widely documented in the smoothing literature, so working out how best to use them to do what we want has been a little tricky once you move beyond the typical spatial examples. As a result I've been fiddling with these smooths, fitting them to some spatial data I came across in a tutorial <a href="https://pudding.cool/process/regional_smoothing/">Regional Smoothing in R</a> from The Pudding. In this post I take a quick look at how to use the MRF smooth in <strong>mgcv</strong> to model a discrete spatial data set from the US Census Bureau.
</p>
<p>
In that tutorial, the example data are taken from the US Census Bureau via a shapefile prepared by the author. After a little munging (quite a few steps are missing from the tutorial) I managed to get data from the shapefile that matched what was used in the tutorial. The data are county-level percentages of US adults whose highest level of educational attainment is a high school diploma. The raw data are shown in the figure below
</p>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/first-steps-with-mrf-smooths-hsd-data-plot-1.png" />
</p>
<p>
To follow along, you'll need to download the example <a href="https://github.com/polygraph-cool/smoothing_tutorial/blob/master/us_county_hs_only.zip">shapefile</a> provided by the author of the post on The Pudding. The shapefile(s) are in a ZIP, which I extracted into the working directory; the code below assumes this.
</p>
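<p>
If you prefer to script that step, the following sketch downloads and extracts the ZIP from R; note that the raw-file URL is my assumption, derived from the repository linked above, so check that it still resolves.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## fetch and extract the tutorial shapefile into the working directory
zipfile <- 'us_county_hs_only.zip'
download.file('https://github.com/polygraph-cool/smoothing_tutorial/raw/master/us_county_hs_only.zip',
              destfile = zipfile, mode = 'wb')
unzip(zipfile)</code></pre>
</figure>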
<p>
This post will make use of the following set of packages; load them now, as shown below, and install any that you may be missing
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'rgdal'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'proj4'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'spdep'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'mgcv'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'ggplot2'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'viridis'</span><span class="p">)</span></code></pre>
</figure>
<p>
Assuming you have extracted the shapefile, we load it into R using <code>readOGR()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">shp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readOGR</span><span class="p">(</span><span class="s1">'.'</span><span class="p">,</span><span class="w"> </span><span class="s1">'us_county_hs_only'</span><span class="p">)</span></code></pre>
</figure>
<p>
and do some data munging
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## select only mainland US counties</span><span class="w">
</span><span class="n">states</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">4</span><span class="o">:</span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="m">8</span><span class="o">:</span><span class="m">13</span><span class="p">,</span><span class="w"> </span><span class="m">16</span><span class="o">:</span><span class="m">42</span><span class="p">,</span><span class="w"> </span><span class="m">44</span><span class="o">:</span><span class="m">51</span><span class="p">,</span><span class="w"> </span><span class="m">53</span><span class="o">:</span><span class="m">56</span><span class="p">)</span><span class="w">
</span><span class="n">shp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">shp</span><span class="p">[</span><span class="n">shp</span><span class="o">$</span><span class="n">STATEFP</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">sprintf</span><span class="p">(</span><span class="s1">'%02i'</span><span class="p">,</span><span class="w"> </span><span class="n">states</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">droplevels</span><span class="p">(</span><span class="n">as</span><span class="p">(</span><span class="n">shp</span><span class="p">,</span><span class="w"> </span><span class="s1">'data.frame'</span><span class="p">))</span><span class="w">
</span><span class="c1">## project data</span><span class="w">
</span><span class="n">aea.proj</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96"</span><span class="w">
</span><span class="n">shp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">spTransform</span><span class="p">(</span><span class="n">shp</span><span class="p">,</span><span class="w"> </span><span class="n">CRS</span><span class="p">(</span><span class="n">aea.proj</span><span class="p">))</span><span class="w"> </span><span class="c1"># project to Albers</span><span class="w">
</span><span class="n">shpf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fortify</span><span class="p">(</span><span class="n">shp</span><span class="p">,</span><span class="w"> </span><span class="n">region</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'GEOID'</span><span class="p">)</span><span class="w">
</span><span class="c1">## Need a proportion for fitting</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">hsd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hs_pct</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">100</span><span class="p">)</span></code></pre>
</figure>
<p>
The shapefile contains US Census Bureau data for all US counties, including many that are far from the continental USA. The tutorial from The Pudding doesn't go into how these counties were removed, or how the map was drawn without them. For our purposes they may cause complications when we try to model them using the MRF smooth. I'm sure the modelling approach can handle data like this, but as I wanted to achieve something that followed the tutorial I've removed everything not linked to the continental US landmass, including (I'm sorry!) Alaska and Hawaii; my <strong>ggplot</strong> and mapping skills aren't yet good enough to move Alaska and Hawaii to the bottom left of such maps.
</p>
<p>
The data were projected using the Albers equal area projection and subsequently passed to the <code>fortify()</code> method from <strong>ggplot2</strong> to get a version of the county polygons suitable for plotting with that package.
</p>
<p>
Finally, I created a new variable <code>hsd</code>, which is just the variable <code>hs_pct</code> divided by 100. This creates a proportion that we'll need for model fitting, as you'll see shortly.
</p>
<p>
Before we can model these data with <code>gam()</code>, we need to create the supporting information that <code>gam()</code> will use to create the MRF smooth penalty. The penalty matrix in an MRF smooth is based on the neighbourhood structure of the observations. There are three ways to pass this information to <code>gam()</code>
</p>
<ol type="1">
<li>
as a list of polygons (not <code>SpatialPolygons</code>, I believe)
</li>
<li>
as a list containing the neighbourhood structure, or
</li>
<li>
the raw penalty matrix itself.
</li>
</ol>
<p>
Options 1 and 3 aren't easily doable as far as I can see; <code>gam()</code> isn't expecting the sort of object we created when we imported the shapefile, and nobody wants to build a penalty matrix by hand! Thankfully option 2, the neighbourhood structure, is relatively easy to create. For that I use the <code>poly2nb()</code> function from the <strong>spdep</strong> package. This function takes a shapefile and works out which regions are neighbours of any other region by virtue of them sharing a border. To make sure everything matches up nicely in the way <code>gam()</code> wants this list, we specify that the region IDs should be the <code>GEOID</code>s from the original data set (the <code>GEOID</code> uniquely identifies each county) and we have to set the <code>names</code> attribute on the neighbourhood list to match these unique IDs
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">nb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">poly2nb</span><span class="p">(</span><span class="n">shp</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="o">$</span><span class="n">GEOID</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">nb</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">attr</span><span class="p">(</span><span class="n">nb</span><span class="p">,</span><span class="w"> </span><span class="s2">"region.id"</span><span class="p">)</span></code></pre>
</figure>
<p>
The result of the previous chunk is a list whose names map on to the levels of the <code>GEOID</code> factor. The values in each element of <code>nb</code> index the elements of <code>nb</code> that are neighbours of the current element
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">str</span><span class="p">(</span><span class="n">nb</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">6</span><span class="p">])</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">List of 6
$ 19107: int [1:6] 1417 1464 1632 2277 2278 2851
$ 19189: int [1:6] 551 1414 2151 2452 2846 2849
$ 20093: int [1:7] 5 557 1064 1142 1437 1441 2978
$ 20123: int [1:5] 1469 1565 2648 2966 2977
$ 20187: int [1:7] 3 554 1142 1441 1620 2142 2238
$ 21005: int [1:7] 582 583 953 954 1770 1861 2169</code></pre>
</figure>
<p>
With that done we can now fit the GAM. Fitting this is going to take a wee while (over 3 hours for the full rank MRF, using 6 threads, on a reasonably powerful 3-year-old workstation with dual 4-core Xeon processors). To specify an MRF smooth we use the <code>bs</code> argument to the <code>s()</code> function, setting it to <code>bs = 'mrf'</code>. The neighbourhood list is passed via the <code>xt</code> argument, which takes a list as a value; here we specify a component <code>nb</code> which takes our neighbourhood list <code>nb</code>. The final set-up choice is whether to fit a full rank MRF, where a coefficient for each county will be estimated, or a reduced rank MRF, wherein the MRF is represented using fewer coefficients and counties are mapped to the smaller set of coefficients. The rank of the MRF smooth is set using the <code>k</code> argument. The default is to fit a full rank MRF, whilst setting <code>k < NROW(data)</code> will result in a reduced-rank MRF being estimated.
</p>
<p>
The full rank MRF model is estimated using
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ctrl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam.control</span><span class="p">(</span><span class="n">nthreads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">)</span><span class="w"> </span><span class="c1"># use 6 parallel threads, reduce if fewer physical CPU cores</span><span class="w">
</span><span class="n">m1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">hsd</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">GEOID</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'mrf'</span><span class="p">,</span><span class="w"> </span><span class="n">xt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">nb</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nb</span><span class="p">)),</span><span class="w"> </span><span class="c1"># define MRF smooth</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w">
</span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'REML'</span><span class="p">,</span><span class="w"> </span><span class="c1"># REML smoothness selection</span><span class="w">
</span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">betar</span><span class="p">())</span><span class="w"> </span><span class="c1"># fit a beta regression</span></code></pre>
</figure>
<p>
As the response is a proportion, the fitted GAM uses the beta distribution as the conditional distribution of the response. The default link is the logit, just as it is for the binomial distribution, and it ensures that fitted values on the scale of the linear predictor are mapped onto the allowed range for proportions, 0–1.
</p>
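<p>
As a small aside (my illustration, not the original post's), the logit link and its inverse are available in base R as <code>qlogis()</code> and <code>plogis()</code>, which makes it easy to move between the two scales by hand.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## map a proportion to the linear predictor scale and back again
qlogis(0.35)        # log(0.35 / 0.65) = -0.6190392
plogis(-0.6190392)  # inverse logit recovers 0.35</code></pre>
</figure>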
<p>
The final model uses in the region of 1700 effective degrees of freedom. This is the smoothness penalty at work; rather than spending all 3108 individual coefficients, the penalty, which tries to arrange for neighbouring counties to have similar coefficients, has shrunk away almost half of the complexity implied by the full rank MRF.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m1</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: Beta regression(179.532)
Link function: logit
Formula:
hsd ~ s(GEOID, bs = "mrf", xt = list(nb = nb))
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.63806 0.00283 -225.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(GEOID) 1732 3107 9382 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.769 Deviance explained = 89.7%
-REML = -4544 Scale est. = 1 n = 3108</code></pre>
</figure>
<p>
Whilst the penalty enforces smoothness, further smoothing can be imposed by fitting a reduced rank MRF. In the next code block I fit models with <code>k = 300</code> and <code>k = 30</code> respectively, which imply considerable smoothing relative to the full rank model.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## rank 300 MRF</span><span class="w">
</span><span class="n">m2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">hsd</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">GEOID</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'mrf'</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">300</span><span class="p">,</span><span class="w"> </span><span class="n">xt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">nb</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nb</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'REML'</span><span class="p">,</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">betar</span><span class="p">())</span><span class="w">
</span><span class="c1">## rank 30 MRF</span><span class="w">
</span><span class="n">m3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">hsd</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">GEOID</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'mrf'</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="n">xt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">nb</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nb</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'REML'</span><span class="p">,</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">betar</span><span class="p">())</span></code></pre>
</figure>
<p>
To visualise the different fits we need to generate predicted values on the response scale for each county and add these to the county data <code>df</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w">
</span><span class="n">mrfFull</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'response'</span><span class="p">),</span><span class="w">
</span><span class="n">mrfRrank300</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m2</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'response'</span><span class="p">),</span><span class="w">
</span><span class="n">mrfRrank30</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m3</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'response'</span><span class="p">))</span></code></pre>
</figure>
<p>
Before we can plot these fitted values we need to merge <code>df</code> with the fortified shapefile
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## merge data with fortified shapefile</span><span class="w">
</span><span class="n">mdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">left_join</span><span class="p">(</span><span class="n">shpf</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'id'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'GEOID'</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning: Column `id`/`GEOID` joining character vector and factor, coercing
into character vector</code></pre>
</figure>
<p>
To facilitate plotting with <strong>ggplot2</strong> I begin by creating some fixed plot components, like the theme, scale, and labels
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">theme_map</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">theme_minimal</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">...</span><span class="p">,</span><span class="w">
</span><span class="n">axis.line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.text.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.title.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.title.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">panel.border</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">())</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">myTheme</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">theme_map</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'bottom'</span><span class="p">)</span><span class="w">
</span><span class="n">myScale</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">scale_fill_viridis</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'%'</span><span class="p">,</span><span class="w"> </span><span class="n">option</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'plasma'</span><span class="p">,</span><span class="w">
</span><span class="n">limits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.1</span><span class="p">,</span><span class="w"> </span><span class="m">0.55</span><span class="p">),</span><span class="w">
</span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w">
</span><span class="n">guide</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">guide_colorbar</span><span class="p">(</span><span class="n">direction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"horizontal"</span><span class="p">,</span><span class="w">
</span><span class="n">barheight</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unit</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"mm"</span><span class="p">),</span><span class="w">
</span><span class="n">barwidth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unit</span><span class="p">(</span><span class="m">75</span><span class="p">,</span><span class="w"> </span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"mm"</span><span class="p">),</span><span class="w">
</span><span class="n">title.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'left'</span><span class="p">,</span><span class="w">
</span><span class="n">title.hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w">
</span><span class="n">label.hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">))</span><span class="w">
</span><span class="n">myLabs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'US Adult Education'</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'% of adults where high school diploma is highest level education'</span><span class="p">,</span><span class="w">
</span><span class="n">caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Source: US Census Bureau'</span><span class="p">)</span></code></pre>
</figure>
<p>
I took many of these settings from Timo Grossenbacher's excellent <a href="https://timogrossenbacher.ch/2016/12/beautiful-thematic-maps-with-ggplot2-only/">post on mapping regional demographic data in Switzerland</a>.
</p>
<p>
Now we can plot the fitted proportions. Note that whilst we plot proportions, the colour bar labels are in percentages in keeping with the original data (see the definition of <code>myScale</code> to see how this was achieved).
</p>
<p>
Fitted values from the full rank MRF are shown below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mdata</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lat</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">group</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_polygon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mrfFull</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">(</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'black'</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_equal</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">myTheme</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">myScale</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">myLabs</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/first-steps-with-mrf-smooths-plot-full-rank-mrf-1.png" />
</p>
<p>
This model explains about 90% of the deviance in the original data. Whilst some smoothing is evident, the fitted values show a considerable amount of non-spatial variation. This is most likely due to not including important covariates, such as county average income, which might explain some of the finer-scale structure, such as neighbouring counties with quite different proportions. A more considered analysis would include these and other relevant predictors alongside the MRF.
</p>
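<p>
As a sketch of what that more considered analysis might look like (hypothetical: <code>income</code> is not a variable in this data set), a covariate smooth simply sits alongside the MRF term in the formula.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## hypothetical: a county-level covariate smooth alongside the MRF
m_cov <- gam(hsd ~ s(income) + s(GEOID, bs = 'mrf', xt = list(nb = nb)),
             data = df, method = 'REML', control = ctrl,
             family = betar())</code></pre>
</figure>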
<p>
Smoother surfaces can be achieved via the reduced rank MRFs. First the rank 300 MRF
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mdata</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lat</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">group</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_polygon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mrfRrank300</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">(</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'black'</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_equal</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">myTheme</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">myScale</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">myLabs</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/first-steps-with-mrf-smooths-plot-rank-300-mrf-1.png" />
</p>
<p>
and next the rank 30 MRF
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mdata</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lat</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">group</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_polygon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mrfRrank30</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">(</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'black'</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_equal</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">myTheme</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">myScale</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">myLabs</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/first-steps-with-mrf-smooths-plot-rank-30-mrf-1.png" />
</p>
<p>
As can be clearly seen from the plots, the degree of smoothness can be controlled effectively via the <code>k</code> argument.
</p>
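<p>
If you want a rough numerical comparison of the three fits (my addition, not in the original post), the <code>AIC()</code> method for fitted GAMs provides one, with lower values indicating better support once effective model complexity is accounted for.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## compare the full rank and reduced rank MRF fits
AIC(m1, m2, m3)</code></pre>
</figure>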
<p>
In a future post I'll take a closer look at using MRFs alongside other covariates as part of a more complex spatial modelling exercise.
</p>
Comparing smooths in factor-smooth interactions I
Gavin L. Simpson
2017-10-10T16:00:00-06:00
2017-10-10T16:00:00-06:00
https://www.fromthebottomoftheheap.net/2017/10/10/difference-splines-i/
<p>
One of the really appealing features of the <strong>mgcv</strong> package for fitting GAMs is the functionality it exposes for fitting quite complex models, models that lie well beyond what many of us may have learned about what GAMs can do. One of those features that I use a lot is the ability to model the smooth effects of some covariate <span class="math inline">\(x\)</span> in the different levels of a factor. Having estimated a separate smoother for each level of the factor, the obvious question is, which smooths are different? In this post I'll take a look at one way to do this using <code>by</code>-variable smooths.
</p>
<p>
With <strong>mgcv</strong>, smooths are included in model formulae using the <code>s()</code> function. If you want to have the smooth equivalent of a continuous-factor interaction, one way to achieve this is via the <code>by</code> argument to <code>s()</code>. If you pass a factor to <code>by</code>, <strong>mgcv</strong> sets up the model matrix in such a way that you get a separate smoother for each level of the <code>by</code> factor. Each of these smoothers gets its own smoothness parameter, so you can fit a wiggly function in level <em>foo</em> and a smooth function in level <em>bar</em>, with each level's function being learned from the data associated with that level.
</p>
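<p>
In formula terms, the pattern looks like the following minimal sketch, with hypothetical names: response <code>y</code>, covariate <code>x</code>, factor <code>f</code>, and data frame <code>dat</code>. The parametric <code>f</code> term is there because, as discussed below, each <code>by</code> smooth is centred and so cannot absorb differences in group means.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library('mgcv')
## a separate smooth of x for each level of f, each with its own
## smoothness parameter, plus f itself to model the group means
m <- gam(y ~ f + s(x, by = f), data = dat, method = 'REML')</code></pre>
</figure>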
<p>
I used this technique in a <a href="http://doi.org/10.1016/j.gca.2010.12.026">paper</a> I wrote with my colleagues at UCL, Neil Rose, Handong Yang, and Simon Turner <span class="citation" data-cites="Rose2012-pl">(Rose et al., 2012)</span>. Neil, Handong, and Simon had collected sediment cores from several Scottish lochs and measured metal concentrations, especially of lead (Pb) and mercury (Hg), in sediment slices covering the last 200 years. The aim of the study was to investigate sediment profiles of these metals in three regions of Scotland; north east, north west, and south west. A pair of lochs in each region was selected, one in a catchment with visibly eroding peat/soil, and the other in a catchment without erosion. The different regions represented variations in historical deposition levels, whilst the hypothesis was that cores from eroded and non-eroded catchments would show differential responses to reductions in emissions of Pb and Hg to the atmosphere. The difference, it was hypothesised, was that the eroding soil acts as a secondary source of pollutants to the lake. You can read more about it in the <a href="http://doi.org/10.1016/j.gca.2010.12.026">paper</a>; if you're interested but don't have access to the journal, send me an email and I'll pass on a pdf.
</p>
<p>
It was relatively simple to fit splines to each sediment profile, but once I'd done this, how were we going to estimate the difference between the fitted trends? Thankfully, I already had the answer, as Simon Wood had supplied code to do it to an OP on the R-Help listserver some years previously. That answer involved <code>by</code>-variable smoothers, which I was already using, and the use of the <span class="math inline">\(X_p\)</span> matrix of the fitted GAM.
</p>
<p>
Readers of this blog will have heard about the <span class="math inline">\(X_p\)</span> matrix before; it's used a lot when we want to simulate from the posterior of the estimated model. Importantly, for our purposes, it allows for the creation of derived quantities from the fitted model, and the assignment of uncertainty to those quantities.
</p>
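<p>
For concreteness, here is a minimal sketch of how the <span class="math inline">\(X_p\)</span> matrix is obtained in <strong>mgcv</strong>; <code>mod</code> and <code>pdat</code> are placeholders for a fitted model and a data frame of covariate values at which to evaluate it.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## the prediction (Xp) matrix maps coefficients to fitted values at
## the covariate values in `pdat`
Xp  <- predict(mod, newdata = pdat, type = 'lpmatrix')
eta <- Xp %*% coef(mod)   # fitted values on the linear predictor scale</code></pre>
</figure>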
<p>
In this post Iāll illustrate how to do the required comparison using some of the data from that study on Scottish lochs.
</p>
<p>
In this post I'll use the following packages
</p>
<ul>
<li>
<strong>readr</strong>
</li>
<li>
<strong>dplyr</strong>
</li>
<li>
<strong>ggplot2</strong>, and
</li>
<li>
<strong>mgcv</strong>
</li>
</ul>
<p>
You'll more than likely have these installed, but if you get errors about missing packages when you run the code chunk below, install any missing packages and run the chunk again
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'readr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'ggplot2'</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'mgcv'</span><span class="p">)</span></code></pre>
</figure>
<p>
Next, load the data set and convert the <code>SiteCode</code> variable to a factor for use in fitting the GAM with <code>gam()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">uri</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'https://gist.githubusercontent.com/gavinsimpson/eb4ff24fa9924a588e6ee60dfae8746f/raw/geochimica-metals.csv'</span><span class="w">
</span><span class="n">metals</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_csv</span><span class="p">(</span><span class="n">uri</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">col_types</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'ciccd'</span><span class="p">))</span><span class="w">
</span><span class="n">metals</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">metals</span><span class="p">,</span><span class="w"> </span><span class="n">SiteCode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">SiteCode</span><span class="p">))</span></code></pre>
</figure>
<p>
This is a subset of the data used in <span class="citation" data-cites="Rose2012-pl">Rose et al. (2012)</span>; the Hg concentrations in the sediments for just three of the lochs are included here in the interests of simplicity. The data set contains 5 variables
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">metals</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 44 x 5
SiteCode Date SoilType Region Hg
<fctr> <int> <chr> <chr> <dbl>
1 CHNA 2000 thin NW 3.843399
2 CHNA 1990 thin NW 5.424618
3 CHNA 1980 thin NW 8.819730
4 CHNA 1970 thin NW 11.417457
5 CHNA 1960 thin NW 16.513540
6 CHNA 1950 thin NW 16.512047
7 CHNA 1940 thin NW 11.188840
8 CHNA 1930 thin NW 11.622222
9 CHNA 1920 thin NW 13.645853
10 CHNA 1910 thin NW 11.181711
# ... with 34 more rows</code></pre>
</figure>
<ul>
<li>
<code>SiteCode</code> is a factor indexing the three lochs, with levels <code>CHNA</code>, <code>FION</code>, and <code>NODH</code>,
</li>
<li>
<code>Date</code> is a numeric variable of sediment age per sample,
</li>
<li>
<code>SoilType</code> and <code>Region</code> are additional factors for the (natural) experimental design, and
</li>
<li>
<code>Hg</code> is the response variable of interest, and contains the Hg concentration of each sediment sample.
</li>
</ul>
<p>
Neil gave me permission to make these data available openly should you want to try this approach out for yourself. If you make use of the data for other purposes, please cite the source publication <span class="citation" data-cites="Rose2012-pl">(Rose et al., 2012)</span> and recognize the contribution of the data creators; Handong Yang, Simon Turner, and Neil Rose.
</p>
<p>
The data, with LOESS smoothers superimposed, are shown below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">metals</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Hg</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SiteCode</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_smooth</span><span class="p">(</span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'loess'</span><span class="p">,</span><span class="w"> </span><span class="n">se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_brewer</span><span class="p">(</span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'qual'</span><span class="p">,</span><span class="w"> </span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Dark2'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'top'</span><span class="p">)</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/difference-smooths-i-plot-data-1.png" />
</p>
<p>
Smooth-factor interactions can be estimated using <code>gam()</code> in a number of different ways. Here we use <code>by</code>-variable smooths. Each of the separate smooths is subject to identifiability constraints, which effectively centres each smooth around zero effect. As such, differences in the mean Hg concentrations of the lochs are not accounted for by the smooths. To rectify this we'll need to add <code>SiteCode</code> as a parametric term to the model, along with the smooths.
</p>
<p>
The GAM is fitted to the three sites, and the fit summarized, using the following code
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">Hg</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">SiteCode</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SiteCode</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">metals</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
Hg ~ SiteCode + s(Date, by = SiteCode)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.2970 0.7889 13.052 2.19e-12 ***
SiteCodeFION 2.3260 1.1163 2.084 0.048026 *
SiteCodeNODH 5.5587 1.3288 4.183 0.000332 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Date):SiteCodeCHNA 2.744 3.412 3.786 0.0187 *
s(Date):SiteCodeFION 5.711 6.861 18.745 7.66e-12 ***
s(Date):SiteCodeNODH 8.574 8.922 19.086 7.62e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.889 Deviance explained = 93.8%
GCV = 17.076 Scale est. = 9.3029 n = 44</code></pre>
</figure>
<p>
and the resulting smooths can be drawn using the <code>plot()</code> method
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">shade</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">pages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/difference-smooths-i-plot-smooths-1.png" alt="Estimated smooths for each level of factor SiteCode" />
<figcaption>
Estimated smooths for each level of factor <code>SiteCode</code>
</figcaption>
</figure>
<h2 id="differences-of-smooths">
Differences of smooths
</h2>
<p>
To calculate the differences between pairs of the three smooths estimated in the model we need to be able to evaluate the smooths at a set of values of <code>Date</code>. Below we specify a fine grid of points over the time-scale of each core. This set of prediction data is passed to the <code>predict()</code> method and the <span class="math inline">\(X_p\)</span> matrix is requested with the option <code>type = 'lpmatrix'</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">Date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1860</span><span class="p">,</span><span class="w"> </span><span class="m">2000</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">400</span><span class="p">),</span><span class="w">
</span><span class="n">SiteCode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'FION'</span><span class="p">,</span><span class="w"> </span><span class="s1">'CHNA'</span><span class="p">,</span><span class="w"> </span><span class="s1">'NODH'</span><span class="p">))</span><span class="w">
</span><span class="n">xp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'lpmatrix'</span><span class="p">)</span></code></pre>
</figure>
<p>
The result, stored in <code>xp</code>, is a matrix where the basis functions of the model have been evaluated at the values of the covariates supplied to <code>newdata</code>. To turn this matrix into one containing fitted or predicted values, it needs to be multiplied by the model coefficients and the rows summed. However, while the model is in this <span class="math inline">\(X_p\)</span> form we can compute differences between the evaluated smooths before computing fitted values.
</p>
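<p>
As a quick sanity check (not needed for the difference calculation itself), multiplying <span class="math inline">\(X_p\)</span> by the estimated coefficients should reproduce what <code>predict()</code> returns for the same data:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Xp %*% beta-hat equals the usual predictions (on the link scale)
fit_xp <- drop(xp %*% coef(m))
fit_pr <- predict(m, newdata = pdat)
all.equal(unname(fit_xp), unname(fit_pr))</code></pre>
</figure>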
<p>
This process needs to be repeated for each pair of smooths we want to compare; it is a bit like doing all pair-wise post hoc comparisons. A number of steps are involved, which I break down below for the comparison of the smooths for <code>SiteCode == 'CHNA'</code> and <code>SiteCode == 'FION'</code>. After I've gone through the steps, we'll wrap them all into a function which we can use to automate the process.
</p>
<p>
The first step is to identify which columns of <span class="math inline">\(X_p\)</span> relate to the smooths for the pair of levels of <code>SiteCode</code> we are comparing. The rows of <span class="math inline">\(X_p\)</span> that contain the data for this pair of lochs also need to be identified.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## which cols of xp relate to splines of interest?</span><span class="w">
</span><span class="n">c1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">grepl</span><span class="p">(</span><span class="s1">'CHNA'</span><span class="p">,</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">xp</span><span class="p">))</span><span class="w">
</span><span class="n">c2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">grepl</span><span class="p">(</span><span class="s1">'FION'</span><span class="p">,</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">xp</span><span class="p">))</span><span class="w">
</span><span class="c1">## which rows of xp relate to sites of interest?</span><span class="w">
</span><span class="n">r1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">SiteCode</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'CHNA'</span><span class="p">)</span><span class="w">
</span><span class="n">r2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">SiteCode</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'FION'</span><span class="p">)</span></code></pre>
</figure>
<p>
Next, we subtract the elements of <span class="math inline">\(X_p\)</span> for the first loch from the elements of <span class="math inline">\(X_p\)</span> for the second loch. To focus on the difference between the pair of smooths, the columns of the differenced <span class="math inline">\(X_p\)</span> matrix (in <code>X</code>) that aren't involved in the comparison are then set to zero
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## difference rows of xp for data from comparison</span><span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">xp</span><span class="p">[</span><span class="n">r1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">xp</span><span class="p">[</span><span class="n">r2</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="c1">## zero out cols of X related to splines for other lochs</span><span class="w">
</span><span class="n">X</span><span class="p">[,</span><span class="w"> </span><span class="o">!</span><span class="w"> </span><span class="p">(</span><span class="n">c1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">c2</span><span class="p">)]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="c1">## zero out the parametric cols</span><span class="w">
</span><span class="n">X</span><span class="p">[,</span><span class="w"> </span><span class="o">!</span><span class="n">grepl</span><span class="p">(</span><span class="s1">'^s\\('</span><span class="p">,</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">xp</span><span class="p">))]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span></code></pre>
</figure>
<p>
The first zeroing uses the logical indices for columns containing either <code>'CHNA'</code> or <code>'FION'</code>; if you had a model with additional smooths involving the <code>SiteCode</code> variable, you'd need a more sophisticated way of identifying the columns of <span class="math inline">\(X_p\)</span> that relate to the smooths of interest. The second zeroing affects all the columns related to the parametric terms in the model. For this model these relate to the intercept and the two dummy contrasts associated with <code>SiteCode</code> in the model.
</p>
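<p>
For example (a sketch; the exact labels depend on your model), you could match the full by-smooth label rather than the site code alone:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## stricter matching on the full smooth label, e.g. "s(Date):SiteCodeCHNA.1"
c1 <- grepl('^s\\(Date\\):SiteCodeCHNA', colnames(xp))
c2 <- grepl('^s\\(Date\\):SiteCodeFION', colnames(xp))</code></pre>
</figure>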
<p>
Having obtained a suitably modified <span class="math inline">\(X_p\)</span> matrix, predicted values can be obtained by multiplying it by the estimated model coefficients and summing the result row-wise. This is achieved in a single step using a matrix multiplication of <code>X</code> with the vector of model coefficients.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">dif</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre>
</figure>
<p>
Because we zeroed out all the columns not involved directly in the pair of smooths we are comparing, this effectively sets their contributions to the fitted/predicted values to zero as well. The result, stored in <code>dif</code>, is a vector of fitted <em>differences</em> between the pair of smooths we are interested in.
</p>
<p>
Having computed the difference, we want to know how uncertain the estimated difference is. Handily, we can compute the standard errors of the differences using the variance-covariance matrix of the estimated model coefficients. The standard errors are computed using
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">se</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">rowSums</span><span class="p">((</span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">vcov</span><span class="p">(</span><span class="n">m</span><span class="p">))</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">X</span><span class="p">))</span></code></pre>
</figure>
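<p>
This one-liner is computing <span class="math inline">\(\sqrt{\mathrm{diag}(X V X^{T})}\)</span>, where <span class="math inline">\(V\)</span> is the covariance matrix of the model coefficients; <code>rowSums((X %*% vcov(m)) * X)</code> extracts just the diagonal of <span class="math inline">\(X V X^{T}\)</span> without ever forming the whole matrix.
</p>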
<p>
Note that the above assumes that smoothness parameters (which control how wiggly the individual smooths are) are known and fixed. In reality these smoothness parameters were estimated and hence the standard errors just computed are likely biased low. This could be corrected by passing <code>unconditional = TRUE</code> to <code>vcov()</code>.
</p>
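<p>
Applying that correction is a one-line change to the computation above:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## account for smoothness parameter uncertainty in the standard errors
se <- sqrt(rowSums((X %*% vcov(m, unconditional = TRUE)) * X))</code></pre>
</figure>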
<p>
Now that we have standard errors, a point-wise <span class="math inline">\(1 - \alpha\)</span> confidence interval can be created using the critical value of the <em>t</em> distribution with appropriate degrees of freedom (in the case of a Gaussian model; quantiles of the Gaussian distribution would be needed for other conditional distributions). For a 95% interval, we use the following code
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">crit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">qt</span><span class="p">(</span><span class="m">.975</span><span class="p">,</span><span class="w"> </span><span class="n">df.residual</span><span class="p">(</span><span class="n">m</span><span class="p">))</span><span class="w">
</span><span class="n">upr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dif</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">)</span><span class="w">
</span><span class="n">lwr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dif</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">)</span></code></pre>
</figure>
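<p>
For a model where the scale parameter is known, such as a Poisson or binomial GAM, you would swap in quantiles of the standard normal (a sketch of the same step):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## for non-Gaussian models with known scale, use Gaussian quantiles
crit <- qnorm(0.975)</code></pre>
</figure>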
<p>
To allow for these steps to be repeated for all pairwise combinations, the process outlined above is best encapsulated as a function. One such function is shown below, where arguments <code>f1</code>, <code>f2</code>, and <code>var</code> refer to length 1 character vectors specifying the first and second levels of the factor and the name of the <code>by</code>-variable factor respectively.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">smooth_diff</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">f1</span><span class="p">,</span><span class="w"> </span><span class="n">f2</span><span class="p">,</span><span class="w"> </span><span class="n">var</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.05</span><span class="p">,</span><span class="w">
</span><span class="n">unconditional</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">xp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'lpmatrix'</span><span class="p">)</span><span class="w">
</span><span class="n">c1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">grepl</span><span class="p">(</span><span class="n">f1</span><span class="p">,</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">xp</span><span class="p">))</span><span class="w">
</span><span class="n">c2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">grepl</span><span class="p">(</span><span class="n">f2</span><span class="p">,</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">xp</span><span class="p">))</span><span class="w">
</span><span class="n">r1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">newdata</span><span class="p">[[</span><span class="n">var</span><span class="p">]]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">f1</span><span class="w">
</span><span class="n">r2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">newdata</span><span class="p">[[</span><span class="n">var</span><span class="p">]]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">f2</span><span class="w">
</span><span class="c1">## difference rows of xp for data from comparison</span><span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">xp</span><span class="p">[</span><span class="n">r1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">xp</span><span class="p">[</span><span class="n">r2</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="c1">## zero out cols of X related to splines for other lochs</span><span class="w">
</span><span class="n">X</span><span class="p">[,</span><span class="w"> </span><span class="o">!</span><span class="w"> </span><span class="p">(</span><span class="n">c1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">c2</span><span class="p">)]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="c1">## zero out the parametric cols</span><span class="w">
</span><span class="n">X</span><span class="p">[,</span><span class="w"> </span><span class="o">!</span><span class="n">grepl</span><span class="p">(</span><span class="s1">'^s\\('</span><span class="p">,</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">xp</span><span class="p">))]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">dif</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">model</span><span class="p">)</span><span class="w">
</span><span class="n">se</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">rowSums</span><span class="p">((</span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">vcov</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">unconditional</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unconditional</span><span class="p">))</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">X</span><span class="p">))</span><span class="w">
</span><span class="n">crit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">qt</span><span class="p">(</span><span class="n">alpha</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">df.residual</span><span class="p">(</span><span class="n">model</span><span class="p">),</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">upr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dif</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">)</span><span class="w">
</span><span class="n">lwr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dif</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">)</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">pair</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">paste</span><span class="p">(</span><span class="n">f1</span><span class="p">,</span><span class="w"> </span><span class="n">f2</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'-'</span><span class="p">),</span><span class="w">
</span><span class="n">diff</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dif</span><span class="p">,</span><span class="w">
</span><span class="n">se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se</span><span class="p">,</span><span class="w">
</span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upr</span><span class="p">,</span><span class="w">
</span><span class="n">lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwr</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
To complete the pairwise comparison of the estimated smooths, we use the function on the three combinations of pairs of smooths and gather the results into a tidy object <code>comp</code> suitable for plotting with <strong>ggplot2</strong>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">comp1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">smooth_diff</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="s1">'FION'</span><span class="p">,</span><span class="w"> </span><span class="s1">'CHNA'</span><span class="p">,</span><span class="w"> </span><span class="s1">'SiteCode'</span><span class="p">)</span><span class="w">
</span><span class="n">comp2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">smooth_diff</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="s1">'FION'</span><span class="p">,</span><span class="w"> </span><span class="s1">'NODH'</span><span class="p">,</span><span class="w"> </span><span class="s1">'SiteCode'</span><span class="p">)</span><span class="w">
</span><span class="n">comp3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">smooth_diff</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="s1">'CHNA'</span><span class="p">,</span><span class="w"> </span><span class="s1">'NODH'</span><span class="p">,</span><span class="w"> </span><span class="s1">'SiteCode'</span><span class="p">)</span><span class="w">
</span><span class="n">comp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1860</span><span class="p">,</span><span class="w"> </span><span class="m">2000</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">400</span><span class="p">),</span><span class="w">
</span><span class="n">rbind</span><span class="p">(</span><span class="n">comp1</span><span class="p">,</span><span class="w"> </span><span class="n">comp2</span><span class="p">,</span><span class="w"> </span><span class="n">comp3</span><span class="p">))</span></code></pre>
</figure>
<p>
The pairwise differences of smooths and associated confidence intervals can be plotted using
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">comp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">date</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">diff</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pair</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lower</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upper</span><span class="p">),</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">pair</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_cartesian</span><span class="p">(</span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-30</span><span class="p">,</span><span class="m">30</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Difference in Hg trend'</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/difference-smooths-i-plot-difference-smmoths-1.png" alt="Estimated differences of trends in sediment Hg concentration for pairs of Scottish lochs" />
<figcaption>
Estimated differences of trends in sediment Hg concentration for pairs of Scottish lochs
</figcaption>
</figure>
<p>
Where the confidence interval excludes zero, we might infer significant differences between a pair of estimated smooths.
</p>
<h2 id="conclusions">
Conclusions
</h2>
<p>
Regular readers will be familiar with the <span class="math inline">\(X_p\)</span> matrix; I've used this for simulating from the posterior distribution of an estimated GAM, and for computing simultaneous intervals for smoothers, among other things. Here, it is used to compute differences between smooths. The <span class="math inline">\(X_p\)</span> matrix is quite versatile; learning how to use it effectively will allow you to compute all manner of derived quantities related to an estimated GAM.
</p>
<p>
The <code>by</code>-variable type of factor-smooth interaction is just one of the ways of estimating different smooth effects for each level of a factor. One potential disadvantage of this type of smoother is that it is quite wasteful to estimate three separate smooths, each with its own smoothness parameter. More parsimonious ways of fitting factor-smooth interactions are possible with <strong>mgcv</strong>, and I'll look at an alternative option in the next post.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Rose2012-pl">
<p>
Rose, N. L., Yang, H., Turner, S. D., and Simpson, G. L. (2012). An assessment of the mechanisms for the transfer of lead and mercury from atmospherically contaminated organic soils to lake sediments with particular reference to Scotland, UK. <em>Geochimica et Cosmochimica Acta</em> 82, 113–135. doi:<a href="https://doi.org/10.1016/j.gca.2010.12.026">10.1016/j.gca.2010.12.026</a>.
</p>
</div>
</div>
Fitting count and zero-inflated count GLMMs with mgcv
Gavin L. Simpson
2017-05-04T13:45:00-06:00
2017-05-04T13:45:00-06:00
https://www.fromthebottomoftheheap.net/2017/05/04/compare-mgcv-with-glmmTMB/
<p>
A couple of days ago, Mollie Brooks and coauthors posted a <a href="http://doi.org/10.1101/132753">preprint</a> on <a href="http://biorxiv.org/">bioRxiv</a> illustrating the use of the <strong>glmmTMB</strong> R package for fitting zero-inflated GLMMs <span class="citation" data-cites="Brooks2017-so">(Brooks et al., 2017)</span>. In the paper, <strong>glmmTMB</strong> is compared with several other GLMM-fitting packages. <strong>mgcv</strong> has recently gained the ability to fit a wider range of families beyond the exponential family of distributions, including zero-inflated Poisson models. <strong>mgcv</strong> can also fit simple GLMMs through a spline equivalent of a Gaussian random effect. So, whilst I was waiting on some Bayesian GAMs to finish sampling, I decided to see how <strong>mgcv</strong> compared against <strong>glmmTMB</strong> on the two examples used in the paper.
</p>
<div id="refs" class="references">
<div id="ref-Brooks2017-so">
<p>
Brooks, M. E., Kristensen, K., Benthem, K. J. van, Magnusson, A., Berg, C. W., Nielsen, A., et al. (2017). Modeling zero-inflated count data with glmmTMB. <em>bioRxiv</em>, 132753. doi:<a href="https://doi.org/10.1101/132753">10.1101/132753</a>.
</p>
</div>
</div>
<p>
For this post I'll be using a couple of packages beyond <strong>glmmTMB</strong> and <strong>mgcv</strong>; make sure you have <strong>ggplot2</strong> and <strong>ggstance</strong> installed if you wish to run through the code below.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"glmmTMB"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggstance"</span><span class="p">)</span></code></pre>
</figure>
<p>
There are several ways in which <strong>mgcv</strong> allows GLMMs to be fitted, but the way that interests me here is via <code>gam()</code> and the <em>random effect</em> spline basis. Penalised splines of the type provided in <strong>mgcv</strong> can also be represented in mixed model form, such that GAMs can also be fitted using mixed effect modelling software. The general idea is that the spline is decomposed into two parts:
</p>
<ol type="1">
<li>
the perfectly smooth parts of the basis, namely those functions, including constant and linear functions, in the penalty null space of the spline. These are added to the fixed effects model matrix, whilst,
</li>
<li>
the remaining wiggly parts of the basis are treated as random effects.
</li>
</ol>
<p>
Given this duality between splines and random effects, you can reverse the idea and create a spline basis that is the equivalent of a simple Gaussian i.i.d. random effect, such that you can fit a GLMM or GAMM using GAM software like <strong>mgcv</strong>. <strong>mgcv</strong> has the <code>re</code> basis for this, and I'll exploit that to fit the zero-inflated GLMMs to the two examples.
</p>
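<p>
As a minimal sketch of that duality (using simulated data, not either of the example data sets), an <code>'re'</code> smooth of a factor recovers essentially the same variance component as a conventional random intercept:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## simulated data: an 're' smooth vs a plain Gaussian random intercept
set.seed(42)
d <- data.frame(g = factor(rep(1:10, each = 20)))
d$y <- rnorm(10)[d$g] + rnorm(200)
m_re <- gam(y ~ s(g, bs = "re"), data = d, method = "REML")
gam.vcomp(m_re) # compare with lme4::lmer(y ~ 1 + (1 | g), data = d)</code></pre>
</figure>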
<p>
In <span class="citation" data-cites="Brooks2017-so">Brooks et al. (2017)</span>, two example data sets are used:
</p>
<ol type="1">
<li>
<code>Salamanders</code> – Seven combinations of different salamander species and life-stages were repeatedly sampled four times at 23 sites in Appalachian streams <span class="citation" data-cites="Price2016-no">(Price et al., 2016)</span>. Some of the streams were impacted by mountaintop removal and valley filling from coal mining. The data are available from <span class="citation" data-cites="Price2015-se">Price et al. (2015)</span>, as well as the <strong>glmmTMB</strong> package.
</li>
<li>
<code>Owls</code> – the second example is a well-studied one in mixed modelling papers and textbooks <span class="citation" data-cites="Zuur2009-vg Bolker2013">(Zuur et al., 2009; Bolker, 2013)</span>, and relates to the begging behaviour of owl nestlings. The data were originally reported in <span class="citation" data-cites="Roulin2007-rq">Roulin and Bersier (2007)</span>.
</li>
</ol>
<h3 id="salamanders">
Salamanders
</h3>
<p>
<span class="citation" data-cites="Brooks2017-so">Brooks et al. (2017)</span> fit several count models to the <code>Salamander</code> data set, including standard Poisson GLMMs, negative binomial GLMMs, with <span class="math inline">\(\theta\)</span> estimated and modelled via a linear predictor, as well as zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models. Of these, <code>gam()</code> can currently fit all but the negative binomial with <span class="math inline">\(\theta\)</span> modelled via a linear predictor and the ZINB models.
</p>
<p>
The best-fitting model of those presented was a negative binomial model, whilst <span class="citation" data-cites="Brooks2017-so">Brooks et al. (2017)</span> also illustrate how to generate fitted values from the ZIP model. Rather than go through fitting all of the <span class="citation" data-cites="Brooks2017-so">Brooks et al. (2017)</span> models, I restrict fitting here to these two models. A <a href="https://gist.github.com/gavinsimpson/8a0f0e072b095295cf5f7af2762e05a7">gist</a> with code to fit all the models that <code>gam()</code> is capable of is available on GitHub. I have named the models similarly to <span class="citation" data-cites="Brooks2017-so">Brooks et al. (2017)</span> to facilitate comparison.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">nbgam2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">count</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">spp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">mined</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">site</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"re"</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Salamanders</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nb</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ML"</span><span class="p">)</span><span class="w">
</span><span class="n">nbm2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glmmTMB</span><span class="p">(</span><span class="n">count</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">spp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">mined</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">site</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Salamanders</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nbinom2</span><span class="p">)</span></code></pre>
</figure>
<p>
As <code>glmmTMB()</code> is currently only capable of fitting models using maximum likelihood, not REML, I use the Laplace approximate maximum likelihood estimation method for <code>gam()</code>. The new <code>nb</code> family in <strong>mgcv</strong> is for the negative binomial distribution with the (fixed) dispersion parameter <span class="math inline">\(\theta\)</span> estimated as a model parameter, in the same way that <code>MASS::glm.nb()</code> and <code>lme4::glmer.nb()</code> do.
</p>
<p>
In the <code>gam()</code> model, the random effect is specified using the standard <code>s()</code> smooth function with the <code>'re'</code> basis selected. The named variable, here <code>site</code>, should be stored as a factor in the data object to avoid problems.
</p>
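<p>
A cheap defensive check before fitting (hypothetical, but it avoids a hard-to-diagnose failure mode):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## bs = "re" expects the grouping variable to be a factor
stopifnot(is.factor(Salamanders$site))</code></pre>
</figure>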
<p>
The figure below compares the coefficient estimates returned by <code>glmmTMB()</code> and <code>gam()</code>; they are very similar, which is encouraging.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">nb2.coefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">estimate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">coef</span><span class="p">(</span><span class="n">summary</span><span class="p">(</span><span class="n">nbm2</span><span class="p">))</span><span class="o">$</span><span class="n">cond</span><span class="p">[,</span><span class="w"> </span><span class="s2">"Estimate"</span><span class="p">],</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">nbgam2</span><span class="p">)[</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">14</span><span class="p">)]),</span><span class="w">
</span><span class="n">model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"glmmTMB"</span><span class="p">,</span><span class="w"> </span><span class="s2">"mgcv::gam"</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">14</span><span class="p">),</span><span class="w">
</span><span class="n">term</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">coef</span><span class="p">(</span><span class="n">nbgam2</span><span class="p">)[</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">14</span><span class="p">)]),</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">nb2.coefs</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">estimate</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">term</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">position_dodgev</span><span class="p">(</span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Regression estimate"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Comparing mgcv with glmmTMB"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Salamander: Negative Binomial"</span><span class="p">)</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/compare-mgcv-with-glmmtmb-salamander-nb2-coefs-1.png" />
</p>
<p>
The values (posterior modes, or means) for the <code>site</code> random effect can also be compared
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">nbgam2.r</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">nbgam2</span><span class="p">)[</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">14</span><span class="p">)]</span><span class="w">
</span><span class="n">nbm2.r</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ranef</span><span class="p">(</span><span class="n">nbm2</span><span class="p">)</span><span class="o">$</span><span class="n">cond</span><span class="o">$</span><span class="n">site</span><span class="p">[,</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">nms</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sub</span><span class="p">(</span><span class="s2">"s\\(site\\)\\."</span><span class="p">,</span><span class="w"> </span><span class="s2">"Site "</span><span class="p">,</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">nbgam2.r</span><span class="p">))</span><span class="w">
</span><span class="n">ranefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">ranef</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">unname</span><span class="p">(</span><span class="n">nbgam2.r</span><span class="p">),</span><span class="w"> </span><span class="n">nbm2.r</span><span class="p">),</span><span class="w">
</span><span class="n">model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"glmmTMB"</span><span class="p">,</span><span class="w"> </span><span class="s2">"mgcv::gam"</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">nbgam2.r</span><span class="p">)),</span><span class="w">
</span><span class="n">site</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">nms</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">ranefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">ranefs</span><span class="p">,</span><span class="w"> </span><span class="n">site</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">site</span><span class="p">,</span><span class="w"> </span><span class="n">nms</span><span class="p">[</span><span class="n">order</span><span class="p">(</span><span class="n">nbgam2.r</span><span class="p">)]))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">ranefs</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ranef</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">site</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">position_dodgev</span><span class="p">(</span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Random effect"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Comparing mgcv with glmmTMB"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Salamanders: Negative Binomial"</span><span class="p">)</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/compare-mgcv-with-glmmtmb-slamander-nb2-ranefs-1.png" />
</p>
<p>
As the figure above shows, these too are essentially equivalent for the two fits.
</p>
<p>
The <code>summary()</code> output for the <code>glmmTMB()</code> model conveniently provides some additional useful information, most notably in the context of GLMMs the estimated variances (or standard deviations) of the random effect terms. As <code>gam()</code> wasn't designed with GLMMs specifically in mind, the same information is not provided in the <code>summary()</code> method for <code>gam()</code> model fits. However, Simon Wood has provided the <code>gam.vcomp()</code> function, which can be used to return the variance components of the model in a way that allows comparison with other mixed-model software.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">nbm2</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> Family: nbinom2 ( log )
Formula: count ~ spp * mined + (1 | site)
Data: Salamanders
AIC BIC logLik deviance df.resid
1663.4 1734.8 -815.7 1631.4 628
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
site (Intercept) 0.2842 0.5331
Number of obs: 644, groups: site, 23
Overdispersion parameter for nbinom2 family (): 1
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.3750 0.7576 -4.455 8.40e-06 ***
sppPR 0.9306 0.8773 1.061 0.288829
sppDM 2.2485 0.7878 2.854 0.004314 **
sppEC-A 0.7143 0.9052 0.789 0.430029
sppEC-L 1.8130 0.8130 2.230 0.025741 *
sppDES-L 2.5111 0.7795 3.221 0.001275 **
sppDF 2.5765 0.7801 3.303 0.000957 ***
minedno 4.1619 0.7932 5.247 1.55e-07 ***
sppPR:minedno -2.5831 0.9328 -2.769 0.005617 **
sppDM:minedno -2.1495 0.8258 -2.603 0.009245 **
sppEC-A:minedno -1.5828 0.9461 -1.673 0.094339 .
sppEC-L:minedno -1.3383 0.8493 -1.576 0.115100
sppDES-L:minedno -1.9358 0.8164 -2.371 0.017729 *
sppDF:minedno -2.7426 0.8217 -3.338 0.000844 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
</figure>
<p>
Now the <code>gam()</code> version, conveniently with a confidence interval
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">gam.vcomp</span><span class="p">(</span><span class="n">nbgam2</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Standard deviations and 0.95 confidence intervals:
std.dev lower upper
s(site) 0.5325309 0.327768 0.8652132
Rank: 1/1</code></pre>
</figure>
<p>
One further analysis that <span class="citation" data-cites="Brooks2017-so">Brooks et al. (2017)</span> do with the <code>Salamanders</code> data (in their Appendix B) is to demonstrate how to generate and plot fitted values from the model. To do this, the analyst needs to consider whether, and how, to marginalise over or condition on the random effects. The Appendix has some details on this more generally (via a linked reference) and more specific pointers on how to go about doing this with <code>glmmTMB()</code> models. In the next few code chunks I will show how to achieve the result from their section <em>Alternative prediction method</em>, where the aim is to predict at the population mode by setting the random effect component to 0. To illustrate this, <span class="citation" data-cites="Brooks2017-so">Brooks et al. (2017)</span> use the more complex ZIP model with linear predictors for both the mean and the zero-inflation components of the model. I fit those models first
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## glmmTMB()</span><span class="w">
</span><span class="n">zipm3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glmmTMB</span><span class="p">(</span><span class="n">count</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">spp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">mined</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">site</span><span class="p">),</span><span class="w"> </span><span class="n">zi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">spp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">mined</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Salamanders</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">poisson</span><span class="p">)</span><span class="w">
</span><span class="c1">## gam()</span><span class="w">
</span><span class="n">zipgam3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">count</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">spp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">mined</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">site</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"re"</span><span class="p">),</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="n">spp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">mined</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Salamanders</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ziplss</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span></code></pre>
</figure>
<p>
The <code>glmmTMB()</code> model has the zero-inflation linear predictor specified via the <code>ziformula</code> argument (abbreviated to <code>zi</code> above). With <code>gam()</code>, however, multiple linear predictors are specified via a list of formula objects, only the first of which has a response (left-hand side). The first formula, with the response, is for the Poisson mean, whilst the second is for the zero-inflation component. Note also that we use the special <code>ziplss()</code> family and that the model is now estimated using REML, because that is the only option available for these models, which Simon Wood calls <strong>general smooth models</strong> <span class="citation" data-cites="Wood2016-fx">(Wood et al., 2016)</span>. Do note that there is (as of writing) no <code>link</code> argument for the <code>ziplss()</code> family; this is due to the way the model is parameterised internally in the software, and it means we will have to pay particular attention to the link functions when back-transforming predictions shortly.
</p>
<p>
To recreate part of Figure B.3 in Appendix B <span class="citation" data-cites="Brooks2017-so">(Brooks et al., 2017)</span>, the code below predicts from the fitted <code>gam()</code> model for all combinations of the factors <code>mined</code> and <code>spp</code>. Notice how we have to specify a <code>site</code> in the prediction data, otherwise <code>predict()</code> will throw a tantrum. To set the random effect for <code>site</code> to zero, use the <code>exclude</code> argument. To exclude (i.e. set to zero) any model term, you supply a character vector or list of terms to <code>exclude</code>. For smooth terms, these must be named as they appear in <code>summary(model)</code>, hence the use of <code>"s(site)"</code>. The final step is to call <code>predict()</code> with <code>type = "link"</code>. This will return a two-column matrix (or a list of two-column matrices if <code>se.fit = TRUE</code> is also used).
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## Newdata</span><span class="w">
</span><span class="n">newd0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">Salamanders</span><span class="p">[,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"mined"</span><span class="p">,</span><span class="s2">"spp"</span><span class="p">)]),</span><span class="w"> </span><span class="n">site</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"R -1"</span><span class="p">))</span><span class="w">
</span><span class="n">rownames</span><span class="p">(</span><span class="n">newd0</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rownames</span><span class="p">(</span><span class="n">newd</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="kc">NULL</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">zipgam3</span><span class="p">,</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">exclude</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"s(site)"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"link"</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">pred</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> [,1] [,2]
1 0.36061171 -3.7727169
2 0.94857203 0.3875087
3 0.07601834 -2.6504763
4 0.35637762 -1.3459854
5 0.55867674 -1.4747253
6 1.14836206 0.3266343</code></pre>
</figure>
<p>
The first column is the predicted value of the response from the Poisson part of the model <em>on the scale of the linear predictor</em> (the log scale). The second column is the predicted value from the zero-inflation component and is on the complementary log-log scale. Both of these need to be back-transformed to their respective response scales and then multiplied together. To get the inverse link for the zero-inflation part, I grab the inverse link function from the base R <code>binomial()</code> family with the appropriate link specified. The second line of code below back-transforms each component with the appropriate inverse link, multiplies them together, and adds the resulting predicted values for each combination of <code>mined</code> and <code>spp</code> to the prediction data object.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ilink</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">binomial</span><span class="p">(</span><span class="n">link</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cloglog"</span><span class="p">)</span><span class="o">$</span><span class="n">linkinv</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">fitted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">pred</span><span class="p">[,</span><span class="m">1</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">pred</span><span class="p">[,</span><span class="m">2</span><span class="p">]))</span></code></pre>
</figure>
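<p>
As an aside (mine, not part of the original analysis), I believe recent versions of <strong>mgcv</strong> can do this back-transformation for you, as the <code>ziplss()</code> family supports <code>type = "response"</code> predictions that return the expected count directly. A quick check, under that assumption:
</p>
<figure class="highlight">
<pre><code class="language-r">## my aside: if ziplss() supports response-scale prediction (I believe
## recent mgcv versions do), this should match the manual calculation;
## if not, the manual route above is the safe one
fitted2 &lt;- predict(zipgam3, newd, exclude = "s(site)", type = "response")
all.equal(as.numeric(fitted2), newd$fitted)</code></pre>
</figure>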
<p>
A plot of the predicted values is then easily produced
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">spp</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mined</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/compare-mgcv-with-glmmtmb-salamander-population-mode-plot-3-1.png" />
</p>
<p>
Because of the way the <code>gam()</code> model is implemented, I could also have computed the Bayesian credible intervals using the Bayesian covariance matrix of the model parameters via the <code>se.fit</code> argument to <code>predict()</code>. I'll perhaps save that for another day…
</p>
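<p>
In rough outline, though, it might look like the following sketch (my code, not the post's; note that forming an interval for each linear predictor separately and multiplying the limits ignores the covariance between the two components, so simulating from the approximate posterior of the coefficients would be more defensible):
</p>
<figure class="highlight">
<pre><code class="language-r">## a rough sketch, not the post's code: crude pointwise intervals for the
## population-mode predictions, ignoring the covariance between the two
## linear predictors
pred_se &lt;- predict(zipgam3, newd, exclude = "s(site)", type = "link",
                   se.fit = TRUE)
crit &lt;- qnorm((1 + 0.95) / 2)
## column 1: Poisson linear predictor (log link)
## column 2: zero-inflation linear predictor (complementary log-log link)
lwr &lt;- exp(pred_se$fit[, 1] - crit * pred_se$se.fit[, 1]) *
    ilink(pred_se$fit[, 2] - crit * pred_se$se.fit[, 2])
upr &lt;- exp(pred_se$fit[, 1] + crit * pred_se$se.fit[, 1]) *
    ilink(pred_se$fit[, 2] + crit * pred_se$se.fit[, 2])
newd &lt;- transform(newd, lower = lwr, upper = upr)</code></pre>
</figure>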
<h3 id="owls">
Owls
</h3>
<p>
The <code>Owls</code> data are also available in the <strong>glmmTMB</strong> package, which I load and then do a little processing of the data to simplify the name of the response variable and to mean centre the <code>ArrivalTime</code> covariate.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">data</span><span class="p">(</span><span class="n">Owls</span><span class="p">,</span><span class="w"> </span><span class="n">package</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"glmmTMB"</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">Owls</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sub</span><span class="p">(</span><span class="s2">"SiblingNegotiation"</span><span class="p">,</span><span class="w"> </span><span class="s2">"NCalls"</span><span class="p">,</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">Owls</span><span class="p">))</span><span class="w">
</span><span class="n">Owls</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">Owls</span><span class="p">,</span><span class="w"> </span><span class="n">cArrivalTime</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ArrivalTime</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">ArrivalTime</span><span class="p">))</span></code></pre>
</figure>
<p>
Two ZIP models are considered
</p>
<ol type="1">
<li>
a ZIP with constant zero-inflation (an intercept-only model for the zero-inflation), and
</li>
<li>
a ZIP with complex zero-inflation, where one covariate and a random effect for <code>Nest</code> are included in the linear predictor of the zero-inflation part of the model.
</li>
</ol>
<p>
The constant zero-inflation models are fitted using the <code>ziformula</code> argument for <code>glmmTMB()</code> with <code>family = poisson</code>, whilst for <code>gam()</code> we use a list of two formula objects, the second for the ZI linear predictor, and the <code>ziplss()</code> family. Note that this model could also be fitted using the <code>ziP()</code> family in <strong>mgcv</strong>, but that employs a different, simpler fitting algorithm, so to facilitate comparison with the more complex model I use <code>ziplss()</code> instead; a sketch of a <code>ziP()</code> fit follows for reference.
</p>
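<p>
For reference, a <code>ziP()</code> version of the conditional model might look something like this (a sketch of mine only; it isn't fitted or compared in what follows):
</p>
<figure class="highlight">
<pre><code class="language-r">## sketch only: the simpler zero-inflated Poisson family in mgcv, in which
## the zero-inflation probability is a function of the Poisson linear
## predictor rather than being given its own formula
m1.zip &lt;- gam(NCalls ~ (FoodTreatment + cArrivalTime) * SexParent +
                  offset(logBroodSize) + s(Nest, bs = "re"),
              data = Owls, family = ziP(), method = "REML")</code></pre>
</figure>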
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1.tmb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glmmTMB</span><span class="p">(</span><span class="n">NCalls</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="p">(</span><span class="n">FoodTreatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">cArrivalTime</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">SexParent</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">offset</span><span class="p">(</span><span class="n">logBroodSize</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Nest</span><span class="p">),</span><span class="w">
</span><span class="n">ziformula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Owls</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">poisson</span><span class="p">)</span><span class="w">
</span><span class="n">m1.gam</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">NCalls</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="p">(</span><span class="n">FoodTreatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">cArrivalTime</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">SexParent</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">offset</span><span class="p">(</span><span class="n">logBroodSize</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Nest</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"re"</span><span class="p">),</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Owls</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ziplss</span><span class="p">())</span></code></pre>
</figure>
<p>
Again note that these models are not estimated in the same way; <code>glmmTMB()</code> estimates the model parameters using maximum likelihood, whilst only REML estimation is available for the <code>ziplss()</code> family with <code>gam()</code>. In <code>gam()</code>, the intercept-only ZI linear predictor is specified with the formula <code>~ 1</code>.
</p>
<p>
To compare the estimates of the model coefficients I wrote a little function to extract the estimated values and their standard errors from the two model objects
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">createCoeftab</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">TMB</span><span class="p">,</span><span class="w"> </span><span class="n">GAM</span><span class="p">,</span><span class="w"> </span><span class="n">GAMrange</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">bTMB</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fixef</span><span class="p">(</span><span class="n">TMB</span><span class="p">)</span><span class="o">$</span><span class="n">cond</span><span class="p">[</span><span class="m">-1</span><span class="p">]</span><span class="w">
</span><span class="n">bGAM</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">GAM</span><span class="p">)[</span><span class="n">GAMrange</span><span class="p">]</span><span class="w">
</span><span class="n">seTMB</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="n">vcov</span><span class="p">(</span><span class="n">TMB</span><span class="p">)</span><span class="o">$</span><span class="n">cond</span><span class="p">)[</span><span class="m">-1</span><span class="p">]</span><span class="w">
</span><span class="n">seGAM</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="n">vcov</span><span class="p">(</span><span class="n">GAM</span><span class="p">))[</span><span class="n">GAMrange</span><span class="p">]</span><span class="w">
</span><span class="n">nms</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">bTMB</span><span class="p">)</span><span class="w">
</span><span class="n">nms</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sub</span><span class="p">(</span><span class="s2">"FoodTreatment"</span><span class="p">,</span><span class="w"> </span><span class="s2">"FT"</span><span class="p">,</span><span class="w"> </span><span class="n">nms</span><span class="p">)</span><span class="w">
</span><span class="n">nms</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sub</span><span class="p">(</span><span class="s2">"cArrivalTime"</span><span class="p">,</span><span class="w"> </span><span class="s2">"ArrivalTime"</span><span class="p">,</span><span class="w"> </span><span class="n">nms</span><span class="p">)</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"glmmTMB"</span><span class="p">,</span><span class="w"> </span><span class="s2">"mgcv::gam"</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">bGAM</span><span class="p">)),</span><span class="w">
</span><span class="n">term</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">nms</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
</span><span class="n">estimate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unname</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">bTMB</span><span class="p">,</span><span class="w"> </span><span class="n">bGAM</span><span class="p">)))</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w">
</span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">estimate</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">seTMB</span><span class="p">,</span><span class="w"> </span><span class="n">seGAM</span><span class="p">)),</span><span class="w">
</span><span class="n">lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">estimate</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">seTMB</span><span class="p">,</span><span class="w"> </span><span class="n">seGAM</span><span class="p">)))</span><span class="w">
</span><span class="n">df</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
Passing each of the models to <code>createCoeftab()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1.coefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">createCoeftab</span><span class="p">(</span><span class="n">m1.tmb</span><span class="p">,</span><span class="w"> </span><span class="n">m1.gam</span><span class="p">,</span><span class="w"> </span><span class="n">GAMrange</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="m">6</span><span class="p">)</span></code></pre>
</figure>
<p>
results in a tidy data frame suitable for plotting with <code>ggplot()</code>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">m1.coefs</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">estimate</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">term</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">xmax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upper</span><span class="p">,</span><span class="w"> </span><span class="n">xmin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lower</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_pointrangeh</span><span class="p">(</span><span class="n">position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">position_dodgev</span><span class="p">(</span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Regression estimate"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Comparing mgcv with glmmTMB"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Owls: ZIP with constant zero-inflation"</span><span class="p">,</span><span class="w">
</span><span class="n">caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Bars are Ā±1 SE"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/compare-mgcv-with-glmmtmb-plot-m1-coefs-1.png" alt="Comparison of estimated model fixed effect parameters for the constant zer-inflation model fitted to the owl nestling behaviour data." />
<figcaption>
Comparison of estimated model fixed effect parameters for the constant zero-inflation model fitted to the owl nestling behaviour data.
</figcaption>
</figure>
<p>
As can be seen in the figure, the estimates from the two functions are quite similar.
</p>
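<p>
The coefficient plot doesn't cover the zero-inflation intercepts. We can pull those out too (my quick check, assuming the usual <strong>mgcv</strong> convention of a <code>.1</code> suffix on coefficients of the second linear predictor), but note the important caveat in the comments:
</p>
<figure class="highlight">
<pre><code class="language-r">## my quick check, not in the original post
fixef(m1.tmb)$zi               # glmmTMB: logit scale, P(structural zero)
coef(m1.gam)["(Intercept).1"]  # ziplss: cloglog scale, P(potential presence)
## these are NOT directly comparable: the link functions differ, and ziplss
## models the probability of a potential non-zero count rather than the
## probability of an excess zero</code></pre>
</figure>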
<p>
The more-complex models with covariates in the ZI linear predictor are fitted next
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m2.tmb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glmmTMB</span><span class="p">(</span><span class="n">NCalls</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="p">(</span><span class="n">FoodTreatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">cArrivalTime</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">SexParent</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">offset</span><span class="p">(</span><span class="n">logBroodSize</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Nest</span><span class="p">),</span><span class="w">
</span><span class="n">ziformula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">FoodTreatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Nest</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Owls</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">poisson</span><span class="p">)</span><span class="w">
</span><span class="n">m2.gam</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">NCalls</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="p">(</span><span class="n">FoodTreatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">cArrivalTime</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">SexParent</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">offset</span><span class="p">(</span><span class="n">logBroodSize</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Nest</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"re"</span><span class="p">),</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="n">FoodTreatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Nest</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"re"</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Owls</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ziplss</span><span class="p">())</span></code></pre>
</figure>
<p>
As before, we gather the model coefficients
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m2.coefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">createCoeftab</span><span class="p">(</span><span class="n">m2.tmb</span><span class="p">,</span><span class="w"> </span><span class="n">m2.gam</span><span class="p">,</span><span class="w"> </span><span class="n">GAMrange</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="m">6</span><span class="p">)</span></code></pre>
</figure>
<p>
and plot them
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">m2.coefs</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">estimate</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">term</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w">
</span><span class="n">xmax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upper</span><span class="p">,</span><span class="w"> </span><span class="n">xmin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lower</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_pointrangeh</span><span class="p">(</span><span class="n">position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">position_dodgev</span><span class="p">(</span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Regression estimate"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Comparing mgcv with glmmTMB"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Owls: ZIP with complex zero-inflation"</span><span class="p">,</span><span class="w">
</span><span class="n">caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Bars are Ā±1 SE"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/compare-mgcv-with-glmmtmb-plot-m2-coefs-1.png" alt="Comparison of estimated model fixed effect parameters for the complex zer-inflation model fitted to the owl nestling behaviour data." />
<figcaption>
Comparison of estimated model fixed effect parameters for the complex zero-inflation model fitted to the owl nestling behaviour data.
</figcaption>
</figure>
<p>
and likewise as before, the estimates of the fixed effect terms are very similar indeed.
</p>
<h3 id="conclusions">
Conclusions
</h3>
<p>
The comparisons above show that <code>mgcv::gam()</code> and <code>glmmTMB()</code> produce very similar estimates for the two models. And some crude timings showed that <code>gam()</code> was 20–40% faster than <code>glmmTMB()</code> at fitting the examples discussed in the paper. So all is roses, right!? Who needs <code>glmmTMB()</code>?
</p>
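<p>
(Those timings were of the crude <code>system.time()</code> variety; the sketch below is my reconstruction of the sort of thing involved, not the code actually used.)
</p>
<figure class="highlight">
<pre><code class="language-r">## my reconstruction of a crude timing comparison, using the negative
## binomial Salamanders models fitted earlier
system.time(glmmTMB(count ~ spp * mined + (1 | site),
                    data = Salamanders, family = nbinom2))
system.time(gam(count ~ spp * mined + s(site, bs = "re"),
                data = Salamanders, family = nb(), method = "REML"))</code></pre>
</figure>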
<p>
That would, however, be totally the wrong message to take from this comparison. Most notably, and something that isn't surfaced in these simple examples, <code>gam()</code> is limited in the complexity of the random effects it can efficiently represent in models:
</p>
<ul>
<li>
it can't do correlated random effects for random slopes and intercepts models (as far as I can tell anyway), and, probably the deal breaker,
</li>
<li>
model fitting with <code>gam()</code> gets bogged down quickly if the number of levels in a random effect gets large. <a href="https://twitter.com/jaimedash">Jaime Ashander</a> did some quick tests with a larger version of the Salamanders data with hundreds of <code>site</code>s, and <code>glmmTMB()</code> totally dominated <code>gam()</code>; a sketch of that kind of test follows this list.
</li>
</ul>
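<p>
A sketch of that kind of scaling test (my own construction, not Jaime Ashander's actual code) might look like this:
</p>
<figure class="highlight">
<pre><code class="language-r">## simulate a Salamanders-like data set with many more random effect levels
set.seed(42)
n_site &lt;- 500   # cf. the 23 sites in the real data
n_obs  &lt;- 10    # observations per site
big &lt;- data.frame(site = factor(rep(seq_len(n_site), each = n_obs)),
                  x    = runif(n_site * n_obs))
b &lt;- rnorm(n_site, sd = 0.5)  # site-level random effects
big$y &lt;- rpois(nrow(big), lambda = exp(0.5 * big$x + b[big$site]))
## time the two fits; gam() slows markedly as n_site grows
system.time(gam(y ~ x + s(site, bs = "re"), data = big,
                family = poisson, method = "REML"))
system.time(glmmTMB(y ~ x + (1 | site), data = big, family = poisson))</code></pre>
</figure>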
<p>
And that's fine; <code>gam()</code> was not designed to fit GLMMs. There are no fewer than <strong>three</strong> implementations <em>by Simon Wood alone</em> of functions to fit GAMs with complex random effects in mixed model software (<code>gamm()</code> to fit with <code>lme()</code>, <code>gamm4()</code> to fit using <code>lmer()</code> or <code>glmer()</code>, and <code>jagam()</code> in <strong>mgcv</strong> to fit via JAGS). Furthermore, <code>glmmTMB()</code> is currently more flexible in the range of models it can fit than any of these implementations except the JAGS route, because the <code>nb()</code>, <code>ziP()</code>, and <code>ziplss()</code> families only work with <code>gam()</code>.
</p>
<p>
What the above comparison illustrates, however, is that if you either don't have complex or numerous random effects, or you don't mind running models overnight, <code>gam()</code> is a good option for fitting GLMMs. Plus you have the advantage of estimating smooth functions of covariates, which is one area where <code>glmmTMB()</code> is currently lacking compared to <code>gam()</code>.
</p>
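<p>
For example (a sketch of mine, not a model considered in the post), letting arrival time enter the Owls model as a smooth is a small change to the first formula:
</p>
<figure class="highlight">
<pre><code class="language-r">## sketch: replace the linear cArrivalTime effect with a smooth; note this
## drops the arrival-time-by-sex interaction (s(cArrivalTime, by = SexParent)
## would be one way to keep something like it)
m2.smooth &lt;- gam(list(NCalls ~ FoodTreatment * SexParent + s(cArrivalTime) +
                          offset(logBroodSize) + s(Nest, bs = "re"),
                      ~ FoodTreatment + s(Nest, bs = "re")),
                 data = Owls, family = ziplss())</code></pre>
</figure>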
<p>
That said, it should be possible to emulate what Paul-Christian Bürkner has done in his <a href="https://cran.r-project.org/package=brms"><strong>brms</strong> package</a> (and similar implementations by Simon Wood in <code>gamm4()</code>) and use <strong>mgcv</strong> to set up the correct model matrices for the random effect representation of splines, which could then be fitted using <code>glmmTMB()</code>.
</p>
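<p>
The key pieces are already exported by <strong>mgcv</strong>. Very roughly, and assuming I have the details of the API right, the decomposition step looks like the sketch below; the glue that maps the result into a <code>glmmTMB()</code> fit is left entirely unwritten:
</p>
<figure class="highlight">
<pre><code class="language-r">## rough sketch: decompose a spline into fixed and random effect parts,
## which is what gamm4() and brms do internally
df &lt;- data.frame(x = runif(100))
sm &lt;- smoothCon(s(x, k = 10), data = df, absorb.cons = TRUE)[[1]]
re &lt;- smooth2random(sm, vnames = "", type = 2)  # type = 2: lme4-style
str(re, max.level = 1)  # re$Xf: fixed part; re$rand: random effect matrices</code></pre>
</figure>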
<p>
Finally, this was a fun exercise to replicate the analyses in <span class="citation" data-cites="Brooks2017-so">Brooks et al. (2017)</span>, motivated by a desire to understand what <strong>mgcv</strong> and <code>gam()</code> are doing with these random effect splines. It wasn't intended as a prize fight between two title contenders; hopefully this write-up didn't come across that way. I also learned a lot more about <strong>glmmTMB</strong>, which is shaping up nicely and looks like it'll have a place in my modelling toolbox.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Bolker2013-vl">
<p>
Bolker, B. M., Gardner, B., Maunder, M., Berg, C. W., Brooks, M., Comita, L., et al. (2013). Strategies for fitting nonlinear ecological models in R, AD Model Builder, and BUGS. <em>Methods in Ecology and Evolution</em> 4, 501–512. doi:<a href="https://doi.org/10.1111/2041-210X.12044">10.1111/2041-210X.12044</a>.
</p>
</div>
<div id="ref-Brooks2017-so">
<p>
Brooks, M. E., Kristensen, K., Benthem, K. J. van, Magnusson, A., Berg, C. W., Nielsen, A., et al. (2017). Modeling zero-inflated count data with glmmTMB. <em>bioRxiv</em>, 132753. doi:<a href="https://doi.org/10.1101/132753">10.1101/132753</a>.
</p>
</div>
<div id="ref-Price2015-se">
<p>
Price, S. J., Muncy, B. L., Bonner, S. J., Drayer, A. N., and Barton, C. D. (2015). Data from: Effects of mountaintop removal mining and valley filling on the occupancy and abundance of stream salamanders. doi:<a href="https://doi.org/10.5061/dryad.5m8f6">10.5061/dryad.5m8f6</a>.
</p>
</div>
<div id="ref-Price2016-no">
<p>
Price, S. J., Muncy, B. L., Bonner, S. J., Drayer, A. N., and Barton, C. D. (2016). Effects of mountaintop removal mining and valley filling on the occupancy and abundance of stream salamanders. <em>Journal of Applied Ecology</em> 53, 459–468. doi:<a href="https://doi.org/10.1111/1365-2664.12585">10.1111/1365-2664.12585</a>.
</p>
</div>
<div id="ref-Roulin2007-rq">
<p>
Roulin, A., and Bersier, L.-F. (2007). Nestling barn owls beg more intensely in the presence of their mother than in the presence of their father. <em>Animal Behaviour</em> 74, 1099–1106.
</p>
</div>
<div id="ref-Wood2016-fx">
<p>
Wood, S. N., Pya, N., and Säfken, B. (2016). Smoothing parameter and model selection for general smooth models. <em>Journal of the American Statistical Association</em> 111, 1548–1563. doi:<a href="https://doi.org/10.1080/01621459.2016.1180986">10.1080/01621459.2016.1180986</a>.
</p>
</div>
<div id="ref-Zuur2009-vg">
<p>
Zuur, A. F., Ieno, E. N., Walker, N., Saveliev, A. A., and Smith, G. M. (2009). <em>Mixed effects models and extensions in ecology with R</em>. Springer New York. doi:<a href="https://doi.org/10.1007/978-0-387-87458-6">10.1007/978-0-387-87458-6</a>.
</p>
</div>
</div>
Prediction intervals for GLMs part II
Gavin L. Simpson
2017-05-01T09:00:00-06:00
2017-05-01T09:00:00-06:00
https://www.fromthebottomoftheheap.net/2017/05/01/glm-prediction-intervals-ii/
<p>
One of my more popular <a href="http://stackoverflow.com/a/14424417/429846">answers</a> on StackOverflow concerns the issue of prediction intervals for a generalized linear model (GLM). Comments, even on StackOverflow, aren't a good place for a discussion, so I thought I'd post something here on my blog that went into a bit more detail as to why, for some common types of GLMs, prediction intervals aren't that useful and require a lot more thinking about what they mean and how they should be calculated. I've broken this into two parts; in this, the second part, I look at Poisson models.
</p>
<p>
The second example (purely because I happen to have it handy from teaching this semester) is from <span class="citation" data-cites="Korner-Nievergelt2015-tk">Korner-Nievergelt et al. (2015)</span>, and concerns the number of breeding pairs of the common whitethroat (<em>Sylvia communis</em>). This species likes to inhabit field margins and fallow land and has been adversely affected by intensive agricultural activities reducing these types of habitat in the landscape. As a mitigation effort, wildflower fields are sown and left largely unmanaged for several years. The data come from a study looking at how the number of breeding pairs of common whitethroat changes as the composition and structure of the plant community changes over time. The data are in the <strong>blmeco</strong> package, available on CRAN.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## install.packages("blmeco") # first, if not already installed</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"blmeco"</span><span class="p">)</span><span class="w">
</span><span class="n">data</span><span class="p">(</span><span class="n">wildflowerfields</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span></code></pre>
</figure>
<p>
The example in <span class="citation" data-cites="Korner-Nievergelt2015-tk">Korner-Nievergelt et al. (2015)</span> uses a Poisson GLM with a quadratic effect of the variable <code>age</code>. Instead I'll use a Poisson GAM, but in all other respects the analysis follows that from the textbook (only the year 2007 data are used, and field size is converted to hectares).
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span><span class="w">
</span><span class="n">wf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">subset</span><span class="p">(</span><span class="n">wildflowerfields</span><span class="p">,</span><span class="w"> </span><span class="n">year</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">2007</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="n">wf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">wf</span><span class="p">,</span><span class="w"> </span><span class="n">size.z</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">size</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">sd</span><span class="p">(</span><span class="n">size</span><span class="p">))</span><span class="w">
</span><span class="n">mod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">bp</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">size.z</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">offset</span><span class="p">(</span><span class="nf">log</span><span class="p">(</span><span class="n">size</span><span class="p">)),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wf</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">poisson</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: poisson
Link function: log
Formula:
bp ~ s(age, k = 6) + size.z + offset(log(size))
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1791 0.3333 -3.538 0.000404 ***
size.z -0.5283 0.2893 -1.826 0.067861 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(age) 2.608 3.223 7.323 0.074 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.207 Deviance explained = 46.3%
-REML = 33.589 Scale est. = 1 n = 41</code></pre>
</figure>
<p>
The primary variable of interest shows a moderate amount of non-linearity, similar to the quadratic effect of <code>age</code> in the version from the textbook, though the effect of field age is weak at best. The fitted model is illustrated graphically below, holding <code>size</code> constant at the mean field size
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ilink</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">family</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span><span class="o">$</span><span class="n">linkinv</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">wf</span><span class="p">,</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">age</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">age</span><span class="p">),</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">300</span><span class="p">),</span><span class="w">
</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">size.z</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"link"</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)))</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">fitted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="p">),</span><span class="w"> </span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)),</span><span class="w">
</span><span class="n">lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">wf</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">bp</span><span class="o">/</span><span class="n">size</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lower</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upper</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">),</span><span class="w">
</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.1</span><span class="p">,</span><span class="w"> </span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_jitter</span><span class="p">(</span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.1</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Age [years]"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">expression</span><span class="p">(</span><span class="s2">"Number of Breeding Pairs ["</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">pairs</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">ha</span><span class="o">^</span><span class="p">{</span><span class="m">-1</span><span class="p">}</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="s2">"]"</span><span class="p">),</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Common Whitethroat densities in Wildflower Fields"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Estimated densities for average field of ~1.8 ha"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/glm-prediction-intervals-ii-plot-wildflowers-1.png" alt="The fitted GAM for the common whitethroat data, showing the estimated number of breeding pairs per hectare with a 95% pointwise confidence interval. The points are the observed densities of breeding pairs." />
<figcaption>
The fitted GAM for the common whitethroat data, showing the estimated number of breeding pairs per hectare with a 95% pointwise confidence interval. The points are the observed densities of breeding pairs.
</figcaption>
</figure>
<p>
So far so good, but how do we interpret this model? For simplicity, let's assume that fields only come in integer ages. What the model implies is that for each integer age the observations are best fitted by (or described by; or generated from) a Poisson model with parameter <span class="math inline">\(\lambda\)</span> equal to the value of the solid line at each particular age. This value of <span class="math inline">\(\lambda\)</span> is just an estimate of the true value, and so we might envisage the observations for each age as having come from Poisson distributions with values of <span class="math inline">\(\lambda\)</span> given by the values of the upper and lower confidence band also shown in the figure above. For fields of two and five years of age these distributions look like this
</p>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/glm-prediction-intervals-ii-implied-poisson-distributions-1.png" />
</p>
<p>
The fitted Poisson distributions for the two field ages are shown by the green points and lines in the figure above. The effect of field age is to shift the estimated Poisson distribution to the right, towards on average higher numbers of breeding pairs. The uncertainty in the estimated model is shown by the orange and blue points and lines; these are based on the lower and upper 95% pointwise confidence interval on the estimated mean number of breeding pairs for fields of two and five years of age. The orange points illustrate the Poisson distribution from which the observations might have been derived if the true value of <span class="math inline">\(\lambda\)</span> were at the lower end of the confidence interval. The blue points show the Poisson distribution if the true value of <span class="math inline">\(\lambda\)</span> were at the upper end of the confidence interval. Each of these distributions implies, potentially at least, different predicted numbers of breeding pairs.
</p>
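<p>
(The figure above was produced with code not shown in the post. Purely as a sketch of the idea, the implied distributions can be computed with <code>dpois()</code>; the two values of <span class="math inline">\(\lambda\)</span> below are the fitted means for two- and five-year-old fields, which we compute properly a little further down.)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## sketch only: Poisson densities implied by the fitted model for
## average-sized fields of two and five years of age; the lambda
## values are the fitted means computed later in the post
pairs <- 0:7
imp <- data.frame(pairs = rep(pairs, 2),
                  age = factor(rep(c(2, 5), each = length(pairs))),
                  density = c(dpois(pairs, lambda = 0.1793),
                              dpois(pairs, lambda = 0.8785)))
ggplot(imp, aes(x = pairs, y = density)) +
    geom_point() +
    facet_wrap(~ age, labeller = label_both)</code></pre>
</figure>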
<p>
We have estimated the expected number of breeding pairs given the age of the field and its size. We also have a (pointwise) 95% confidence interval on that expectation. As before, this isn't a prediction interval, so what would one of those look like in this case? Somewhat similar to those we created for the binomial GLM earlier, except now we have posterior densities (the probability density implied by the Poisson distribution with <span class="math inline">\(\lambda\)</span> given as a function of field age) for all the integers 0–∞, although once we get above 10 breeding pairs the density is going to be effectively 0 even if not technically so.
</p>
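<p>
That claim about the density above 10 pairs is easy to check with <code>ppois()</code>, the Poisson distribution function. Even at the largest value of <span class="math inline">\(\lambda\)</span> we'll meet below (about 1.69, the upper confidence limit for a five-year-old field), essentially all of the probability mass lies at 10 or fewer pairs
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## cumulative probability of 10 or fewer breeding pairs; effectively 1
ppois(10, lambda = 1.69)</code></pre>
</figure>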
<p>
Note that I said integers above; we can't have 2.5 breeding pairs <em>as a prediction</em>. Hence any prediction interval is really talking about points of probability for each integer <span class="math inline">\(\{0, 1, 2, \ldots\}\)</span> (even if we might consider a much smaller upper limit than that), not a continuous interval. Having said that, perhaps I'm being too pedantic? In some instances, the upper and lower 2.5<sup>th</sup> and 97.5<sup>th</sup> probability quantiles of the implied Poisson distribution do begin to look more like a prediction interval.
</p>
<p>
To illustrate, I'll work my way through some code showing some ways of thinking about what the fitted model says in terms of predicting the numbers of breeding pairs of common whitethroats. First a little bit of prep; I'll illustrate various intervals for two hypothetical fields of average size, one created two years ago and a second five years ago.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">size.z</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"fit"</span><span class="p">,</span><span class="w"> </span><span class="s2">"se"</span><span class="p">))</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">pred</span><span class="p">)</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">)),</span><span class="w">
</span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="p">),</span><span class="w"> </span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">)))</span><span class="w">
</span><span class="n">p</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> age size size.z fit se lower lambda upper
1 2 1 0 -1.7184424 0.5805902 0.05615594 0.1793453 0.5727752
2 5 1 0 -0.1295676 0.3260103 0.45767855 0.8784752 1.6861589</code></pre>
</figure>
<p>
<code>p</code> contains the estimated value of <span class="math inline">\(\lambda\)</span> (the expected number of breeding pairs), and the upper and lower 95% pointwise interval about this expected count, for the two fields. First, the 95% interval for the model-estimated <span class="math inline">\(\lambda\)</span> for the younger of the two fields, based on <code>qpois()</code>, the quantile function of the conditional distribution of the number of breeding pairs given field age
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">qpois</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">),</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="s2">"lambda"</span><span class="p">])</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0 1</code></pre>
</figure>
<p>
Hence we might expect either 0 or 1 breeding pairs. But, we haven't accounted for the uncertainty in the estimated <span class="math inline">\(\lambda\)</span>. At the lower end of the 95% interval on the estimated <span class="math inline">\(\lambda\)</span> the prediction interval would be
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">qpois</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">),</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="s2">"lower"</span><span class="p">])</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0 1</code></pre>
</figure>
<p>
and
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">qpois</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">),</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="s2">"upper"</span><span class="p">])</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0 2</code></pre>
</figure>
<p>
for the upper end, leading to a prediction interval of 0–2 breeding pairs. The same prediction interval for the five-year-old field would be
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">qpois</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">),</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="s2">"lower"</span><span class="p">],</span><span class="w"> </span><span class="n">p</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="s2">"upper"</span><span class="p">]))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0 5</code></pre>
</figure>
<p>
We can also look at the probability densities of the Poisson distribution for the estimated value of <span class="math inline">\(\lambda\)</span> and its 95% confidence interval. The table below shows these probability densities for a two-year-old field
</p>
<table>
<caption>
Posterior densities for selected numbers of breeding pairs for a two-year-old field. Columns show the densities for a Poisson distribution with <span class="math inline">\(\lambda\)</span> equal to the estimated value and the lower and upper limits on the estimated value for this field.
</caption>
<thead>
<tr class="header">
<th style="text-align: right;">
# of pairs
</th>
<th style="text-align: right;">
lower
</th>
<th style="text-align: right;">
estimate
</th>
<th style="text-align: right;">
upper
</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: right;">
0
</td>
<td style="text-align: right;">
0.9454
</td>
<td style="text-align: right;">
0.8358
</td>
<td style="text-align: right;">
0.5640
</td>
</tr>
<tr class="even">
<td style="text-align: right;">
1
</td>
<td style="text-align: right;">
0.0531
</td>
<td style="text-align: right;">
0.1499
</td>
<td style="text-align: right;">
0.3230
</td>
</tr>
<tr class="odd">
<td style="text-align: right;">
2
</td>
<td style="text-align: right;">
0.0015
</td>
<td style="text-align: right;">
0.0134
</td>
<td style="text-align: right;">
0.0925
</td>
</tr>
<tr class="even">
<td style="text-align: right;">
3
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0008
</td>
<td style="text-align: right;">
0.0177
</td>
</tr>
<tr class="odd">
<td style="text-align: right;">
4
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0025
</td>
</tr>
<tr class="even">
<td style="text-align: right;">
5
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0003
</td>
</tr>
<tr class="odd">
<td style="text-align: right;">
6
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0000
</td>
</tr>
<tr class="even">
<td style="text-align: right;">
7
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0000
</td>
</tr>
</tbody>
</table>
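<p>
The densities in this table (and the next) come straight from <code>dpois()</code>; a minimal sketch for the two-year-old field, reusing the data frame <code>p</code> created above
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## densities of 0-7 breeding pairs under the estimated lambda and the
## lower and upper limits of its 95% interval, for the two-year-old field
pairs <- 0:7
round(data.frame(pairs    = pairs,
                 lower    = dpois(pairs, p[1, "lower"]),
                 estimate = dpois(pairs, p[1, "lambda"]),
                 upper    = dpois(pairs, p[1, "upper"])), 4)</code></pre>
</figure>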
<p>
For example, we'd expect to observe no breeding pairs in somewhere between 56% and 95% of average-sized, two-year-old fields, depending on where within the confidence interval the true <span class="math inline">\(\lambda\)</span> lies. The same values are shown for a five-year-old field in the table below
</p>
<table>
<caption>
Posterior densities for selected numbers of breeding pairs for a five-year-old field. Columns show the densities for a Poisson distribution with <span class="math inline">\(\lambda\)</span> equal to the estimated value and the lower and upper limits on the estimated value for this field.
</caption>
<thead>
<tr class="header">
<th style="text-align: right;">
# of pairs
</th>
<th style="text-align: right;">
lower
</th>
<th style="text-align: right;">
estimate
</th>
<th style="text-align: right;">
upper
</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: right;">
0
</td>
<td style="text-align: right;">
0.6328
</td>
<td style="text-align: right;">
0.4154
</td>
<td style="text-align: right;">
0.1852
</td>
</tr>
<tr class="even">
<td style="text-align: right;">
1
</td>
<td style="text-align: right;">
0.2896
</td>
<td style="text-align: right;">
0.3649
</td>
<td style="text-align: right;">
0.3123
</td>
</tr>
<tr class="odd">
<td style="text-align: right;">
2
</td>
<td style="text-align: right;">
0.0663
</td>
<td style="text-align: right;">
0.1603
</td>
<td style="text-align: right;">
0.2633
</td>
</tr>
<tr class="even">
<td style="text-align: right;">
3
</td>
<td style="text-align: right;">
0.0101
</td>
<td style="text-align: right;">
0.0469
</td>
<td style="text-align: right;">
0.1480
</td>
</tr>
<tr class="odd">
<td style="text-align: right;">
4
</td>
<td style="text-align: right;">
0.0012
</td>
<td style="text-align: right;">
0.0103
</td>
<td style="text-align: right;">
0.0624
</td>
</tr>
<tr class="even">
<td style="text-align: right;">
5
</td>
<td style="text-align: right;">
0.0001
</td>
<td style="text-align: right;">
0.0018
</td>
<td style="text-align: right;">
0.0210
</td>
</tr>
<tr class="odd">
<td style="text-align: right;">
6
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0003
</td>
<td style="text-align: right;">
0.0059
</td>
</tr>
<tr class="even">
<td style="text-align: right;">
7
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0014
</td>
</tr>
</tbody>
</table>
<p>
I could repeat the process of simulating breeding pairs from the Poisson distributions with estimated values of <span class="math inline">\(\lambda\)</span>, but the code to illustrate this gets tedious and this post is long enough already.
</p>
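<p>
For the curious, though, here is a minimal sketch of what such a simulation might look like for the two-year-old field, again reusing <code>p</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## sketch: simulate counts of breeding pairs at the estimated lambda
## and at the limits of its 95% interval, for the two-year-old field
set.seed(1)
nsim <- 10000
sims <- sapply(p[1, c("lower", "lambda", "upper")],
               function(l) rpois(nsim, lambda = l))
apply(sims, 2, quantile, probs = c(0.025, 0.975))</code></pre>
</figure>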
<p>
The prediction intervals for the Poisson model are starting to look more like intervals than the ones for the binomial model we looked at earlier. They're still not something we can easily convey on a plot like we can with linear models and <code>predict.lm()</code>, however.
</p>
<p>
For continuous conditional distributions, prediction “intervals” act like their linear model counterparts, as long as we take the extra step of computing the prediction interval using the probability quantile function (the <code>qfoo()</code> functions in R, where <code>foo</code> is the abbreviation for the distribution) and potentially include the uncertainty in the estimated expectations (fitted values on the response scale), as we did in both examples above.
</p>
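<p>
As a quick, purely hypothetical illustration of that recipe, suppose we had a Gaussian GLM <code>m.gaus</code> and new data <code>nd</code> (both stand-ins, not objects from this post); a 95% prediction interval at the first new observation could be sketched as
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## sketch only: m.gaus and nd are hypothetical stand-ins
fv <- predict(m.gaus, nd)                  # expected values
sigma <- sqrt(summary(m.gaus)$dispersion)  # residual standard deviation
## this version is conditional on the estimated expectation; a fuller
## version would also propagate the uncertainty in fv, as above
qnorm(c(0.025, 0.975), mean = fv[1], sd = sigma)</code></pre>
</figure>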
<p>
OK, I think that's enough modelling pedantry for one [Ed: er-um, <em>two</em>] post.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Korner-Nievergelt2015-tk">
<p>
Korner-Nievergelt, F., von Felten, S., Roth, T., Guélat, J., Almasi, B., and Korner-Nievergelt, P. (2015). <em>Bayesian data analysis in ecology using linear models with R, BUGS, and Stan</em>. Elsevier Science & Technology Books.
</p>
</div>
</div>
Prediction intervals for GLMs part I
Gavin L. Simpson
2017-05-01T08:45:00-06:00
2017-05-01T08:45:00-06:00
https://www.fromthebottomoftheheap.net/2017/05/01/glm-prediction-intervals-i/
<p>
One of my more popular <a href="http://stackoverflow.com/a/14424417/429846">answers</a> on StackOverflow concerns the issue of prediction intervals for a generalized linear model (GLM). My answer really only addresses how to compute confidence intervals for parameters but in the comments I discuss the more substantive points raised by the OP in their question. Lately there's been a bit of back and forth between Jarrett Byrnes and myself about what a prediction “interval” for a GLM might mean. Comments, even on StackOverflow, aren't a good place for a discussion so I thought I'd post something here that went into a bit more detail as to why, for some common types of GLMs, prediction intervals aren't that useful and require a lot more thinking about what they mean and how they should be calculated. For illustration, I thought I'd use some small teaching example data sets, but whilst writing the post it started to get a little on the long side. So, I've broken it into two and in this part I look at logistic regression.
</p>
<p>
The first example concerns a small experiment on the rare insectivorous pitcher plant <em>Darlingtonia californica</em> (the cobra lily) used as an example in <span class="citation" data-cites="Gotelli2013-wm">Gotelli and Ellison (2013)</span> and originally reported in <span class="citation" data-cites="Dixon2005-bb">Dixon et al. (2005)</span>. <em>Darlingtonia</em> grows leaves that are modified to form a pitcher trap, which is filled with nectar that attracts insects, in particular vespulid wasps (<em>Vespula atropilosa</em>). The observations in the data set are on the height of pitcher traps (<code>leafHeight</code>) and whether or not the leaf was visited by a wasp (<code>visited</code>). The code chunk below downloads the data from the book's website and loads it into R ready for use.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">darlurl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"http://harvardforest.fas.harvard.edu/sites/harvardforest.fas.harvard.edu/files/ellison-pubs/2004/DarlingtoniaData3.txt"</span><span class="w">
</span><span class="n">darl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">read.fwf</span><span class="p">(</span><span class="n">darlurl</span><span class="p">,</span><span class="w"> </span><span class="n">widths</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">8</span><span class="p">,</span><span class="m">9</span><span class="p">),</span><span class="w"> </span><span class="n">header</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1L</span><span class="p">),</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="s2">"leafHeight"</span><span class="p">,</span><span class="w"> </span><span class="s2">"visited"</span><span class="p">))</span><span class="w">
</span><span class="n">darl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">darl</span><span class="p">,</span><span class="w"> </span><span class="n">visited</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.logical</span><span class="p">(</span><span class="n">visited</span><span class="p">))</span></code></pre>
</figure>
<p>
Kernel density estimates of the distributions of the leaf heights for visited and unvisited leaves are one way to visualise these data. Here we use <strong>ggplot2</strong>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"Leaf height [cm]"</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">darl</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">leafHeight</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">visited</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"density"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xlab</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Density"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/glm-prediction-intervals-i-load-packages-plot-darlingtonia-1.png" alt="Kernel density estimates of the distribution of heights of leaves visited or not by wasps." />
<figcaption>
Kernel density estimates of the distribution of heights of leaves visited or not by wasps.
</figcaption>
</figure>
<p>
We're interested in modelling the probability of leaf visitation as a function of leaf height. For this a binomial GLM is a logical choice, with the canonical link function, the logit or logistic function. Such a model is fitted using <code>glm()</code> as follows
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="n">visited</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">leafHeight</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">darl</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">binomial</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Call:
glm(formula = visited ~ leafHeight, family = binomial, data = darl)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.18274 -0.46820 -0.23897 -0.08519 1.90573
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.29295 2.16081 -3.375 0.000738 ***
leafHeight 0.11540 0.03655 3.158 0.001591 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 46.105 on 41 degrees of freedom
Residual deviance: 26.963 on 40 degrees of freedom
AIC: 30.963
Number of Fisher Scoring iterations: 6</code></pre>
</figure>
<p>
The model summary suggests an effect of leaf height; an estimate this large would be unlikely to be observed if there were no true effect. For a unit increase in leaf height, the odds of visitation increase by a factor of 1.12 (given by <code>exp(coef(m)[2])</code>).
</p>
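<p>
That factor is just the exponentiated <code>leafHeight</code> coefficient from the summary above
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">exp(coef(m)[2])  # exp(0.11540), approximately 1.12</code></pre>
</figure>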
<p>
How the probability of visitation varies as a function of leaf height, as estimated by the binomial GLM, can be visualised by predicting for a grid of values over the observed range of leaf heights. An approximate 95% point-wise confidence interval can also be created for the fitted function. In this case, we should create the confidence interval on the scale of the linear predictor, where we assume things behave in a more Gaussian-like manner, and then back-transform the calculated interval onto the probability scale using the inverse of the link function. The code below shows a general solution for this, where the inverse link function is obtained from the <code>family()</code> object contained within the fitted GLM object
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ilink</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">family</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="o">$</span><span class="n">linkinv</span><span class="w">
</span><span class="n">pd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">darl</span><span class="p">,</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">leafHeight</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">leafHeight</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">leafHeight</span><span class="p">),</span><span class="w">
</span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)))</span><span class="w">
</span><span class="n">pd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"link"</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)[</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="n">pd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">Fitted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="p">),</span><span class="w"> </span><span class="n">Upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)),</span><span class="w">
</span><span class="n">Lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">darl</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">leafHeight</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">visited</span><span class="p">)))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Lower</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Upper</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">leafHeight</span><span class="p">),</span><span class="w">
</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"steelblue2"</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Fitted</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">leafHeight</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Probability of visitation"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xlab</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/glm-prediction-intervals-i-fitted-function-and-ci-1.png" alt="Estimated probability of visitation plus pointwise 95% confidence interval." />
<figcaption>
Estimated probability of visitation plus pointwise 95% confidence interval.
</figcaption>
</figure>
<p>
So far, so standard; the confidence interval is just that, a Wald confidence interval on the fitted function based on the standard errors of the estimates of the model coefficients. It is not a prediction interval, however.
</p>
<p>
The fitted model can be interpreted as describing the binomial distribution for any given value of <code>leafHeight</code>. The binomial distribution is specified by two parameters: <em>n</em> the number of trials (specified via argument <code>size</code> in R's <code>dbinom()</code> and related functions), and <em>p</em> the probability of success. In the <em>Darlingtonia</em> example, <em>n</em> is 1 because each leaf was the result of 1 trial; was the leaf visited or not during the experiment? <em>p</em> is given by <span class="math inline">\(g(\eta)^{-1} = g(\beta_0 + \beta_1 \text{leafHeight})^{-1}\)</span>, where <span class="math inline">\(g\)</span> is the logit link function and <span class="math inline">\(g^{-1}\)</span> is its inverse. In other words, the probability parameter of the binomial distribution is a function of <code>leafHeight</code>.
</p>
<p>
To create a prediction interval for a value of <code>leafHeight</code>, we could look at the probability quantiles of the binomial distribution with <code>size = 1</code> and <code>prob = Fitted[leafHeight]</code>. For example, for the minimum and maximum observed leaf heights the extreme 2.5% and 97.5% probability quantiles are
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">qbinom</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">Fitted</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">)))</span><span class="w">
</span><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">qbinom</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tail</span><span class="p">(</span><span class="n">Fitted</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">)))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0 0
[1] 0 1</code></pre>
</figure>
<p>
In the first instance, for the minimum observed leaf height, the prediction interval is 0. Yes, just 0. For the maximum observed leaf height the 95% prediction interval is 0–1. Neither of these is very useful; one isn't even an interval in the usual sense of the word, and the other is so wide as to encompass both 0 and 1, which is no more information than we had before we started the whole exercise – a leaf can only be visited or not.
</p>
<p>
But this isn't quite what we want; we've only explored the quantiles of the distributions conditional upon the estimated probability. A real prediction interval would account for the uncertainty in this estimate. For that, we need the upper and lower confidence limits for the estimated probability.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">qbinom</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">head</span><span class="p">(</span><span class="n">Lower</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">),</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">Upper</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">))))</span><span class="w">
</span><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">qbinom</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">tail</span><span class="p">(</span><span class="n">Lower</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">),</span><span class="w"> </span><span class="n">tail</span><span class="p">(</span><span class="n">Upper</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">))))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0 1
[1] 0 1</code></pre>
</figure>
<p>
I think we can all agree that these intervals aren't really that useful…
</p>
<p>
Another way to use the fitted model is via what it says about the posterior density of the two possible predicted values, visited or unvisited. This can be computed with <code>dbinom()</code> using the code below, again for the minimum and maximum observed leaf heights
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">db</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">dbinom</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Fitted</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">)]),</span><span class="w">
</span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">db</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"NotVisited"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Visited"</span><span class="p">)</span><span class="w">
</span><span class="n">rownames</span><span class="p">(</span><span class="n">db</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">paste</span><span class="p">(</span><span class="s2">"leafHeight ="</span><span class="p">,</span><span class="w"> </span><span class="nf">range</span><span class="p">(</span><span class="n">leafHeight</span><span class="p">)))</span><span class="w">
</span><span class="nf">round</span><span class="p">(</span><span class="n">db</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> NotVisited Visited
leafHeight = 14 0.9966 0.0034
leafHeight = 84 0.0831 0.9169</code></pre>
</figure>
<p>
We see almost all the probability density on the unvisited outcome for leaves 14cm in height (which is also why the 95% interval we calculated earlier was all on unvisited (0); we'd need to go beyond a 99.7% interval to get the visited alternative (1) included in the interval). For leaves of 84cm, most of the density is on the visited outcome, but with approximately 8% on the unvisited outcome.
</p>
<p>
However, these values are exactly what we get if we just take the fitted probabilities for these leaf heights, which are given by the solid line in the plot we made earlier
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">Fitted</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">)])</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0.003410958 0.916879065</code></pre>
</figure>
<p>
These values are for the visited outcome, but subtract them from 1 and you have the values for the unvisited outcome
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">Fitted</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">)])</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0.99658904 0.08312093</code></pre>
</figure>
<p>
As before, this ignores the uncertainty in the estimated probability of visitation. The densities incorporating this uncertainty are shown in the table below
</p>
<table>
<caption>
Estimated probability of the visited and not-visited outcomes based on the upper (upr) and lower (lwr) 95% interval of the model-estimated probability of visitation for two leaf heights.
</caption>
<thead>
<tr class="header">
<th style="text-align: left;">
</th>
<th style="text-align: right;">
Not Visited (lwr)
</th>
<th style="text-align: right;">
Not Visited (upr)
</th>
<th style="text-align: right;">
Visited (lwr)
</th>
<th style="text-align: right;">
Visited (upr)
</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">
leafHeight = 14
</td>
<td style="text-align: right;">
0.9999
</td>
<td style="text-align: right;">
0.9125
</td>
<td style="text-align: right;">
0.0001
</td>
<td style="text-align: right;">
0.0875
</td>
</tr>
<tr class="even">
<td style="text-align: left;">
leafHeight = 84
</td>
<td style="text-align: right;">
0.4415
</td>
<td style="text-align: right;">
0.0103
</td>
<td style="text-align: right;">
0.5585
</td>
<td style="text-align: right;">
0.9897
</td>
</tr>
</tbody>
</table>
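<p>
These densities can be computed in the same way as before; a sketch, using the <code>Lower</code> and <code>Upper</code> columns of <code>pd</code> created earlier
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## densities of the not-visited (0) and visited (1) outcomes at the
## limits of the 95% interval on the fitted probability of visitation
db2 <- with(pd, rbind(dbinom(rep(c(0, 1), each = 2), size = 1,
                             prob = c(Lower[1], Upper[1])),
                      dbinom(rep(c(0, 1), each = 2), size = 1,
                             prob = c(Lower[100], Upper[100]))))
dimnames(db2) <- list(paste("leafHeight =", range(pd$leafHeight)),
                      c("NotVisited (lwr)", "NotVisited (upr)",
                        "Visited (lwr)", "Visited (upr)"))
round(db2, 4)</code></pre>
</figure>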
<p>
One more thing we can do with the fitted model is simulate random outcomes from it. Again we do this for the minimum and maximum observed leaf heights, first for the lowest leaf height
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">nrand</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10000</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">table</span><span class="p">(</span><span class="n">rbinom</span><span class="p">(</span><span class="n">nrand</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">Fitted</span><span class="p">[</span><span class="m">1</span><span class="p">])))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">nrand</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> 0 1
0.9977 0.0023 </code></pre>
</figure>
<p>
and then for the largest observed leaf height
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">table</span><span class="p">(</span><span class="n">rbinom</span><span class="p">(</span><span class="n">nrand</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">Fitted</span><span class="p">[</span><span class="m">100</span><span class="p">])))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">nrand</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> 0 1
0.0867 0.9133 </code></pre>
</figure>
<p>
The numbers should look pretty familiar – they are very close to both the posterior densities returned using <code>dbinom()</code> and the fitted probabilities we just looked at. In fact, as <code>nrand</code> tends to infinity, the proportions of the two outcomes will approach those given by <code>dbinom()</code>. As before, though I won't show it in full here, a complete interval would also include the uncertainty in the estimated probability.
</p>
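<p>
A sketch of one way to fold that uncertainty in: draw values of the linear predictor from its approximate Gaussian sampling distribution, back-transform them, and simulate an outcome for each draw (shown here for the largest observed leaf height)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## sketch: propagate uncertainty in the fitted probability by drawing
## linear-predictor values before simulating the binomial outcomes
set.seed(1)
eta <- with(pd, rnorm(nrand, mean = fit[100], sd = se.fit[100]))
table(rbinom(nrand, size = 1, prob = ilink(eta))) / nrand</code></pre>
</figure>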
<p>
In this example, the most useful outputs from the model are all based on the binomial distributions given values of leaf height. The interval given by the extreme 2.5th and 97.5th probability quantiles isn't of much use at all; for the two values of leaf height we looked at, the interval either wasn't an interval at all or it told us no more than we already knew, that leaves either were or were not visited.
</p>
<p>
That said, this binomial GLM example is pretty extreme; the observed data only take values <em>0</em> or <em>1</em> and nothing else. However, this has been a useful exercise to think about what the fitted model represents.
</p>
<p>
In the second part of this post I'll look at a model for a count response, which will start to look a little more interval-like than the one here.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Dixon2005-bb">
<p>
Dixon, P. M., Ellison, A. M., and Gotelli, N. J. (2005). Improving the precision of estimates of the frequency of rare events. <em>Ecology</em> 86, 1114ā1123. doi:<a href="https://doi.org/10.1890/04-0601">10.1890/04-0601</a>.
</p>
</div>
<div id="ref-Gotelli2013-wm">
<p>
Gotelli, N. J., and Ellison, A. M. (2013). <em>A primer of ecological statistics</em>. 2nd ed. Sinauer Associates.
</p>
</div>
</div>
Simultaneous intervals for derivatives of smooths revisited
Gavin L. Simpson
2017-03-21T00:00:00-06:00
2017-03-21T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2017/03/21/simultaneous-intervals-for-derivatives-of-smooths/
<p>
Eighteen months ago <a href="https://www.fromthebottomoftheheap.net/2016/12/15/simultaneous-interval-revisited/">I screwed up</a>! I'd written a <a href="https://www.fromthebottomoftheheap.net/2014/06/16/simultaneous-confidence-intervals-for-derivatives/">post</a> in which I described the use of simulation from the posterior distribution of a fitted GAM to derive simultaneous confidence intervals for the derivatives of a penalized spline. It was a nice post that attracted some interest. It was also wrong. In December I corrected the first part of that mistake by illustrating one approach to compute an actual simultaneous interval, but only for the fitted smoother. At the time I thought that the approach I outlined would translate to the derivatives, but I was being lazy, then Christmas came and went, and I was back to teaching – you know how it goes. Anyway, in this post I hope to finally rectify my past stupidity and show how the approach used to generate simultaneous intervals from the December 2016 post can be applied to the derivatives of a spline.
</p>
<p>
If you haven't read the December 2016 post I suggest you do so, as there I explain this:
</p>
<p>
<span class="math display">[ <span class="math display">\[\begin{align}
\mathbf{\hat{f}_g} &amp;\pm m_{1 - \alpha} \begin{bmatrix}
\widehat{\mathrm{st.dev}} (\hat{f}(g_1) - f(g_1)) \\
\widehat{\mathrm{st.dev}} (\hat{f}(g_2) - f(g_2)) \\
\vdots \\
\widehat{\mathrm{st.dev}} (\hat{f}(g_M) - f(g_M)) \\
\end{bmatrix}
\end{align}\]</span> ]</span>
</p>
<p>
This equation states that the critical value for a 100(1 - <span class="math inline">\(\alpha\)</span>)% simultaneous interval is given by the 100(1 - <span class="math inline">\(\alpha\)</span>)% quantile of the distribution of the maximum absolute standardized deviation of the fitted function from the true function. We don't know this distribution, so we generated realizations from it using simulation, and used the empirical quantiles of the simulated distribution to give the appropriate critical value <span class="math inline">\(m\)</span> with which to calculate the simultaneous interval. In that post I worked my way through some R code to show how you can calculate this for a fitted spline.
</p>
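<p>
For orientation, here is a compressed sketch of that simulation, assuming a fitted GAM <code>m</code> and a data frame of prediction locations <code>newd</code> like those created later in this post; the December 2016 post discusses each of these steps in detail
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## sketch: critical value for a 95% simultaneous interval
Vb <- vcov(m)                                 # Bayesian covariance of the coefs
Cg <- predict(m, newd, type = "lpmatrix")     # prediction matrix
se.fit <- sqrt(rowSums((Cg %*% Vb) * Cg))     # point-wise standard errors
BUdiff <- MASS::mvrnorm(10000, mu = rep(0, nrow(Vb)), Sigma = Vb)
simDev <- Cg %*% t(BUdiff)                    # deviations of fitted from truth
absDev <- abs(sweep(simDev, 1L, se.fit, FUN = "/"))
masd <- apply(absDev, 2L, max)                # max abs standardized deviation
crit <- quantile(masd, prob = 0.95, type = 8) # the critical value m</code></pre>
</figure>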
<p>
To keep this post relatively short, I won't rehash the discussion of the code used to compute the critical value <span class="math inline">\(m\)</span>. I also won't cover in detail how these derivatives are computed. We use finite differences and the general approach is explained in an <a href="/2014/05/15/identifying-periods-of-change-with-gams/">older post</a>. I don't recommend you use the code in that post for real data analysis, however. Whilst I was putting together this post I re-wrote the derivative code, as well as that for computing point-wise and simultaneous intervals, and started a new R package <strong>tsgam</strong>. <strong>tsgam</strong> is <a href="http://github.com/gavinsimpson/tsgam">available on GitHub</a> and we'll use it here. Note this package isn't even at version 0.1 yet, but the code for derivatives and intervals has been through several iterations now and has worked well whenever I have tested it.
</p>
<p>
Assuming you have the <strong>devtools</strong> package installed, you can install <strong>tsgam</strong> using
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">devtools</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"gavinsimpson/tsgam"</span><span class="p">)</span></code></pre>
</figure>
<p>
As example data, I'll again use the strontium isotope data set included in the <strong>SemiPar</strong> package, which is extensively analyzed in the monograph <em>Semiparametric Regression</em> <span class="citation" data-cites="Ruppert2003-pt">(Ruppert et al., 2003)</span>. First, load the packages we'll need as well as the data, which is data set <code>fossil</code>. If you don't have <strong>SemiPar</strong> installed, install it using <code>install.packages("SemiPar")</code> before proceeding
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span><span class="w"> </span><span class="c1"># fit the GAM</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"tsgam"</span><span class="p">)</span><span class="w"> </span><span class="c1"># code for derivatives & intervals</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w"> </span><span class="c1"># package for nice plots</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w"> </span><span class="c1"># simpler theme for the plots</span><span class="w">
</span><span class="n">data</span><span class="p">(</span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">package</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"SemiPar"</span><span class="p">)</span><span class="w"> </span><span class="c1"># load the data</span></code></pre>
</figure>
<p>
The <code>fossil</code> data set includes two variables and is a time series of strontium isotope measurements on samples from a sediment core. The data are shown below using <code>ggplot()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">strontium.ratio</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">scale_x_reverse</span><span class="p">()</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-for-derivatives-of-smooths-plot-fossil-data-1.png" alt="The strontium isotope example data used in the post" />
<figcaption>
The strontium isotope example data used in the post
</figcaption>
</figure>
<p>
The aim of the analysis of these data is to model how the measured strontium isotope ratio changed through time, using a GAM to estimate the clearly non-linear change in the response. As time runs in the opposite direction to sediment age, we should probably model these data on the time scale, especially if we want to investigate residual temporal auto-correlation. This requires creating a new variable <code>negAge</code>, for modelling purposes only
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">fossil</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">negAge</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">-</span><span class="n">age</span><span class="p">)</span></code></pre>
</figure>
<p>
As per the previous post a reasonable GAM for these data is fitted using <strong>mgcv</strong> and <code>gam()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">strontium.ratio</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">negAge</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span></code></pre>
</figure>
<p>
Having fitted the model we should do some evaluation of it, but I'm going to skip that here and move straight to computing the derivative of the fitted spline and a simultaneous interval for it. First we set some constants that we can refer to throughout the rest of the post
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## parameters for testing</span><span class="w">
</span><span class="n">UNCONDITIONAL</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="kc">FALSE</span><span class="w"> </span><span class="c1"># unconditional or conditional on estimating smooth params?</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10000</span><span class="w"> </span><span class="c1"># number of posterior draws</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">500</span><span class="w"> </span><span class="c1"># number of newdata values</span><span class="w">
</span><span class="n">EPS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1e-07</span><span class="w"> </span><span class="c1"># finite difference</span></code></pre>
</figure>
<p>
To facilitate checking that this interval has the correct coverage properties, I'm going to fix the locations where we'll evaluate the derivative, computing the vector of values to predict at just once. Normally you wouldn't need to do this to compute the derivatives and associated confidence intervals – you would just set the number of values <code>n</code> over the range of the predictors you want – and if you have a model with several splines it is probably easier to let <strong>tsgam</strong> handle this part for you.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## where are we going to predict at?</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">negAge</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">negAge</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">negAge</span><span class="p">),</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)))</span></code></pre>
</figure>
<p>
The <code>fderiv()</code> function in <strong>tsgam</strong> computes the first derivative of all splines in the supplied GAM<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>, or you can request derivatives for a specified smooth term. As we have only a single smooth term in the model, we simply pass in the model and the data frame of locations at which to evaluate the derivative
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">fd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fderiv</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">eps</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">EPS</span><span class="p">,</span><span class="w"> </span><span class="n">unconditional</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">UNCONDITIONAL</span><span class="p">)</span></code></pre>
</figure>
<p>
(We set <code>eps = EPS</code> so that we use the same finite-difference shift later in the post when checking the coverage properties, and we don't account for the uncertainty due to estimating the smoothness parameters (<code>unconditional = FALSE</code>); normally you can leave both arguments at their defaults.) The object returned by <code>fderiv()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">str</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">List of 6
$ derivatives :List of 1
$ terms : chr "negAge"
$ model :List of 52
..- attr(*, "class")= chr [1:3] "gam" "glm" "lm"
$ eps : num 1e-07
$ eval :'data.frame': 500 obs. of 1 variable:
$ unconditional: logi FALSE
- attr(*, "class")= chr "fderiv"</code></pre>
</figure>
<p>
contains a component <code>derivatives</code>, holding the evaluated derivatives for all smooth terms, or just those selected. The other components include a copy of the fitted model and some additional parameters that are required for the confidence intervals. Confidence intervals for the derivatives are computed using the <code>confint()</code> method. The <code>type</code> argument specifies whether point-wise or simultaneous intervals are required. For the latter, the number of simulations to draw is required via <code>nsim</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w"> </span><span class="c1"># set the seed to make this repeatable </span><span class="w">
</span><span class="n">sint</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">confint</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"simultaneous"</span><span class="p">,</span><span class="w"> </span><span class="n">nsim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">N</span><span class="p">)</span></code></pre>
</figure>
<p>
To make it easier to work with the results I wrote the <code>confint()</code> method so that it returns the confidence interval as a tidy data frame suitable for plotting with <strong>ggplot2</strong>. <code>sint</code> is a data frame with an identifier for the smooth term to which each row relates (<code>term</code>), plus columns containing the estimated derivative (<code>est</code>) and the lower and upper confidence limits
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">head</span><span class="p">(</span><span class="n">sint</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> term lower est upper
1 negAge -0.000053 -6.1e-06 0.000041
2 negAge -0.000053 -6.1e-06 0.000041
3 negAge -0.000053 -6.1e-06 0.000040
4 negAge -0.000052 -6.1e-06 0.000040
5 negAge -0.000052 -6.1e-06 0.000040
6 negAge -0.000051 -6.0e-06 0.000039</code></pre>
</figure>
<p>
The estimated derivative plus its 95% simultaneous confidence interval are shown below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">sint</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">-</span><span class="n">newd</span><span class="o">$</span><span class="n">negAge</span><span class="p">),</span><span class="w">
</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">est</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_hline</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">yintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lower</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upper</span><span class="p">),</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_x_reverse</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"First derivative"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Age"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-for-derivatives-of-smooths-plot-sint-1.png" alt="Estimated first derivative of the spline fitted to the strontium isotope data. The grey band shows the 95% simultaneous interval." />
<figcaption>
Estimated first derivative of the spline fitted to the strontium isotope data. The grey band shows the 95% simultaneous interval.
</figcaption>
</figure>
<p>
So far so good.
</p>
<p>
Having thought about how to apply the theory outlined in the previous post, it seems that all we need to do to apply it to derivatives is to make the assumption that <em>the estimate of the first derivative is unbiased</em> and hence we can proceed as we did in the previous post by computing <code>BUdiff</code> using a multivariate normal with zero mean vector and the Bayesian covariance matrix of the model coefficients. Where the version for derivatives differs is that we use a prediction matrix for the derivatives instead of for the fitted spline. This prediction matrix is created as follows
</p>
<ol type="1">
<li>
generate a prediction matrix from the current model for the locations in <code>newd</code>,
</li>
<li>
generate a second prediction matrix as before but for slightly shifted locations <code>newd + eps</code>
</li>
<li>
difference these two prediction matrices yielding the prediction matrix for the first differences <code>Xp</code>
</li>
<li>
for each smooth in turn
<ol type="1">
<li>
create a zero matrix, <code>Xi</code>, of the same dimensions as the prediction matrices
</li>
<li>
fill in the columns of <code>Xi</code> that relate to the current smooth using the values of the same columns from <code>Xp</code>
</li>
<li>
multiply <code>Xi</code> by the vector of model coefficients to yield predicted first differences
</li>
<li>
calculate the standard error of these predictions
</li>
</ol>
</li>
</ol>
<p>
The matrix <code>Xi</code> is supplied for each smooth term in the <code>derivatives</code> component of the object returned by <code>fderiv()</code>.
</p>
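<p>
As a minimal sketch of step 4 for our single smooth of <code>negAge</code> (assuming the differenced prediction matrix <code>Xp</code> from steps 1–3 and the fitted model <code>m</code>; the names <code>want</code>, <code>d</code>, and <code>se.d</code> are mine, for illustration), the column-zeroing and standard error calculations might look like
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## step 4, sketched for the single smooth s(negAge); assumes Xp from steps 1-3
Xi <- Xp * 0                                  # zero matrix, same dimensions as Xp
want <- grep("negAge", colnames(Xp))          # columns belonging to this smooth
Xi[, want] <- Xp[, want]                      # fill in only those columns
d <- drop(Xi %*% coef(m))                     # predicted first differences
se.d <- sqrt(rowSums((Xi %*% vcov(m)) * Xi))  # standard errors of the predictions</code></pre>
</figure>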
<p>
Once I'd grokked this one basic assumption about the unbiasedness of the first derivative, the rest of the translation of the method to derivatives fell into place. As we are using finite differences, we may introduce a little bias when estimating the first derivatives, but this can be reduced by making <code>eps</code> smaller, though the default probably suffices.
</p>
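<p>
To make the finite difference idea concrete, here is a toy example (a hypothetical helper, not part of <strong>tsgam</strong>) that approximates a known derivative by a forward difference
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## forward finite difference: f'(x) is approximately (f(x + eps) - f(x)) / eps
fd <- function(f, x, eps = 1e-07) (f(x + eps) - f(x)) / eps
fd(sin, 1)  # close to the true derivative, cos(1) = 0.5403023</code></pre>
</figure>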
<p>
To see the detail of how this is done, look at the source code for <code>tsgam:::simultaneous</code>, which apart from a bit of renaming of objects follows closely the code in the <a href="https://www.fromthebottomoftheheap.net/2016/12/15/simultaneous-interval-revisited/">previous post</a>.
</p>
<p>
Having computed the purported simultaneous interval for the derivatives of the trend, we should do what I didn't do in the original posts about these intervals and go and look at the coverage properties of the generated interval.
</p>
<p>
To do that I'm going to simulate a large number, <code>N</code>, of draws from the posterior distribution of the model. Each of these draws is a fitted spline that includes the uncertainty in the estimated model coefficients. Note that I'm not including a correction here for the uncertainty due to the smoothing parameters being estimated – you can set <code>unconditional = TRUE</code> throughout (or change <code>UNCONDITIONAL</code> above) to include this extra uncertainty if you wish.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">Vb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vcov</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">unconditional</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">UNCONDITIONAL</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">24</span><span class="p">)</span><span class="w">
</span><span class="n">sims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">MASS</span><span class="o">::</span><span class="n">mvrnorm</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">m</span><span class="p">),</span><span class="w"> </span><span class="n">Sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Vb</span><span class="p">)</span><span class="w">
</span><span class="n">X0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lpmatrix"</span><span class="p">)</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">newd</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">EPS</span><span class="w">
</span><span class="n">X1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lpmatrix"</span><span class="p">)</span><span class="w">
</span><span class="n">Xp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="n">X1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">X0</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">EPS</span><span class="w">
</span><span class="n">derivs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Xp</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">sims</span><span class="p">)</span></code></pre>
</figure>
<p>
The code above basically makes a large number of draws from the model posterior and applies the steps of the algorithm outlined above to generate <code>derivs</code>, a matrix containing 10000 draws from the posterior distribution of the model derivatives. Our simultaneous interval should entirely contain about 95% of these posterior draws. Note that a draw here refers to the entire set of evaluations of the first derivative for each posterior draw from the model. The plot below shows 50 such draws (lines)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">matplot</span><span class="p">(</span><span class="n">derivs</span><span class="p">[,</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">)],</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"solid"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-for-derivatives-of-smooths-plot-sample-of-derivs-1.png" alt="50 draws from the posterior distribution of the first derivative of the fitted spline." />
<figcaption>
50 draws from the posterior distribution of the first derivative of the fitted spline.
</figcaption>
</figure>
<p>
and 95% of the 10000 draws (lines) should lie <em>entirely</em> within the simultaneous interval if it has the right coverage properties. Put the other way, only 5% of the draws (lines) should ever venture outside the limits of the interval.
</p>
<p>
To check this is the case, we reuse the <code>inCI()</code> function, which checks whether a draw lies entirely within the interval or not
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">inCI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">upr</span><span class="p">,</span><span class="w"> </span><span class="n">lwr</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">all</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">lwr</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">upr</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
As each <em>column</em> of <code>derivs</code> contains a different draw, we want to apply <code>inCI()</code> to each column in turn
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">fitsInCI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">sint</span><span class="p">,</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">derivs</span><span class="p">,</span><span class="w"> </span><span class="m">2L</span><span class="p">,</span><span class="w"> </span><span class="n">inCI</span><span class="p">,</span><span class="w"> </span><span class="n">upr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upper</span><span class="p">,</span><span class="w"> </span><span class="n">lwr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lower</span><span class="p">))</span></code></pre>
</figure>
<p>
<code>inCI()</code> returns a <code>TRUE</code> if all the points that make up the line representing a single posterior draw lie within the interval and <code>FALSE</code> otherwise, therefore we can sum up the <code>TRUE</code>s (recall that a <code>TRUE == 1</code> and a <code>FALSE == 0</code>) and divide by the number of draws to get an estimate of the coverage properties of the interval. If we do this for our interval
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="nf">sum</span><span class="p">(</span><span class="n">fitsInCI</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">fitsInCI</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0.95</code></pre>
</figure>
<p>
we see that the interval includes 95% of the 10000 draws, which, you'll agree, is pretty close to the desired coverage of 95%.
</p>
<p>
That's it for this post; whilst the signs are encouraging that these simultaneous intervals have the required coverage properties, I've only looked at them for a simple single-term GAM, and only for a response that is conditionally distributed Gaussian. I also haven't looked at anything other than the coverage at an expected 95%. If you do use this in your work, please do check that the interval is working as anticipated. If you discover problems, please let me know either in the comments below or via email. The next task is to start thinking about extending these ideas to work with the wider range of GAMs that <strong>mgcv</strong> can fit, including location-scale models and models with factor-smooth interactions.
</p>
<h2 id="references" class="unnumbered">
References
</h2>
<div id="refs" class="references">
<div id="ref-Ruppert2003-pt">
<p>
Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). <em>Semiparametric regression</em>. Cambridge University Press.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
<code>fderiv()</code> currently works for smooths of a single variable fitted using <code>gam()</code> or <code>gamm()</code>. It hasn't been tested with the location-scale extended families in newer versions of <strong>mgcv</strong> and I doubt it will work with them currently. <a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
Modelling extremes using generalized additive models
Gavin L. Simpson
2017-01-25T00:00:00-06:00
2017-01-25T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2017/01/25/modelling-extremes-with-gams/
<p>
Quite some years ago, whilst working on the EU Sixth Framework project <em>Euro-limpacs</em>, I organized a workshop on statistical methods for analyzing time series data. One of the sessions was on the analysis of extremes, ably given by Paul Northrop (UCL Department of Statistical Science). That intro certainly whetted my appetite, but I never quite found the time to dig into the arcane world of extreme value theory. Two recent events rekindled my interest in extremes; Simon Wood quietly introduced into his <strong>mgcv</strong> package a family function for the generalized extreme value distribution (GEV), and I was asked to review a paper on extremes in time series. Since then I've been investigating options for fitting models for extremes to environmental time series, especially those that allow for time-varying effects of covariates on the parameters of the GEV. One of the first things I did was sit down with <strong>mgcv</strong> to get a feel for the <code>gevlss()</code> family function that Simon had added to the package, by repeating an analysis of a classic example data set that had been performed using the <strong>VGAM</strong> package of Thomas Yee.
</p>
<p>
The analysis I wanted to recreate was reported in a 2007 paper by Thomas Yee and Alec Stephenson <span class="citation" data-cites="Yee2007-rz">(Yee and Stephenson, 2007)</span> and concerned a time series of annual maximum sea levels at Fremantle, Western Australia. This example is also used extensively in Stuart Coles' excellent book on statistical modeling of extremes <span class="citation" data-cites="Coles2001-zz">(Coles, 2001)</span>. The data are available from the <strong>ismev</strong> support package for Coles' book in the data set <code>fremantle</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## install.packages("ismev") # if not installed!</span><span class="w">
</span><span class="n">data</span><span class="p">(</span><span class="n">fremantle</span><span class="p">,</span><span class="w"> </span><span class="n">package</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ismev"</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">fremantle</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> Year SeaLevel SOI
1 1897 1.58 -0.67
2 1898 1.71 0.57
3 1899 1.40 0.16
4 1900 1.34 -0.65
5 1901 1.43 0.06
7 1903 1.19 0.47</code></pre>
</figure>
<p>
The data contain 86 observations of the annual maximum sea level (in meters) over the period 1897–1989. The aim of the analysis is to account for any change in the distribution of annual maxima over time and to investigate any relationship with the Southern Oscillation Index, a measure of meteorological phenomena which reflects the development and intensity of El Niño events, and those of its counterpart La Niña, in the south Pacific. The data are shown below using <strong>ggplot</strong>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"cowplot"</span><span class="p">)</span><span class="w"> </span><span class="c1"># install.packages("cowplot") If not installed !</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">p1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">fremantle</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SeaLevel</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_smooth</span><span class="p">(</span><span class="n">se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">p2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">fremantle</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SOI</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SeaLevel</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_smooth</span><span class="p">(</span><span class="n">se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">plot_grid</span><span class="p">(</span><span class="n">p1</span><span class="p">,</span><span class="w"> </span><span class="n">p2</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/modelling-extremes-with-gams-post-1-packages-and-plot-1.png" alt="Time series of annual sea-level maxima at Fremantle, Western Australia (top) and the relationship between annual sea-level maxima and the Southern Oscillation Index" />
<figcaption>
Time series of annual sea-level maxima at Fremantle, Western Australia (top) and the relationship between annual sea-level maxima and the Southern Oscillation Index
</figcaption>
</figure>
<p>
In extreme value analysis, one of the key components is to assess the behaviour of the very large, or small, events/observations, and often the focus is on those that are much more extreme than anything in the observational record. This requires a considerably different approach to the usual statistical methods that focus on the mean of a distribution. Whilst we could approach the analysis of data like those in <code>fremantle</code> from the viewpoint of traditional methods employing the Gaussian distribution, the events of interest, the extreme high sea-level events, are way off in the tails of a distribution fitted by considering (usually) just its mean (and variance). Even small uncertainties in estimation of the distribution can be amplified when we get out into the extreme tails of the Gaussian, complicating inference about extremes and inflating uncertainties.
</p>
<p>
Extreme value theory has developed separate models and limiting distributions that replace the central role that the Gaussian distribution plays in other areas of statistical modeling and inference. Consider again the sea-level data; the sea level would have been measured daily (or roughly daily) at Fremantle in order to produce the annual maximum series we wish to analyze. For a single year, we might denote these daily observations by <span class="math inline">\(Z_1, \ldots, Z_m\)</span> and we'll assume that these are a random sample of sea-level values. The annual maximum is given by
</p>
<p>
<span class="math display">[ Y_m = { Z_1, , Z_m } ]</span>
</p>
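<p>
As a toy illustration of block maxima (a minimal sketch using simulated daily values, not the real Fremantle record), each year's maximum is simply the largest of that year's observations
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## toy block maxima: one annual maximum from each year of simulated daily data
set.seed(1)
daily <- data.frame(year = rep(1897:1899, each = 365),
                    level = rnorm(3 * 365, mean = 1.4, sd = 0.15))
with(daily, tapply(level, year, max))  # the block (annual) maxima</code></pre>
</figure>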
<p>
<span class="math inline">(Y_m)</span> are commonly known as <em>block maxima</em> ā the maxima of a block of random variables <span class="math inline">(Z_m)</span>. Extreme value theory considers the limiting distribution of <span class="math inline">(Y_m)</span> as <span class="math inline">(m)</span> tends to infinity. More simply, we want derive the distribution of annual maximum sea-level values as the number of annual maxima tends to infinity. The limiting distribution for <span class="math inline">(Y_m)</span> is restricted to the class of generalized extreme value distributions (GEV), which have the following form
</p>
<p>
<span class="math display">[ G(y) = { - _{+}^{-1/} } ]</span>
</p>
<p>
where <span class="math inline">\(\mu\)</span>, <span class="math inline">\(\sigma > 0\)</span>, and <span class="math inline">\(\xi\)</span> are the location, (positive) scale, and shape parameters respectively of the distribution. The distribution has support on values <span class="math inline">\(y\)</span> where <span class="math inline">\(1 + \xi (y - \mu) / \sigma > 0\)</span>, which is indicated by the subscript <span class="math inline">\(+\)</span> in the main equation above. <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\xi\)</span> can take any real value in the range <span class="math inline">\(-\infty\)</span> to <span class="math inline">\(+\infty\)</span>, whereas the scale parameter <span class="math inline">\(\sigma\)</span> can be any positive real value.
</p>
<p>
The GEV distribution encompasses the three potential extreme value distributions for block maxima:
</p>
<ol type="I">
<li>
the <strong>Gumbel</strong> distribution,
</li>
<li>
the <strong>Fréchet</strong> distribution, and
</li>
<li>
the <strong>Weibull</strong> distribution.
</li>
</ol>
<p>
These are also known as the Type I, II, and III extreme value distributions. Though I won't write out the equations for each of these distributions, they are all quite similar to the GEV distribution and have parameters <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span>, whilst the Fréchet and Weibull distributions also have a shape parameter <span class="math inline">\(\alpha\)</span>. The distributions differ markedly at the extreme positive end of <span class="math inline">\(y\)</span>, <span class="math inline">\(y_{+}\)</span>; the Weibull is finite, but both the Fréchet and Gumbel distributions are infinite, being distinguished by having polynomially and exponentially decaying density respectively. Each of these distributions can be reached from the GEV
</p>
<ul>
<li>
the Gumbel is reached when <span class="math inline">\(\xi = 0\)</span>,
</li>
<li>
the Fréchet when <span class="math inline">\(\xi\)</span> is <em>positive</em> (<span class="math inline">\(\xi > 0\)</span>), and
</li>
<li>
the Weibull when <span class="math inline">\(\xi\)</span> is <em>negative</em> (<span class="math inline">\(\xi < 0\)</span>)
</li>
</ul>
<p>
Traditionally, researchers had to decide which type of tail behaviour they expected prior to fitting one of the three extreme value distributions. The clear advantage of the GEV is that the choice of distribution is now a parameter that can be included in the model fitting process leading to fewer <em>a priori</em> decisions needing to be made ahead of the analysis.
</p>
<p>
As I mentioned above, the <code>gevlss()</code> family allows each of the parameters <span class="math inline">\(\mu\)</span>, <span class="math inline">\(\sigma\)</span>, and <span class="math inline">\(\xi\)</span> to have its own linear predictor <span class="math inline">\(\eta\)</span>, which may depend on one or more covariates. When setting up this model, therefore, we need to specify not one formula, but three. These are supplied in a list, with only the first having a left hand side term for the response.
</p>
<p>
The first model considered by <span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span> allowed for a smooth trend in <code>Year</code> and a smooth effect of <code>SOI</code> in the linear predictors for <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span>, whilst <span class="math inline">\(\xi\)</span> was modeled as an intercept-only linear predictor. The reason for the simple linear predictor for <span class="math inline">\(\xi\)</span> is that this parameter is exceedingly difficult to estimate from data; in a relatively small data set like the <code>fremantle</code> one there is very little information with which to inform <span class="math inline">\(\xi\)</span>.
</p>
<p>
To specify this model in <code>gam()</code> we need to create a list of three formula objects as follows:
</p>
<div id="cb1" class="sourceCode">
<pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" title="1"><span class="kw">list</span>(SeaLevel <span class="op">~</span><span class="st"> </span><span class="kw">s</span>(cYear) <span class="op">+</span><span class="st"> </span><span class="kw">s</span>(SOI),</a>
<a class="sourceLine" id="cb1-2" title="2"> <span class="op">~</span><span class="st"> </span><span class="kw">s</span>(cYear) <span class="op">+</span><span class="st"> </span><span class="kw">s</span>(SOI),</a>
<a class="sourceLine" id="cb1-3" title="3"> <span class="op">~</span><span class="st"> </span><span class="dv">1</span>)</a></code></pre>
</div>
<p>
Key points to note here are
</p>
<ul>
<li>
The ordering of the formula components is <span class="math inline">\(\mu\)</span>, <span class="math inline">\(\sigma\)</span>, and <span class="math inline">\(\xi\)</span>,
</li>
<li>
only the first formula, for <span class="math inline">\(\mu\)</span>, has a left hand side specifying the response variable, in this case <code>SeaLevel</code>,
</li>
<li>
the second and third formulas are right-hand sided only and start with a <code>~</code>,
</li>
<li>
intercept-only linear predictors are indicated by the formula <code>~ 1</code>
</li>
</ul>
<p>
This model can be thought of as an extended GLM and, as such, each linear predictor is associated with a link function. The default links for <span class="math inline">\(\mu\)</span>, <span class="math inline">\(\sigma\)</span>, and <span class="math inline">\(\xi\)</span> in the <code>gevlss()</code> family are <code>"identity"</code>, <code>"identity"</code>, and <code>"logit"</code> respectively, although technically the linear predictor for <span class="math inline">\(\sigma\)</span>, <span class="math inline">\(\eta_{\sigma}\)</span>, is for the <em>log scale parameter</em>, and hence the default identity link implies a fixed log link for <span class="math inline">\(\sigma\)</span>. Additionally, the <code>"logit"</code> link for <span class="math inline">\(\xi\)</span> is modified to restrict the range of <span class="math inline">\(\xi\)</span> to (-1, 0.5). To match the model fitted by <span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span>, the identity link is used for all three parameters.
</p>
<p>
Finally, note that the <strong>VGAM</strong> package requires the user to specify the degrees of freedom for each smooth term and the software searches for a smoothing parameter that achieves the required degrees of freedom. <strong>mgcv</strong> takes a different tack; the user specifies the dimension of the basis (the number of basis functions) to use for each smooth term and then <em>it</em> chooses smoothness parameters via penalized likelihood to maximize a log-marginal or log-restricted marginal likelihood. Assuming that the dimension of the basis is sufficiently rich to include the true but unknown smooth function, the <strong>mgcv</strong> approach avoids the user having to state <em>a priori</em> how wiggly each smooth term should be.
</p>
<p>
<span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span> used three degrees of freedom splines for each smooth term. Here I leave the basis dimension at the (essentially arbitrary) default value of 10. It will be instructive to see what smoothness parameters are selected as optimal, how <strong>mgcv</strong> copes with estimating smoothness in a relatively complex setting, and how the estimated smooths compare with those assumed by <span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span>.
</p>
<p>
One final tweak is required; the estimates of the intercept terms for <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span> would imply extrapolation backwards in time of some 2,000 years. It can help numerical stability when fitting if we centre <code>Year</code> about, say, the middle of the time series, which we do now before proceeding
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">fremantle</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">fremantle</span><span class="p">,</span><span class="w"> </span><span class="n">cYear</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">median</span><span class="p">(</span><span class="n">Year</span><span class="p">))</span></code></pre>
</figure>
<p>
With that out of the way, the model is fitted with relative ease as follows
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">SeaLevel</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">cYear</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">SOI</span><span class="p">),</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">cYear</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">SOI</span><span class="p">),</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fremantle</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gevlss</span><span class="p">(</span><span class="n">link</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">)))</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">m1</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gevlss
Link function: identity identity identity
Formula:
SeaLevel ~ s(cYear) + s(SOI)
~s(cYear) + s(SOI)
~1
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.49567 0.01517 98.577 <2e-16 ***
(Intercept).1 -2.13680 0.08853 -24.135 <2e-16 ***
(Intercept).2 -0.25472 0.08851 -2.878 0.004 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(cYear) 1.000 1.000 15.030 0.000106 ***
s(SOI) 1.366 1.650 13.549 0.000554 ***
s.1(cYear) 2.032 2.546 4.922 0.129164
s.1(SOI) 1.000 1.000 6.461 0.011026 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Deviance explained = NA%
-REML = -41.116 Scale est. = 1 n = 86</code></pre>
</figure>
<p>
The <code>summary()</code> output is similar to that of standard GAMs, except the convention is to append <code>.N</code>, where <code>N</code> is a positive integer, to terms for (confusingly) the second and third linear predictors respectively. The parametric terms are listed first.
</p>
<p>
Of interest here for this model is the estimate of <span class="math inline">\(\xi\)</span>, which is negative, -0.25 (with standard error 0.09, yielding an approximate 95% confidence interval of -0.43 to -0.08), indicating a Weibull-type distribution for the annual sea-level maxima. The values reported by <span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span> are <span class="math inline">\(\xi\)</span> = -0.27, with standard error 0.06.
</p>
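<p>
The quoted interval is just the usual Wald-type construction from the estimate and standard error reported in the summary above
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## approximate 95% confidence interval for xi from the summary output
-0.25472 + c(-1, 1) * qnorm(0.975) * 0.08851  # roughly -0.43 to -0.08</code></pre>
</figure>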
<p>
The smooth terms are listed next, and with the exception of the smooth of <code>Year</code> in <span class="math inline">\(\eta_{\sigma}\)</span>, all the estimated smooths have been penalized to (effectively) linear functions. The partial effect of each smooth can be plotted using the <code>plot()</code> method for <code>gam</code> models
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">pages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">scheme</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">seWithMean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/modelling-extremes-with-gams-post-1-plot-m1-smooths-1.png" alt="Fitted smooths for model m1 which uses penalized splines for the smooths of Year and SOI in the linear predictors for the location and scale parameters of the GEV distribution" />
<figcaption>
Fitted smooths for model <code>m1</code> which uses penalized splines for the smooths of <code>Year</code> and <code>SOI</code> in the linear predictors for the location and scale parameters of the GEV distribution
</figcaption>
</figure>
<p>
As reported in <span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span>, the fitted smooth of <code>Year</code> in <span class="math inline">\(\eta_{\sigma}\)</span> (lower left panel) is somewhat non-linear, with a partial effect of decreasing variance in sea levels through c. 1945 and increasing variance thereafter. <span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span> suggest that this smooth may be replaced by a piece-wise linear function with a knot around 1945. The authors also simplified the model by replacing the smooths for all the other variables with linear parametric terms. We will investigate this model next.
</p>
<p>
I haven't quite worked out how to get <code>gam()</code> to fit a piece-wise linear function yet, but the approach below is pretty close. The following model uses the new b-spline basis in <strong>mgcv</strong>, which allows a lot of control over how the basis is set up. In base R, a piece-wise linear basis with an interior knot at 1945 would be created using <code>splines::bs(Year, degree = 1, knots = 1945)</code> (see the sketch below), but then as far as <code>gam()</code> is concerned, the resulting basis functions are simply two continuous covariates that are treated as linear parametric terms.
</p>
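<p>
For reference, a minimal sketch of that base-R route, assuming the <code>fremantle</code> data used throughout this post; note that the resulting columns are unpenalized, so <code>gam()</code> would do no smoothing of them:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## a degree-1 (piece-wise linear) basis with an interior knot at 1945
library("splines")
bs_year <- with(fremantle, bs(Year, degree = 1, knots = 1945))
dim(bs_year) # two basis-function columns, one row per observation
## these columns would enter the model formula as ordinary linear
## parametric terms, outside mgcv's smoothing and plotting machinery</code></pre>
</figure>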
<p>
We can use the new b-spline basis to achieve something similar to (the same as?) <code>splines::bs</code> if we set the knot locations explicitly and use <code>m = 1</code> (for linear splines) and basis dimension <code>k = 3</code>. If you are setting the knots manually, then for the b-spline basis in <strong>mgcv</strong> you need to specify <code>k + m + 1</code> (5) knots, and the middle <code>k - m + 1</code> (3) knots should include all the covariate values. I'm not sure what determines where the two exterior knots should be located; in the code below I just place them at +/- 10 years from the extremes of the data. The knot locations are then specified as a list with component <code>cYear</code> (to match the covariate name), and, as we're modeling with the centred <code>Year</code>, I centre the knot locations using the median year as before.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">knots</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">fremantle</span><span class="p">,</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">cYear</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">Year</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="m">1945</span><span class="p">,</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Year</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">))</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">median</span><span class="p">(</span><span class="n">Year</span><span class="p">)))</span></code></pre>
</figure>
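<p>
A quick sanity check that we have the right number of knot locations and that the middle three span the centred covariate:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">knots$cYear
length(knots$cYear) # should be 5 = k + m + 1</code></pre>
</figure>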
<p>
The GAM can then be specified as before with three formulas. The type of smooth for <code>cYear</code> in <span class="math inline">\(\eta_{\sigma}\)</span> is specified via <code>bs = "bs"</code>, and the remaining parameters of the basis are as described above. The list of knots we just created is passed to the <code>knots</code> argument.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">SeaLevel</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">cYear</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">SOI</span><span class="p">,</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">cYear</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">SOI</span><span class="p">,</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fremantle</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gevlss</span><span class="p">(</span><span class="n">link</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">)),</span><span class="w">
</span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">m2</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gevlss
Link function: identity identity identity
Formula:
SeaLevel ~ cYear + SOI
~s(cYear, bs = "bs", m = 1, k = 3) + SOI
~1
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.5010226 0.0153061 98.067 < 2e-16 ***
cYear 0.0019503 0.0005139 3.795 0.000148 ***
SOI 0.0682778 0.0175807 3.884 0.000103 ***
(Intercept).1 -2.1230063 0.0882978 -24.044 < 2e-16 ***
SOI.1 0.2894395 0.1145038 2.528 0.011479 *
(Intercept).2 -0.2543328 0.0885032 -2.874 0.004057 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s.1(cYear) 1.465 2 4.875 0.036 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Deviance explained = NA%
-REML = -39.611 Scale est. = 1 n = 86</code></pre>
</figure>
<p>
The summary output indicates significant linear parametric effects of <code>cYear</code> and <code>SOI</code> in <span class="math inline">\(\eta_{\mu}\)</span>, and of <code>SOI</code> in <span class="math inline">\(\eta_{\sigma}\)</span>. There is now some evidence of an effect of <code>SOI</code> on the variance of the block maxima, although we would be right to treat this result with caution as the piece-wise linear structure was only guessed at after fitting the more general smooth term, which was not statistically significant. <span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span> performed an informal deviance test between the two models, which we repeat here
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">lldif</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">unclass</span><span class="p">(</span><span class="n">logLik</span><span class="p">(</span><span class="n">m1</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">logLik</span><span class="p">(</span><span class="n">m2</span><span class="p">))</span><span class="w">
</span><span class="n">dfdif</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">df.residual</span><span class="p">(</span><span class="n">m2</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">df.residual</span><span class="p">(</span><span class="n">m1</span><span class="p">)</span><span class="w">
</span><span class="n">pchisq</span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">lldif</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dfdif</span><span class="p">,</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0.3693541
attr(,"df")
[1] 9.1958</code></pre>
</figure>
<p>
the results of which match those published and suggest that the simpler model with the piece-wise linear smooth of <code>Year</code> in <span class="math inline">\(\eta_{\sigma}\)</span> is sufficient to describe the effect on the variance of the sea-level maxima.
</p>
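<p>
An alternative, equally informal, comparison of the two models is via AIC; a one-line sketch (both models were fitted with REML, so treat this as a rough guide only):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## smaller values indicate the preferred model
AIC(m1, m2)</code></pre>
</figure>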
<p>
The fitted piece-wise linear smooth can be plotted using the <code>plot()</code> method as before. To get the linear terms plotted we need to use the <code>all.terms = TRUE</code> option
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">m2</span><span class="p">,</span><span class="w"> </span><span class="n">pages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">scheme</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">seWithMean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">all.terms</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/modelling-extremes-with-gams-post-1-plot-m2-smooths-1.png" alt="Fitted smooths and parametric terms for model m2 which uses a piece-wise linear spline for the effect of Year on the scale parameter" />
<figcaption>
Fitted smooths and parametric terms for model <code>m2</code> which uses a piece-wise linear spline for the effect of <code>Year</code> on the scale parameter
</figcaption>
</figure>
<p>
This plot is a little more clunky than the previous one as the linear terms are plotted via calls to <code>termplot()</code>, and the way this is achieved in <code>plot.gam()</code> doesn't allow for separate y-axis limits for the linear terms (<code>scale = 0</code>); the <code>scheme</code> argument does not affect these plots either.
</p>
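<p>
One partial workaround, sketched below, is to draw only the penalized smooth via the <code>select</code> argument of <code>plot.gam()</code>, and handle the parametric terms separately:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## select picks out a single smooth by its position in the model;
## the piece-wise linear smooth of cYear is the only smooth in m2
plot(m2, select = 1, scheme = 1, seWithMean = FALSE)</code></pre>
</figure>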
<p>
If we wanted an entirely data-driven approach to fitting the smooth of <code>Year</code> in <span class="math inline">\(\eta_{\sigma}\)</span>, and wanted to crack that particular nut with an industrial-sized wrecking ball, we could use the adaptive spline basis by changing the basis type for the smooth to <code>bs = "ad"</code> as follows (note this takes a while to fit)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">SeaLevel</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">cYear</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">SOI</span><span class="p">,</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">cYear</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ad"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">SOI</span><span class="p">,</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fremantle</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gevlss</span><span class="p">(</span><span class="n">link</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">)))</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">m3</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gevlss
Link function: identity identity identity
Formula:
SeaLevel ~ cYear + SOI
~s(cYear, bs = "ad") + SOI
~1
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.5025023 0.0154310 97.369 < 2e-16 ***
cYear 0.0019770 0.0005175 3.820 0.000133 ***
SOI 0.0689087 0.0175007 3.937 8.23e-05 ***
(Intercept).1 -2.1203031 0.0899383 -23.575 < 2e-16 ***
SOI.1 0.2917915 0.1132973 2.575 0.010011 *
(Intercept).2 -0.2677254 0.0935770 -2.861 0.004223 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s.1(cYear) 1.816 2.046 5.86 0.0563 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Deviance explained = NA%
-REML = -40.975 Scale est. = 1 n = 86</code></pre>
</figure>
<p>
Again, there is some evidence of a trend in the variance of the sea-level maxima; the higher <em>p</em>-value here likely reflects the additional uncertainty arising from having to deduce the shape and varying wiggliness of the spline from the data directly. The resulting smooth is largely indistinguishable from the piece-wise linear one in <code>m2</code>, except for the smooth transition around 1945.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">m3</span><span class="p">,</span><span class="w"> </span><span class="n">pages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">scheme</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">seWithMean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">all.terms</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/modelling-extremes-with-gams-post-1-plot-m3-smooths-1.png" alt="Fitted smooths and parametric terms for model m3 which uses an adaptive spline for the effect of Year on the scale parameter" />
<figcaption>
Fitted smooths and parametric terms for model <code>m3</code> which uses an adaptive spline for the effect of <code>Year</code> on the scale parameter
</figcaption>
</figure>
<p>
My attempt to replicate the analysis of <span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span> was largely devoid of trouble, despite the <code>gevlss()</code> family being both new and described by Simon Wood as "somewhat experimental". The main difficulty was in trying to get a piece-wise linear spline within the <strong>mgcv</strong> framework, largely because doing it via <code>splines::bs()</code> makes it much more difficult to plot the partial effect of the overall function with the easily accessible tools that <strong>mgcv</strong> provides.
</p>
<p>
One area where <strong>mgcv</strong> is lacking in relation to <strong>VGAM</strong> for fitting GEV models is in the array of support functions that go with the fitted models: <strong>VGAM</strong> has lots of plot types specific to extreme value models that help with interpreting and checking the fitted model. In a future post I may try to tackle some of this using <strong>mgcv</strong>, if I find the time.
</p>
<p>
This is hopefully the first of several posts on modeling block maxima using <strong>mgcv</strong> and GAMs, so if you have any comments, suggestions, or corrections, let me know in the comments below.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Coles2001-zz">
<p>
Coles, S. (2001). <em>An introduction to statistical modeling of extreme values</em>. Springer London. doi:<a href="https://doi.org/10.1007/978-1-4471-3675-0">10.1007/978-1-4471-3675-0</a>.
</p>
</div>
<div id="ref-Yee2007-rz">
<p>
Yee, T. W., and Stephenson, A. G. (2007). Vector generalized linear and additive extreme value models. <em>Extremes</em> 10, 1–19. doi:<a href="https://doi.org/10.1007/s10687-007-0032-4">10.1007/s10687-007-0032-4</a>.
</p>
</div>
</div>
Pangaea and R and open palaeo data
Gavin L. Simpson
2016-12-16T00:00:00-06:00
2016-12-16T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/12/16/pangaea-r-open-palaeo-data/
<p>
For a while now, I've been wanting to experiment with rOpenSci's <strong>pangaear</strong> package <span class="citation" data-cites="pangaear-024">(Chamberlain et al., 2016)</span>, which allows you to search, and download data from, Pangaea, a major data repository for the earth and environmental sciences. Earlier in the year, as a member of the editorial board of <a href="http://www.nature.com/sdata/">Scientific Data</a>, Springer Nature's open data journal, I was handling a data descriptor submission that described a new 2,200-year foraminiferal δ<sup>18</sup>O record from the Gulf of Taranto in the Ionian Sea <span class="citation" data-cites="Taricco2016-pv">(Taricco et al., 2016)</span>. The data descriptor was recently <a href="http://doi.org/10.1038/sdata.2016.42">published</a>, and as part of the submission Carla Taricco deposited the data set in Pangaea. So, what better opportunity to test out <strong>pangaear</strong>? (Oh, and to fit a GAM to the data while I'm at it!)
</p>
<p>
The post makes use of the following packages: <strong>pangaear</strong> (obviously), <strong>mgcv</strong> and <strong>ggplot2</strong> for modelling and plotting, and <strong>tibble</strong> because <strong>pangaear</strong> returns search results and data sets in tibbles that I need to manipulate before I can fit a GAM to the δ<sup>18</sup>O record.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"pangaear"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"tibble"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span></code></pre>
</figure>
<p>
To download a data set from Pangaea you need to know the DOI of the deposit. If you don't know the Pangaea DOI, you can search the data records held by Pangaea for specific terms. In <strong>pangaear</strong>, searching is done using the <code>pg_search()</code> function. To find the data set I want, I'm going to search for records that have the string <code>"Taricco"</code> in the citation.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">recs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pg_search</span><span class="p">(</span><span class="n">query</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"citation:Taricco"</span><span class="p">)</span><span class="w">
</span><span class="n">recs</span><span class="o">$</span><span class="n">citation</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] "Versteegh, GJM; de Leeuw, JW; Taricco, C et al. (2007): Alkenone-derived UK'37 data and sea surface temperatures (SST) of sediment cores from the Gulf of Taranto (Italy)"
[2] "Taricco, C; Alessio, S; Rubinetti, S et al. (2016): A foraminiferal d18O record of sediment core GT90-3 covering the last 2,200 years"
[3] "Versteegh, GJM; de Leeuw, JW; Taricco, C et al. (2007): Alkenone-derived UK'37 data and sea surface temperatures (SST) of sediment core GT89-3"
[4] "Versteegh, GJM; de Leeuw, JW; Taricco, C et al. (2007): Alkenone-derived UK'37 data and sea surface temperatures (SST) of a combined sediment core"
[5] "Versteegh, GJM; de Leeuw, JW; Taricco, C et al. (2007): Alkenone-derived UK'37 data and sea surface temperatures (SST) of sediment core GT91-1"
[6] "Versteegh, GJM; de Leeuw, JW; Taricco, C et al. (2007): Alkenone-derived UK'37 data and sea surface temperatures (SST) of sediment core GT90-3" </code></pre>
</figure>
<p>
Assuming that the query didn't time out (Pangaea can be a little slow to respond on occasion, so you might find increasing the timeout on the query helps), <code>recs</code> should contain 6 records with <code>"Taricco"</code> in the citation. The one we want is the second entry.
</p>
<p>
To download the data object(s) associated with a record in Pangaea, we use the <code>pg_data()</code> function, supplying it with a single DOI.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pg_data</span><span class="p">(</span><span class="n">doi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">recs</span><span class="o">$</span><span class="n">doi</span><span class="p">[</span><span class="m">2</span><span class="p">])</span><span class="w"> </span><span class="c1"># doi = "10.1594/PANGAEA.857573"</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Downloading 1 datasets from 10.1594/PANGAEA.857573</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Processing 1 files</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">res</span><span class="w">
</span><span class="n">str</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="m">1</span><span class="p">]],</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[[1]]
<Pangaea data> 10.1594/PANGAEA.857573
# A tibble: 560 × 4
`Depth [m]` `Age [a AD]` `Age [ka BP]` `G. ruber d18O [per mil PDB]`
<dbl> <dbl> <dbl> <dbl>
1 1.4000 -188.20 2.13820 0.742
2 1.3975 -184.33 2.13433 0.290
3 1.3950 -180.46 2.13046 0.706
4 1.3925 -176.59 2.12659 0.356
5 1.3900 -172.72 2.12272 0.558
6 1.3875 -168.85 2.11885 0.746
7 1.3850 -164.98 2.11498 0.346
8 1.3825 -161.11 2.11111 0.554
9 1.3800 -157.24 2.10724 0.510
10 1.3775 -153.37 2.10337 0.543
# ... with 550 more rows
List of 4
$ doi : chr "10.1594/PANGAEA.857573"
$ citation:List of 1
..- attr(*, "class")= chr "citation"
$ meta :List of 1
..- attr(*, "class")= chr "meta"
$ data :Classes 'tbl_df', 'tbl' and 'data.frame': 560 obs. of 4 variables:
- attr(*, "class")= chr "pangaea"</code></pre>
</figure>
<p>
In Pangaea, a DOI might refer to a collection of data objects, in which case the object returned by <code>pg_data()</code> would be a list with as many components as objects in the collection. In this instance there is but a single data object associated with the requested DOI, but for consistency it is returned in a list with a single component.
</p>
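<p>
A quick check of that structure:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## one component per data object in the collection
length(res) # 1 here
class(res[[1]]) # "pangaea"</code></pre>
</figure>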
<p>
Rather than work with the <code>pangaea</code> object directly, for modelling or plotting it is, for the moment at least, going to be simpler if we extract out the data object, which is stored in the <code>$data</code> component. We'll also want to tidy up those variable/column names
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">foram <- res[[1]]$data
names(foram) <- c("Depth", "Age_AD", "Age_kaBP", "d18O")
foram</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 560 Ć 4
Depth Age_AD Age_kaBP d18O
<dbl> <dbl> <dbl> <dbl>
1 1.4000 -188.20 2.13820 0.742
2 1.3975 -184.33 2.13433 0.290
3 1.3950 -180.46 2.13046 0.706
4 1.3925 -176.59 2.12659 0.356
5 1.3900 -172.72 2.12272 0.558
6 1.3875 -168.85 2.11885 0.746
7 1.3850 -164.98 2.11498 0.346
8 1.3825 -161.11 2.11111 0.554
9 1.3800 -157.24 2.10724 0.510
10 1.3775 -153.37 2.10337 0.543
# ... with 550 more rows</code></pre>
</figure>
<p>
Now that's done, we can take a look at the data set
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ylabel</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">expression</span><span class="p">(</span><span class="n">delta</span><span class="o">^</span><span class="p">{</span><span class="m">18</span><span class="p">}</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">O</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"[ā° VPDB]"</span><span class="p">)</span><span class="w">
</span><span class="n">xlabel</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"Age [ka BP]"</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">foram</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">d18O</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Age_kaBP</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_x_reverse</span><span class="p">(</span><span class="n">sec.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sec_axis</span><span class="p">(</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1950</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">.</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">1000</span><span class="p">),</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Age [AD]"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_y_reverse</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ylabel</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xlabel</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-revisited-plot-data-1.png" alt="The Ī“^18^O record of Taricco et al (2016)" />
<figcaption>
The δ<sup>18</sup>O record of Taricco <em>et al</em> (2016)
</figcaption>
</figure>
<p>
Notice that the x-axis has been reversed on this plot so that as we move from left to right the observations become younger, as is standard for a time series. In the code block above I've used <code>sec_axis()</code> to add an AD scale to the x-axis. This is a new feature in version 2.2.0 of <strong>ggplot2</strong>, which allows a secondary axis that is a one-to-one transformation of the main scale. This isn't quite right here as the two scales don't map in a fully one-to-one fashion; because there is no year 0 AD (or 0 <abbr title="Before Common Era">BCE</abbr>), the scale will be a year out for the BCE period.
</p>
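<p>
The forward transformation used for the secondary axis is simple arithmetic, with 1950 AD taken as the "present" of the BP scale; a small sketch (<code>bp2ad</code> is a hypothetical helper name):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## convert thousands of years before present (ka BP) to years AD
bp2ad <- function(x) 1950 - (x * 1000)
bp2ad(c(2.1382, 0))
## [1] -188.2 1950.0
## note "-188 AD" is really 189 BCE: the off-by-one described above</code></pre>
</figure>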
<p>
Note too that the y-axis has also been reversed, to match the published versions of the data. This is done in those publications because δ<sup>18</sup>O has an interpretation as temperature, with lower δ<sup>18</sup>O indicating higher temperatures. As is common for data from proxies that have a temperature interpretation, the values are plotted in a way that <em>up</em> on the plot means <em>warmer</em> and <em>down</em> means colder.
</p>
<p>
To model the data in the same time-ordered way using the year BP variable, we need to create a variable that is the negative of <code>Age_kaBP</code>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">foram</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">foram</span><span class="p">,</span><span class="w"> </span><span class="n">add_column</span><span class="p">(</span><span class="n">foram</span><span class="p">,</span><span class="w"> </span><span class="n">Age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">Age_kaBP</span><span class="p">))</span></code></pre>
</figure>
<p>
Note that we don't want to use the <code>Age_AD</code> scale for this as it has the problem of a discontinuity at 0 AD (which doesn't exist).
</p>
<p>
Now we can fit a GAM to the δ<sup>18</sup>O record
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">d18O</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Age</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ad"</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">foram</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w"> </span><span class="n">select</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<p>
In this instance I used an adaptive spline basis, <code>bs = "ad"</code>, which allows the degree of wiggliness to vary along the fitted function. With a relatively large data set like this, which has over 500 observations, using an adaptive smoother can provide a better fit to the observations, and it is especially useful in situations where it is plausible that the response will vary more over some time periods than others. Adaptive smooths aren't going to work well in short time series; there just isn't the information available to estimate what can, in effect, be thought of as several separate splines over small chunks of the data. That said, I've had success with data sets of about 100–200 observations. Also note that fitting an adaptive smoother requires cranking the CPU over a lot more calculations; be aware of that if you throw a very large data set at it.
</p>
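<p>
To get a feel for the extra computation, a rough timing sketch comparing the adaptive fit with a standard thin-plate spline of the same basis dimension:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## adaptive basis, as fitted above
system.time(gam(d18O ~ s(Age, k = 100, bs = "ad"),
                data = foram, method = "REML"))
## default thin-plate regression spline of the same dimension
system.time(gam(d18O ~ s(Age, k = 100),
                data = foram, method = "REML"))</code></pre>
</figure>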
<p>
Also note that the model was fitted using REML; in most cases this is the default you want to be using, as GCV can undersmooth in some circumstances. The double penalty approach of <span class="citation" data-cites="Marra2011-sf">Marra and Wood (2011)</span> is used here too (<code>select = TRUE</code>), which in this instance is being used to apply a bit of shrinkage to the fitted trend; it's good to be a little conservative at times.
</p>
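<p>
If you want to see what the extra penalty is doing, one sketch is to refit without it and compare; <code>m0</code> is a hypothetical name for the unshrunk fit:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## same model without the double penalty; compare the effective degrees
## of freedom and AIC with those of the shrunk fit m
m0 <- gam(d18O ~ s(Age, k = 100, bs = "ad"), data = foram, method = "REML")
AIC(m0, m)</code></pre>
</figure>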
<p>
The model diagnostics look OK for this model and the check of sufficient dimensionality in the basis doesn't indicate anything to worry about (partly because we used a large basis in the first place: <code>k' = 99 = k - 1</code>, one degree of freedom being absorbed by the identifiability constraint)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">gam.check</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="w">
</span><span class="c1">## RStudio users might need</span><span class="w">
</span><span class="c1">## layout(matrix(1:4, ncol = 2, byrow = TRUE))</span><span class="w">
</span><span class="c1">## gam.check(m)</span><span class="w">
</span><span class="c1">## layout(1)</span><span class="w">
</span><span class="c1">## to see all the plots on one device</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Method: REML Optimizer: outer newton
full convergence after 26 iterations.
Gradient range [-0.0003244869,0.000148452]
(score -128.435 & scale 0.03324845).
Hessian positive definite, eigenvalue range [6.249446e-06,279.6116].
Model rank = 100 / 100
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(Age) 99.000 15.539 0.993 0.44</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-revisited-gam-check-1.png" alt="Diagnostic plots for the fitted GAM" />
<figcaption>
Diagnostic plots for the fitted GAM
</figcaption>
</figure>
<p>
and the fitted trend is <em>inconsistent</em> with a null-model of no trend
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
d18O ~ s(Age, k = 100, bs = "ad")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.455329 0.007705 59.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Age) 15.54 99 3.357 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.373 Deviance explained = 39%
-REML = -128.44 Scale est. = 0.033248 n = 560</code></pre>
</figure>
<p>
There is a lot of variation about the fitted trend, but a model with about 15 degrees of freedom explains about 40% of the variance in the data set, which is pretty good.
</p>
<p>
While we could use the provided <code>plot()</code> method for <code>"gam"</code> objects to draw the fitted function, I now find myself preferring plotting with <strong>ggplot2</strong>. To recreate the sort of plot that <code>plot.gam()</code> would produce, we first need to predict for a fine grid of values, here 200 values, over the observed time interval. <code>predict.gam()</code> is used to generate predictions and standard errors; the standard errors requested here use a new addition to <strong>mgcv</strong> which includes the extra uncertainty in the model that arises because we are also estimating the smoothness parameters (the parameters that control the degree of wiggliness in the spline). This is achieved through the use of <code>unconditional = TRUE</code> in the call to <code>predict()</code>. The standard errors you get with the default, <code>unconditional = FALSE</code>, assume that the smoothness parameters, and therefore the amount of wiggliness, are known before fitting, which is rarely the case. This doesn't make much difference in this example, but I thought I'd mention it as it is a relatively new addition to <strong>mgcv</strong>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">foram</span><span class="p">,</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">Age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">-</span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">Age_kaBP</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Age_kaBP</span><span class="p">),</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)))</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">unconditional</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)))</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span><span class="w">
</span><span class="n">Fitted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w">
</span><span class="n">Upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">),</span><span class="w">
</span><span class="n">Lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">),</span><span class="w">
</span><span class="n">Age_kaBP</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">Age</span><span class="p">)</span></code></pre>
</figure>
<p>
The code above uses these standard errors to create an approximate 95% point-wise confidence interval around the fitted function, and prepares this in tidy format for plotting with <code>ggplot()</code>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">foram</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">d18O</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Age_kaBP</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Age_kaBP</span><span class="p">,</span><span class="w"> </span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Lower</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Upper</span><span class="p">),</span><span class="w">
</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"grey"</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.7</span><span class="p">,</span><span class="w"> </span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Age_kaBP</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Fitted</span><span class="p">),</span><span class="w"> </span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w">
</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_x_reverse</span><span class="p">(</span><span class="n">sec.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sec_axis</span><span class="p">(</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1950</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">.</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">1000</span><span class="p">),</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Age [AD]"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_y_reverse</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ylabel</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xlabel</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-revisited-final-plot-1.png" alt="Observed Ī“^18^O values with the fitted trend and 95% point-wise confidence interval superimposed" />
<figcaption>
Observed δ<sup>18</sup>O values with the fitted trend and 95% point-wise confidence interval superimposed
</figcaption>
</figure>
<p>
<span class="citation" data-cites="Taricco2009-eh">Taricco et al.Ā (2009)</span> used singular spectrum analysis (SSA), among other spectral methods, to decompose the Ī“<sup><sup>18</sup></sup>O time series into components of variability with a range of periodicities. A visual comparison with the SSA components and the fitted GAM trend, suggests that the GAM trend maps on to the sum of the long-term trend component plus the ~600 year and (potentially) the 350 year frequency components of the SSA. This does make we wonder a little about how real the higher frequency components identified in the SSA are? No matter how hard I tried (even setting the basis dimension of the GAM to <code>k = 500</code>) I couldnāt get it to be more wiggly than shown in the plots above). Figure 4 of <span class="citation" data-cites="Taricco2009-eh">Taricco et al.Ā (2009)</span> also showed the spectral power for the 4 non-trend components from the SSA. The power associated with the 200-year and the 125-year components is substantially less than that of the two longer-frequency components. The significance of the SSA components was determined using a Monte Carlo approach <span class="citation" data-cites="Allen1996-gc">(Allen and Smith, 1996)</span>, where surrogate time series are generate using AR(1) noise. Itās reasonable to ask whether this is a reasonable null model for these data? Itās also reasonable to ask whether the GAM approach I used above has sufficient statistical power to detect higher-freqency components if they actually exist? This warrants further study.
</p>
<p>
I started this post with some details on why I was prompted to look at this particular data set. Palaeo-scientists have a long record of sharing data (less so in some specific fields: yes, I'm looking at you, and me, palaeolimnologists), but, and perhaps this is just me, I'm seeing more of an open-data culture within palaeoecology and palaeoclimatology. This is great to see, and avenues for publishing, and hence generating traditional academic merit for, the data we generate will only help foster this. With my "editorial board member" hat on, I would encourage people to consider writing a data paper and submitting it to Scientific Data or one of the other data journals that are springing up. But if you can't or don't want to do that, depositing your data in an open repository like Pangaea brings with it many benefits and is something that we, the palaeo community, should be supportive of. I wouldn't have been writing this post if Taricco and co-authors hadn't chosen to make their data openly available.
</p>
<p>
And that brings me on to my final point for this post: having access to an excellent data repository like Pangaea from within a data analysis platform like R makes it so much easier to engage with the literature and ask new and interesting questions. I've highlighted Pangaea here, but other initiatives, like the <a href="http://www.neotomadb.org/">Neotoma</a> database, are doing a great job of making palaeo data available and also deserve our recognition and support; we might take access to these resources for granted, but implementing and maintaining web servers and APIs requires a lot of time, effort, and resources. Also, this post wouldn't have been possible without the work of the wonderful <a href="https://ropensci.org/">rOpenSci</a> community, who make available R packages to query the APIs of online repositories like Pangaea and Neotoma. Thank you!
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Allen1996-gc">
<p>
Allen, M. R., and Smith, L. A. (1996). Monte Carlo SSA: Detecting irregular oscillations in the presence of colored noise. <em>Journal of Climate</em> 9, 3373–3404. doi:<a href="https://doi.org/10.1175/1520-0442(1996)009<3373:MCSDIO>2.0.CO;2">10.1175/1520-0442(1996)009<3373:MCSDIO>2.0.CO;2</a>.
</p>
</div>
<div id="ref-pangaear-024">
<p>
Chamberlain, S., Woo, K., MacDonald, A., Zimmerman, N., and Simpson, G. (2016). <em>Pangaear: Client for the 'Pangaea' database</em>. Available at: <a href="https://CRAN.R-project.org/package=pangaear">https://CRAN.R-project.org/package=pangaear</a>.
</p>
</div>
<div id="ref-Marra2011-sf">
<p>
Marra, G., and Wood, S. N. (2011). Practical variable selection for generalized additive models. <em>Computational Statistics & Data Analysis</em> 55, 2372–2387. doi:<a href="https://doi.org/10.1016/j.csda.2011.02.004">10.1016/j.csda.2011.02.004</a>.
</p>
</div>
<div id="ref-Taricco2016-pv">
<p>
Taricco, C., Alessio, S., Rubinetti, S., Vivaldo, G., and Mancuso, S. (2016). A foraminiferal δ<sup>18</sup>O record covering the last 2,200 years. <em>Scientific Data</em> 3, 160042. doi:<a href="https://doi.org/10.1038/sdata.2016.42">10.1038/sdata.2016.42</a>.
</p>
</div>
<div id="ref-Taricco2009-eh">
<p>
Taricco, C., Ghil, M., Alessio, S., and Vivaldo, G. (2009). Two millennia of climate variability in the central Mediterranean. <em>Climate of the Past</em> 5, 171–181. doi:<a href="https://doi.org/10.5194/cp-5-171-2009">10.5194/cp-5-171-2009</a>.
</p>
</div>
</div>
Simultaneous intervals for smooths revisited
Gavin L. Simpson
2016-12-15T00:00:00-06:00
2016-12-15T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/12/15/simultaneous-interval-revisited/
<div id="refs" class="references">
<div id="ref-Ruppert2003-pt">
<p>
Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). <em>Semiparametric regression</em>. Cambridge University Press.
</p>
</div>
</div>
<p>
Eighteen months ago I <a href="https://www.fromthebottomoftheheap.net/2014/06/16/simultaneous-confidence-intervals-for-derivatives/">wrote a post</a> in which I described the use of simulation from the posterior distribution of a fitted GAM to derive simultaneous confidence intervals for the derivatives of a penalised spline. It was a nice post that attracted some interest. It was also wrong. I have no idea what I was thinking when I thought the intervals described in that post were simultaneous. Here I hope to rectify that past mistake.
</p>
<p>
I'll tackle the issue of simultaneous intervals for the derivatives of a penalised spline in a follow-up post. Here, I demonstrate one way to compute a simultaneous interval for a penalised spline in a fitted GAM. As example data, I'll use the strontium isotope data set included in the <strong>SemiPar</strong> package, which is extensively analyzed in the monograph <em>Semiparametric Regression</em> <span class="citation" data-cites="Ruppert2003-pt">(Ruppert et al., 2003)</span>. First, load the packages we'll need as well as the data, which is data set <code>fossil</code>. If you don't have <strong>SemiPar</strong> installed, install it using <code>install.packages("SemiPar")</code> before proceeding
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">data</span><span class="p">(</span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">package</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"SemiPar"</span><span class="p">)</span></code></pre>
</figure>
<p>
The <code>fossil</code> data set includes two variables and is a time series of strontium isotope measurements on samples from a sediment core. The data are shown below using <code>ggplot()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">strontium.ratio</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-revisited-plot-fossil-data-1.png" alt="The strontium isotope example data used in the post" />
<figcaption>
The strontium isotope example data used in the post
</figcaption>
</figure>
<p>
The aim of the analysis of these data is to model how the measured strontium isotope ratio changed through time, using a GAM to estimate the clearly non-linear change in the response. I won’t cover how the GAM is fitted and what all the options are here, but a reasonable GAM for these data is fitted using <strong>mgcv</strong> and <code>gam()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">strontium.ratio</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span></code></pre>
</figure>
<p>
The essentially arbitrary default for <code>k</code>, the basis dimension of the spline, is changed to <code>20</code> as there is a modest amount of non-linearity in the strontium isotope ratio time series. By using <code>method = "REML"</code>, the penalised spline model is expressed as a linear mixed model with the wiggly bits of the spline treated as random effects, and is estimated using restricted maximum likelihood; <code>method = "ML"</code> would also work here.
</p>
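<p>
To see this mixed-model representation explicitly, we can fit the same model with <code>gamm()</code>, which returns both a <code>gam</code> object and the underlying <code>lme</code> fit. This is only a sketch to illustrate the equivalence; the fitted smooth should match the one from <code>gam()</code> above.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Sketch only: the same penalised spline expressed explicitly as a
## linear mixed model; the wiggly basis functions are treated as
## random effects in the lme component of the returned object
m2 <- gamm(strontium.ratio ~ s(age, k = 20), data = fossil, method = "REML")
summary(m2$lme) # the mixed-model form of the fit</code></pre>
</figure>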
<p>
The fitted model uses ~12 effective degrees of freedom (which wouldn’t have been achievable with the default of <code>k = 10</code>!)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
strontium.ratio ~ s(age, k = 20)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.074e-01 2.435e-06 290527 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(age) 11.52 13.88 62.07 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.891 Deviance explained = 90.3%
-REML = -932.05 Scale est. = 6.2839e-10 n = 106</code></pre>
</figure>
<p>
The fitted spline captures the main variation in strontium isotope ratio values; the output from <code>plot.gam()</code> is shown below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">shade</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">seWithMean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">residuals</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-revisited-gam-plot-1.png" alt="The fitted penalised spline with approximate 95% point-wise confidence interval, as produced with plot.gam()" />
<figcaption>
The fitted penalised spline with approximate 95% point-wise confidence interval, as produced with <code>plot.gam()</code>
</figcaption>
</figure>
<p>
The confidence interval shown around the fitted spline is a 95% Bayesian credible interval. For reasons that don’t need to concern us right now, this interval has a surprising frequentist interpretation as a 95% <em>“across the function”</em> interval <span class="citation" data-cites="Nychka1988-rz Marra2012-bq">(Marra and Wood, 2012; Nychka, 1988)</span>; under repeated resampling from the population 95% of such confidence intervals will contain the true function. Such “across the function” intervals are quite intuitive, but, as we’ll see shortly, they don’t reflect the uncertainty in the fitted function; far fewer than 95% of splines drawn from the posterior distribution of the fitted GAM would lie within the confidence interval shown in the plot above.
</p>
<p>
How to compute a simultaneous interval for a spline is a well studied problem and a number of solutions have been proposed in the literature. Here I follow <span class="citation" data-cites="Ruppert2003-pt">Ruppert et al. (2003)</span> and use a simulation-based approach to generate a simultaneous interval. We proceed by considering a simultaneous confidence interval for a function <span class="math inline">(f(x))</span> at a set of <span class="math inline">(M)</span> locations in <span class="math inline">(x)</span>; we’ll refer to these locations, following the notation of <span class="citation" data-cites="Ruppert2003-pt">Ruppert et al. (2003)</span>, by
</p>
<p>
<span class="math display">[ = (g_1, g_2, , g_M) ]</span>
</p>
<p>
The true function over <span class="math inline">(\mathbf{g})</span>, <span class="math inline">(\mathbf{f_g})</span>, is defined as the vector of evaluations of <span class="math inline">(f)</span> at each of the <span class="math inline">(M)</span> locations
</p>
<p>
<span class="math display">[ <span class="math display">\[\begin{align}
\mathbf{f_g} &amp;\equiv \begin{bmatrix}
f(g_1) \\
f(g_2) \\
\vdots \\
f({g_M}) \\
\end{bmatrix}
\end{align}\]</span>
</p>
<p>
and the corresponding estimate of the true function given by the fitted GAM as <span class="math inline">(\mathbf{\hat{f}_g})</span>. The difference between the true function and our unbiased estimator is given by
</p>
<p>
<span class="math display">[ <span class="math display">\[\begin{align}
\mathbf{\hat{f}_g} - \mathbf{f_g} &amp;= \mathbf{C_g} \begin{bmatrix}
\boldsymbol{\hat{\beta}} - \boldsymbol{\beta} \\
\mathbf{\hat{u}} - \mathbf{u} \\
\end{bmatrix}
\end{align}\]</span>
</p>
<p>
where <span class="math inline">()</span> is the evaluation of the basis functions at the locations <span class="math inline">()</span>, and the thing in square brackets is the bias in the estimated model coefficients, which we assume to be mean 0 and follows, approximately, a multivariate normal distribution with mean vector <span class="math inline">()</span> and covariance matrix <span class="math inline">()</span>
</p>
<p>
<span class="math display">[
<span class="math display">\[\begin{bmatrix}
\boldsymbol{\hat{\beta}} - \boldsymbol{\beta} \\
\mathbf{\hat{u}} - \mathbf{u} \\
\end{bmatrix}\]</span>
N (, ) ]</span>
</p>
<p>
Having got those definitions out of the way, the 100(1 - <span class="math inline">(\alpha)</span>)% simultaneous confidence interval is
</p>
<p>
<span class="math display">[ <span class="math display">\[\begin{align}
\mathbf{\hat{f}_g} &amp;\pm m_{1 - \alpha} \begin{bmatrix}
\widehat{\mathrm{st.dev}} (\hat{f}(g_1) - f(g_1)) \\
\widehat{\mathrm{st.dev}} (\hat{f}(g_2) - f(g_2)) \\
\vdots \\
\widehat{\mathrm{st.dev}} (\hat{f}(g_M) - f(g_M)) \\
\end{bmatrix}
\end{align}\]</span>
</p>
<p>
where <span class="math inline">(m_{1 - })</span> is the 1 - <span class="math inline">()</span> quantile of the random variable
</p>
<p>
<span class="math display">[ <em>{x } | | </em>{1 M} | | ]</span>
</p>
<p>
Yep, that was <em>exactly</em> my reaction when I first read this section of <span class="citation" data-cites="Ruppert2003-pt">Ruppert et al. (2003)</span>!
</p>
<p>
Let’s deal with the left-hand side of the equation first. The <span class="math inline">(\sup)</span> refers to the <em>supremum</em> or the <em>least upper bound</em>: the least value that is <em>greater</em> than (or equal to) every value in the set under consideration, here the absolute standardized deviations over <span class="math inline">(\mathcal{X})</span>, the set of all values from which we observed our subset <span class="math inline">(x)</span>. Often this is simply the maximum value of the set. This is what is indicated by the right-hand side of the equation; we want the maximum (absolute) value of the ratio over all values in <span class="math inline">(\mathbf{g})</span>.
</p>
<p>
The fractions on both sides of the equation correspond to the standardized deviation between the true function and the model estimate, and we consider the <em>maximum absolute</em> standardized deviation. We don’t usually know the distribution of the maximum absolute standardized deviation, but we need it to access its quantiles. However, we can closely approximate the distribution via simulation. The difference here is that rather than simulating from the posterior of the model as we have done in earlier posts on this blog, this time we simulate from the multivariate normal distribution with mean vector <span class="math inline">(\mathbf{0})</span> and covariance matrix <span class="math inline">(\mathbf{V_b})</span>, the Bayesian covariance matrix of the fitted model. For each simulation we find the maximum absolute standardized deviation of the fitted function from the true function over the grid of <span class="math inline">(x)</span> values we are considering. Then we collect all these maxima, sort them, and either take the 1 - <span class="math inline">(\alpha)</span> probability quantile of the maxima, or the maximum with rank <span class="math inline">(\lceil (1 - \alpha) N \rceil)</span>.
</p>
<p>
OK, that’s enough words and crazy equations. Implementing this in R is going to be easier than those equations might suggest. I’ll run through the code we need line by line. First we define a simple function to generate random values from a multivariate normal: this is in the manual for <strong>mgcv</strong> and saves us loading another package just for this:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">rmvn</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="p">,</span><span class="w"> </span><span class="n">sig</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1">## MVN random deviates</span><span class="w">
</span><span class="n">L</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mroot</span><span class="p">(</span><span class="n">sig</span><span class="p">)</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">L</span><span class="p">)</span><span class="w">
</span><span class="n">t</span><span class="p">(</span><span class="n">mu</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">L</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">rnorm</span><span class="p">(</span><span class="n">m</span><span class="o">*</span><span class="n">n</span><span class="p">),</span><span class="w"> </span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
Next we extract a few things that we need from the fitted GAM
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">Vb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vcov</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">age</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">age</span><span class="p">),</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)))</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">se.fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">se.fit</span></code></pre>
</figure>
<p>
The first is the Bayesian covariance matrix of the model coefficients, <span class="math inline">(\mathbf{V_b})</span>. This <span class="math inline">(\mathbf{V_b})</span> is conditional upon the smoothing parameter(s). If you want a version that adjusts for the smoothing parameters being estimated rather than known values, add <code>unconditional = TRUE</code> to the <code>vcov()</code> call. Second, we define our grid of <span class="math inline">(x)</span> values over which we want a confidence band. Then we generate predictions and standard errors from the model for the grid of values. The last line just extracts out the standard errors of the fitted values for use later.
</p>
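<p>
For example, a minimal variant of the extraction above that uses the corrected covariance matrix would be:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## version of the Bayesian covariance matrix that accounts for the
## smoothing parameters having been estimated rather than known
Vb2 <- vcov(m, unconditional = TRUE)</code></pre>
</figure>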
<p>
Now we are ready to generate simulations of the maximum absolute standardized deviation of the fitted model from the true model. We set the pseudo-random seed to make the results reproducible and specify the number of simulations to generate.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10000</span></code></pre>
</figure>
<p>
Next, we want <code>N</code> draws from <span class="math inline">(
<span class="math display">\[\begin{bmatrix} \boldsymbol{\hat{\beta}} - \boldsymbol{\beta} \\ \mathbf{\hat{u}} - \mathbf{u} \\ \end{bmatrix}\]</span>
)</span>, which is approximately distributed multivariate normal with mean vector <span class="math inline">(\mathbf{0})</span> and covariance matrix <code>Vb</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">BUdiff</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rmvn</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">Vb</span><span class="p">)),</span><span class="w"> </span><span class="n">sig</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Vb</span><span class="p">)</span></code></pre>
</figure>
<p>
Now we calculate <span class="math inline">((x) - f(x))</span>, which is given by <span class="math inline">(
<span class="math display">\[\begin{bmatrix} \boldsymbol{\hat{\beta}} - \boldsymbol{\beta} \\ \mathbf{\hat{u}} - \mathbf{u} \\ \end{bmatrix}\]</span>
)</span> evaluated at the grid of <span class="math inline">(x)</span> values
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">Cg</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lpmatrix"</span><span class="p">)</span><span class="w">
</span><span class="n">simDev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Cg</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">BUdiff</span><span class="p">)</span></code></pre>
</figure>
<p>
The first line evaluates the basis functions at <span class="math inline">(\mathbf{g})</span> and the second line computes the deviations between the fitted and true functions at those locations. Then we find the absolute values of the standardized deviations from the true model. Here we do this in a single step for all simulations using <code>sweep()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">absDev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">abs</span><span class="p">(</span><span class="n">sweep</span><span class="p">(</span><span class="n">simDev</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="p">,</span><span class="w"> </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"/"</span><span class="p">))</span></code></pre>
</figure>
<p>
The maximum of the absolute standardized deviations at the grid of <span class="math inline">(x)</span> values for each simulation is computed via an <code>apply()</code> call
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">masd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">absDev</span><span class="p">,</span><span class="w"> </span><span class="m">2L</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="p">)</span></code></pre>
</figure>
<p>
The last step is to find the critical value used to scale the standard errors to yield the simultaneous interval; here we calculate the critical value for a 95% simultaneous confidence interval/band
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">crit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">quantile</span><span class="p">(</span><span class="n">masd</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.95</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8</span><span class="p">)</span></code></pre>
</figure>
<p>
The critical value estimated above is 3.205. Intervals generated using this value will be roughly 1.6 times wider than the point-wise interval shown above, which used a multiplier of 2 on the standard error (3.205 / 2 ≈ 1.6).
</p>
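<p>
As a quick sanity check on that statement, compare the simultaneous critical value with the point-wise multiplier of 2:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## ratio of simultaneous to point-wise interval half-widths
unname(crit) / 2 # ~1.6</code></pre>
</figure>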
<p>
Now that we have the critical value, we can calculate the simultaneous confidence interval. In the code block below I first add the grid of values (<code>newd</code>) to the fitted values and standard errors at those new values and then augment this with upper and lower limits for a 95% simultaneous confidence interval (<code>uprS</code> and <code>lwrS</code>), as well as the usual 95% point-wise intervals for comparison (<code>uprP</code> and <code>lwrP</code>). Then I plot the two intervals:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">pred</span><span class="p">),</span><span class="w"> </span><span class="n">newd</span><span class="p">),</span><span class="w">
</span><span class="n">uprP</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">),</span><span class="w">
</span><span class="n">lwrP</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">),</span><span class="w">
</span><span class="n">uprS</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">),</span><span class="w">
</span><span class="n">lwrS</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwrS</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">uprS</span><span class="p">),</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwrP</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">uprP</span><span class="p">),</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Strontium isotope ratio"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Age [Ma BP]"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-revisited-confidence-intervals-1.png" alt="Comparison of point-wise and simultaneous 95% confidence intervals for the fitted GAM" />
<figcaption>
Comparison of point-wise and simultaneous 95% confidence intervals for the fitted GAM
</figcaption>
</figure>
<p>
Finally, I’m going to look at the coverage properties of the interval we just created, which is something I should have done in the older post as it would have shown, as we’ll see, that the old interval I wrote about wasn’t even close to having the correct coverage properties.
</p>
<p>
Start by drawing a large sample from the posterior distribution of the fitted model. Note that this time, we’re simulating from a multivariate normal with mean vector given by the estimated model coefficients
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">sims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rmvn</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">m</span><span class="p">),</span><span class="w"> </span><span class="n">sig</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Vb</span><span class="p">)</span><span class="w">
</span><span class="n">fits</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Cg</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">sims</span><span class="p">)</span></code></pre>
</figure>
<p>
<code>fits</code> now contains N = 10<sup>4</sup> draws from the model posterior. Before we look at how many of the 10<sup>4</sup> samples from the posterior are entirely contained within the simultaneous interval, choose 30 at random and stack them in so-called tidy form for use with <code>ggplot()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">nrnd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">30</span><span class="w">
</span><span class="n">rnd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">nrnd</span><span class="p">)</span><span class="w">
</span><span class="n">stackFits</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">stack</span><span class="p">(</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">fits</span><span class="p">[,</span><span class="w"> </span><span class="n">rnd</span><span class="p">]))</span><span class="w">
</span><span class="n">stackFits</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">stackFits</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">newd</span><span class="o">$</span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">rnd</span><span class="p">)))</span></code></pre>
</figure>
<p>
What we’ve done in this post can be summarized in the figure below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwrS</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">uprS</span><span class="p">),</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwrP</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">uprP</span><span class="p">),</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">(</span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stackFits</span><span class="p">,</span><span class="w"> </span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">values</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ind</span><span class="p">),</span><span class="w">
</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"grey20"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Strontium isotope ratio"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Age [Ma BP]"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Point-wise & Simultaneous 95% confidence intervals for fitted GAM"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"Each line is one of %i draws from the Bayesian posterior distribution of the model"</span><span class="p">,</span><span class="w"> </span><span class="n">nrnd</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-revisited-plot-intervals-and-posterior-draws-1.png" alt="Summary plot showing 30 random draws from the model posterior and approximate 95% simultaneous and point-wise confidence intervals for the the fitted GAM" />
<figcaption>
Summary plot showing 30 random draws from the model posterior and approximate 95% simultaneous and point-wise confidence intervals for the fitted GAM
</figcaption>
</figure>
<p>
It shows the fitted model and the 95% simultaneous and point-wise confidence intervals, and is augmented with 30 draws from the posterior distribution of the GAM. As you can see, many of the lines lie outside the point-wise confidence interval. The situation is quite different with the simultaneous interval; only a couple of the posterior draws go outside of the 95% simultaneous interval, which is what we’d expect for a 95% interval. So that’s encouraging!
</p>
<p>
As a final check we’ll look at the proportion of all the posterior simulations that lie entirely within the simultaneous interval. To facilitate this we create a little wrapper function, <code>inCI()</code>, which returns <code>TRUE</code> if all the evaluation points <span class="math inline">(\mathbf{g})</span> lie within the stated interval and <code>FALSE</code> otherwise. This is then applied to each posterior simulation (column of <code>fits</code>) and we do this for the simultaneous intervals and the point-wise version. The final two lines work out what proportion of the posterior simulations lie within the two confidence intervals.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">inCI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">upr</span><span class="p">,</span><span class="w"> </span><span class="n">lwr</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">all</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">lwr</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">upr</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">fitsInPCI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">fits</span><span class="p">,</span><span class="w"> </span><span class="m">2L</span><span class="p">,</span><span class="w"> </span><span class="n">inCI</span><span class="p">,</span><span class="w"> </span><span class="n">upr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">uprP</span><span class="p">,</span><span class="w"> </span><span class="n">lwr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">lwrP</span><span class="p">)</span><span class="w">
</span><span class="n">fitsInSCI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">fits</span><span class="p">,</span><span class="w"> </span><span class="m">2L</span><span class="p">,</span><span class="w"> </span><span class="n">inCI</span><span class="p">,</span><span class="w"> </span><span class="n">upr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">uprS</span><span class="p">,</span><span class="w"> </span><span class="n">lwr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">lwrS</span><span class="p">)</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">fitsInPCI</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">fitsInPCI</span><span class="p">)</span><span class="w"> </span><span class="c1"># Point-wise</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">fitsInSCI</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">fitsInSCI</span><span class="p">)</span><span class="w"> </span><span class="c1"># Simultaneous</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0.3028
[1] 0.9526</code></pre>
</figure>
<p>
As you can see, the point-wise confidence interval includes just a small proportion of the posterior simulations, but the simultaneous interval contains approximately the right number of simulations for a 95% interval.
</p>
<p>
So how bad are the intervals I created in the old post? They should be as bad as the 95% point-wise interval, and they are
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">oldCI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">fits</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">,</span><span class="w"> </span><span class="n">quantile</span><span class="p">,</span><span class="w"> </span><span class="n">probs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">))</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">lwrOld</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">oldCI</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="n">uprOld</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">oldCI</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="p">])</span><span class="w">
</span><span class="n">fitsInOldCI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">fits</span><span class="p">,</span><span class="w"> </span><span class="m">2L</span><span class="p">,</span><span class="w"> </span><span class="n">inCI</span><span class="p">,</span><span class="w"> </span><span class="n">upr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">uprOld</span><span class="p">,</span><span class="w"> </span><span class="n">lwr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">lwrOld</span><span class="p">)</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">fitsInOldCI</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">fitsInOldCI</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0.2655</code></pre>
</figure>
<p>
So, there we have it – a proper 95% simultaneous confidence interval for a penalised spline. Now I just need to go back to that old post and strike out all reference to <em>simultaneous</em>…
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Marra2012-bq">
<p>
Marra, G., and Wood, S. N. (2012). Coverage properties of confidence intervals for generalized additive model components. <em>Scandinavian Journal of Statistics</em> 39, 53–74. doi:<a href="https://doi.org/10.1111/j.1467-9469.2011.00760.x">10.1111/j.1467-9469.2011.00760.x</a>.
</p>
</div>
<div id="ref-Nychka1988-rz">
<p>
Nychka, D. (1988). Bayesian confidence intervals for smoothing splines. <em>Journal of the American Statistical Association</em> 83, 1134–1143. doi:<a href="https://doi.org/10.1080/01621459.1988.10478711">10.1080/01621459.1988.10478711</a>.
</p>
</div>
<div id="ref-Ruppert2003-pt">
<p>
Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). <em>Semiparametric regression</em>. Cambridge University Press.
</p>
</div>
</div>
ISEC 2016 Talk
Gavin L. Simpson
2016-07-02T00:00:00-06:00
2016-07-02T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/07/02/isec-2016-talk/
<p>
My ISEC 2016 talk, <em>Estimating temporal change in mean and variance of community composition via location, scale additive models</em>, describes some of my recent research into methods to analyse palaeoenvironmental time series from sediment cores.
</p>
<p>
Using data from two varved lakes
</p>
<ul>
<li>
Lake 227, Experimental Lakes Area, Ontario, Canada, and
</li>
<li>
Baldeggersee, Switzerland,
</li>
</ul>
<p>
I use location scale generalised additive models to simultaneously model the mean (trend) and the variance of time series of fossil algal pigments and diatom counts.
</p>
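<p>
As a rough sketch of the idea only (not the actual models from the talk; the data and variable names below are simulated, hypothetical stand-ins), such a model can be fitted in <strong>mgcv</strong> with the <code>gaulss()</code> family, where one linear predictor models the mean and a second models the standard deviation:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Minimal sketch of a location-scale GAM; `pigment` and `year` are
## hypothetical, simulated stand-ins for the real data from the talk
library("mgcv")
set.seed(1)
year <- 1:200
pigment <- rnorm(200, mean = 10 + 0.02 * year, sd = exp(0.5 + 0.005 * year))
df <- data.frame(pigment = pigment, year = year)
## first formula: the mean; second: the (log-scale) standard deviation
m_ls <- gam(list(pigment ~ s(year), ~ s(year)), family = gaulss(), data = df)
summary(m_ls)</code></pre>
</figure>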
<p>
These techniques may be applied to data from less ideal situations, where observations are irregularly sampled in time and have varying sample intervals/effects of time averaging.
</p>
<p>
The slide deck can be downloaded from <a href="https://doi.org/10.6084/m9.figshare.3470144.v1">Figshare</a>.
</p>
<div style="margin-left: auto; margin-right: auto; width: 700px; height: 716px;">
<p>
<iframe src="https://widgets.figshare.com/articles/3470144/embed?show_title=1" width="700" height="716" frameborder="0">
</iframe>
</p>
</div>
Rootograms
Gavin L. Simpson
2016-06-07T00:00:00-06:00
2016-06-07T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/06/07/rootograms/
<p>
Assessing the fit of a count regression model is not necessarily a straightforward enterprise; often we just look at residuals, which invariably contain patterns of some form due to the discrete nature of the observations, or we plot observed versus fitted values as a scatter plot. Recently, while perusing the latest statistics offerings on arXiv I came across <span class="citation" data-cites="Kleiber2016-pt">Kleiber and Zeileis (2016)</span> who propose the <em>rootogram</em> as an improved approach to the assessment of fit of a count regression model. <a href="http://arxiv.org/abs/1605.01311">The paper</a> is illustrated using R and the authors’ <strong>countreg</strong> package (currently on R-Forge only). Here, I thought I’d take a quick look at the rootogram with some simulated species abundance data.
</p>
<div id="refs" class="references">
<div id="ref-Kleiber2016-pt">
<p>
Kleiber, C., and Zeileis, A. (2016). Visualizing count data regressions using rootograms. Available at: <a href="http://arxiv.org/abs/1605.01311">http://arxiv.org/abs/1605.01311</a>.
</p>
</div>
</div>
<p>
Start by simulating some data to work with. Here I use my <strong>coenocliner</strong> package, and simulate three data sets, each of which uses the same environmental gradient, but with counts drawn from the following distributions
</p>
<ol type="1">
<li>
Poisson
</li>
<li>
Negative binomial
</li>
<li>
Zero-inflated negative binomial
</li>
</ol>
<p>
To follow along here you’ll need the latest version of <strong>coenocliner</strong> from CRAN (>= 0.2-2), as a bug crept into my code when changing between parameterizations of the negative binomial.
</p>
<p>
Load <strong>coenocliner</strong> and set up
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"coenocliner"</span><span class="p">)</span><span class="w">
</span><span class="c1">## parameters for simulating</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">locs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="c1"># environmental locations</span><span class="w">
</span><span class="n">A0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">90</span><span class="w"> </span><span class="c1"># maximal abundance</span><span class="w">
</span><span class="n">mu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="c1"># position on gradient of optima</span><span class="w">
</span><span class="n">alpha</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1.5</span><span class="w"> </span><span class="c1"># parameter of beta response</span><span class="w">
</span><span class="n">gamma</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="c1"># parameter of beta response</span><span class="w">
</span><span class="n">r</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">6</span><span class="w"> </span><span class="c1"># range on gradient species is present</span><span class="w">
</span><span class="n">pars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mu</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gamma</span><span class="p">,</span><span class="w"> </span><span class="n">A0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">A0</span><span class="p">)</span><span class="w">
</span><span class="n">nb.alpha</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1.5</span><span class="w"> </span><span class="c1"># overdispersion parameter 1/theta</span><span class="w">
</span><span class="n">zprobs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0.3</span><span class="w"> </span><span class="c1"># prob(y == 0) in binomial model</span></code></pre>
</figure>
<p>
Now we can simulate counts for the 100 locations along the gradient for each of the three count models
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pois</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"beta"</span><span class="p">,</span><span class="w"> </span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w"> </span><span class="n">countModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"poisson"</span><span class="p">)</span><span class="w">
</span><span class="n">nb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"beta"</span><span class="p">,</span><span class="w"> </span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w"> </span><span class="n">countModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"negbin"</span><span class="p">,</span><span class="w">
</span><span class="n">countParams</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nb.alpha</span><span class="p">))</span><span class="w">
</span><span class="n">zinb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"beta"</span><span class="p">,</span><span class="w"> </span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w"> </span><span class="n">countModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ZINB"</span><span class="p">,</span><span class="w">
</span><span class="n">countParams</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nb.alpha</span><span class="p">,</span><span class="w"> </span><span class="n">zprobs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">zprobs</span><span class="p">))</span></code></pre>
</figure>
<p>
and combine them into a data frame with the gradient locations
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">cbind.data.frame</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">pois</span><span class="p">,</span><span class="w"> </span><span class="n">nb</span><span class="p">,</span><span class="w"> </span><span class="n">zinb</span><span class="p">),</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="s2">"x"</span><span class="p">,</span><span class="w"> </span><span class="s2">"yPois"</span><span class="p">,</span><span class="w"> </span><span class="s2">"yNegBin"</span><span class="p">,</span><span class="w"> </span><span class="s2">"yZINB"</span><span class="p">))</span></code></pre>
</figure>
<p>
To each of these I'm going to fit a Poisson GLM. Because we know the true data-generating process in each case, the resulting rootograms will show how they can facilitate model evaluation, and what happens when the wrong model, in this case a Poisson GLM, is fitted to the data
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">glm.pois</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="n">yPois</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">poisson</span><span class="p">)</span><span class="w">
</span><span class="n">glm.nb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="n">yNegBin</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">poisson</span><span class="p">)</span><span class="w">
</span><span class="n">glm.zinb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="n">yZINB</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">poisson</span><span class="p">)</span></code></pre>
</figure>
<p>
In each case, a Poisson GLM was fitted even though we know that for <code>yNegBin</code> and <code>yZINB</code> the data-generating process was not Poisson.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>
</p>
<p>
Next, generate rootograms for each of these models. I start by loading the <em>countreg</em> package as well as <strong>ggplot2</strong>, as I'll plot the rootograms using the latter rather than base graphics. If you don't have <em>countreg</em> installed, install it from R-Forge using <code>install.packages("countreg", repos = "http://R-Forge.R-project.org")</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"countreg"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span></code></pre>
</figure>
<p>
Rootograms are calculated using the <code>rootogram()</code> function. You can provide the observed and expected (given the model) counts as arguments to <code>rootogram()</code> or, most usefully for our purposes, a fitted count model object from which the relevant values will be extracted. <code>rootogram()</code> knows about <code>glm</code>, <code>gam</code>, <code>gamlss</code>, <code>hurdle</code>, and <code>zeroinfl</code> objects at the time of writing.
</p>
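<p>
To get a feel for what <code>rootogram()</code> computes from a fitted model, here is a minimal sketch in base R of the observed and expected frequencies for each count bin, using the Poisson GLM fitted above. The variable names (<code>lambda</code>, <code>bins</code>, <code>expd</code>, <code>obsd</code>) are mine for illustration; they are not part of <em>countreg</em>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## illustrative sketch only; the names here are not part of countreg
lambda <- fitted(glm.pois)              # per-observation Poisson means
bins <- 0:max(df$yPois)                 # count bins 0, 1, 2, ...
## expected frequency of bin k: sum over observations of Pr(y_i == k)
expd <- sapply(bins, function(k) sum(dpois(k, lambda)))
## observed frequency of each bin
obsd <- as.numeric(table(factor(df$yPois, levels = bins)))
## a hanging rootogram draws bars of height sqrt(obsd) hanging from the
## curve sqrt(expd), so the base of each bar sits at sqrt(expd) - sqrt(obsd)
head(data.frame(bin = bins, observed = obsd, expected = round(expd, 2)))</code></pre>
</figure>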
<p>
Three different kinds of rootograms are discussed in the paper
</p>
<ol type="1">
<li>
Standing,
</li>
<li>
Hanging, and
</li>
<li>
Suspended.
</li>
</ol>
<p>
<span class="citation" data-cites="Kleiber2016-pt">Kleiber and Zeileis (2016)</span> recommend <em>hanging</em> or <em>suspended</em> rootograms, for reasons I'll mention shortly. Which type of rootogram is produced is controlled via argument <code>style</code>. The final option I use below is <code>plot = FALSE</code>, which suppresses plotting of the rootogram as I want to do that later using <em>ggplot</em>.
</p>
<p>
Generate the three rootograms
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">root.pois</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rootogram</span><span class="p">(</span><span class="n">glm.pois</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"hanging"</span><span class="p">,</span><span class="w"> </span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">root.nb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rootogram</span><span class="p">(</span><span class="n">glm.nb</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"hanging"</span><span class="p">,</span><span class="w"> </span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">root.zinb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rootogram</span><span class="p">(</span><span class="n">glm.zinb</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"hanging"</span><span class="p">,</span><span class="w"> </span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<p>
and gather them into an object for plotting; notice I'm using the <code>autoplot()</code> method to generate <em>ggplot2</em> plot objects, and adjusting the limits to make the plots comparable. The resulting figure is shown below the code
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ylims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ylim</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">7</span><span class="p">)</span><span class="w"> </span><span class="c1"># common scale for comparison</span><span class="w">
</span><span class="n">plot_grid</span><span class="p">(</span><span class="n">autoplot</span><span class="p">(</span><span class="n">root.pois</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylims</span><span class="p">,</span><span class="w"> </span><span class="n">autoplot</span><span class="p">(</span><span class="n">root.nb</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylims</span><span class="p">,</span><span class="w">
</span><span class="n">autoplot</span><span class="p">(</span><span class="n">root.zinb</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylims</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"auto"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/rootograms-plot-rootograms-1.png" alt="Hanging rootograms for a Poisson GLM fitted to simulated Poisson (a), negative binomial (b), and zero-inflated negative binomial (c) count data" />
<figcaption>
Hanging rootograms for a Poisson GLM fitted to simulated Poisson (a), negative binomial (b), and zero-inflated negative binomial (c) count data
</figcaption>
</figure>
<p>
Looking first at panel <strong>a</strong> we see the main features of the rootogram:
</p>
<ul>
<li>
<em>expected</em> counts, given the model, are shown by the thick red line,
</li>
<li>
<em>observed</em> counts are shown as bars, which in a <em>hanging</em> rootogram are shown hanging from the red line of expected counts,
</li>
<li>
on the <em>x</em>-axis we have the count bins (0, 1, 2, etc.),
</li>
<li>
on the <em>y</em>-axis we have the square root of the observed or expected count; the square-root transformation allows departures from expectations to be seen even at small frequencies (see the short illustration after this list), and
</li>
<li>
a reference line is drawn at a height of 0.
</li>
</ul>
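<p>
As a quick illustration of why the square-root scale helps (toy numbers of my own, not from the paper): a departure of three counts carries much more evidence against the model in a rare bin than in a common one, and the square-root transformation, which approximately stabilises the variance of counts, scales the bars accordingly
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## toy numbers: the same discrepancy of 3 counts on the square-root scale
sqrt(4)   - sqrt(1)    # rare bin:   observed 4 vs expected 1    -> 1.00
sqrt(104) - sqrt(101)  # common bin: observed 104 vs expected 101 -> ~0.15</code></pre>
</figure>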
<p>
Because this is a <em>hanging</em> rootogram, we can think of the rootogram as relating to the <em>fitted</em> counts: if a bar doesn't reach the zero line then the model <em>over predicts</em> a particular count bin, and if the bar exceeds the zero line it <em>under predicts</em>.
</p>
<p>
For the Poisson GLM fitted to counts generated from a Poisson distribution (panel a) we see generally good agreement between the expected and observed counts, with a small amount of under prediction of some counts between 10 and 20. For the Poisson GLM fitted to the data generated from a negative binomial distribution (panel b) we see a much poorer fit: the zero count is under predicted whilst some low counts are over predicted, and a large number of count bins between 4 and 10 counts are under predicted. Focusing on the bottom of the bars we see an undulating pattern, with runs either above or below the zero reference line, highlighting a general lack of fit in the model.
</p>
<p>
The fit of the Poisson GLM to data generated using a ZINB also shows considerable model lack of fit; strong under prediction of the zero bin and over prediction of the 1 count bin, with perhaps some general over prediction across most bins.
</p>
<p>
It is useful to compare rootograms showing the fits of incorrect and correct models side by side. To that end, I next fit a negative binomial GLM and a ZINB model, using the <code>glm.nb()</code> function from package <strong>MASS</strong> and the <code>zeroinfl()</code> function from package <em>countreg</em> respectively, and create the relevant rootograms
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"MASS"</span><span class="p">)</span><span class="w">
</span><span class="n">glm2.nb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glm.nb</span><span class="p">(</span><span class="n">yNegBin</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">)</span><span class="w">
</span><span class="n">glm2.zinb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">zeroinfl</span><span class="p">(</span><span class="n">yZINB</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">dist</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"negbin"</span><span class="p">)</span><span class="w">
</span><span class="c1">## create rootograms</span><span class="w">
</span><span class="n">root2.nb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rootogram</span><span class="p">(</span><span class="n">glm2.nb</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"hanging"</span><span class="p">,</span><span class="w"> </span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">root2.zinb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rootogram</span><span class="p">(</span><span class="n">glm2.zinb</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"hanging"</span><span class="p">,</span><span class="w"> </span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<p>
First, we look at the negative binomial data and compare rootograms of the Poisson and negative binomial model fits
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot_grid</span><span class="p">(</span><span class="n">autoplot</span><span class="p">(</span><span class="n">root.nb</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylims</span><span class="p">,</span><span class="w"> </span><span class="n">autoplot</span><span class="p">(</span><span class="n">root2.nb</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylims</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"auto"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/rootograms-plot-negbin-rootos-1.png" alt="Hanging rootograms for Poisson GLM (a) and negative binomial model (b) fits to the simulated negative binomial count data" />
<figcaption>
Hanging rootograms for Poisson GLM (a) and negative binomial model (b) fits to the simulated negative binomial count data
</figcaption>
</figure>
<p>
The rootogram for the negative binomial GLM fit (panel b) shows much better agreement with the data than that of the Poisson fit (panel a). Departures from expected counts are much smaller and the zero-count bin is much better fitted. Some small deviations from the observed data remain but that is to be expected.
</p>
<p>
Next we compare rootograms for the fits of the Poisson GLM and ZINB model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ylims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ylim</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">8.5</span><span class="p">)</span><span class="w">
</span><span class="n">plot_grid</span><span class="p">(</span><span class="n">autoplot</span><span class="p">(</span><span class="n">root.zinb</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylims</span><span class="p">,</span><span class="w"> </span><span class="n">autoplot</span><span class="p">(</span><span class="n">root2.zinb</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylims</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"auto"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/rootograms-plot-zinb-rootos-1.png" alt="Hanging rootograms for Poisson GLM (a) and zero-inflated negative binomial model (b) fits to the simulated zero-inflated negative binomial count data" />
<figcaption>
Hanging rootograms for Poisson GLM (a) and zero-inflated negative binomial model (b) fits to the simulated zero-inflated negative binomial count data
</figcaption>
</figure>
<p>
The rootogram for the ZINB model (panel b) shows better agreement with the zero-count bin than the Poisson model (panel a), though the fits for the remaining count bins are similar in the two models. In particular, the ZINB model is still over predicting single counts.
</p>
<p>
Suspended rootograms are also recommended by <span class="citation" data-cites="Kleiber2016-pt">Kleiber and Zeileis (2016)</span>. These rootograms show the <em>difference</em> between observed and expected counts, with bars hanging from the zero-line rather than from the expected count line. We can therefore think of this style as conveying information about the model residuals, rather than about the fitted values as the hanging rootogram does. A suspended rootogram is produced using <code>style = "suspended"</code> and an example, for the ZINB model, is shown below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">autoplot</span><span class="p">(</span><span class="n">rootogram</span><span class="p">(</span><span class="n">glm2.zinb</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"suspended"</span><span class="p">,</span><span class="w"> </span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/rootograms-plot-zinb-suspended-1.png" alt="Suspended rootogram for a zero-inflated negative binomial model fitted to the simulated zero-inflated negative binomial count data" />
<figcaption>
Suspended rootogram for a zero-inflated negative binomial model fitted to the simulated zero-inflated negative binomial count data
</figcaption>
</figure>
<p>
Standing rootograms are not recommended by <span class="citation" data-cites="Kleiber2016-pt">Kleiber and Zeileis (2016)</span> as they simply show the expected and observed counts, and the user then has to compare the height of each bar with the expected curve for each bin. By tying the bars to the expected curve or to the zero reference line in hanging or suspended rootograms, the assessment of fit is made by comparing deviations from the reference line rather than by bin-by-bin comparison of observed and expected counts. A standing rootogram, for completeness, is shown below for the ZINB model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">autoplot</span><span class="p">(</span><span class="n">rootogram</span><span class="p">(</span><span class="n">glm2.zinb</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"standing"</span><span class="p">,</span><span class="w"> </span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/rootograms-plot-zinb-standing-1.png" alt="Standing rootogram for a zero-inflated negative binomial model fitted to the simulated zero-inflated negative binomial count data" />
<figcaption>
Standing rootogram for a zero-inflated negative binomial model fitted to the simulated zero-inflated negative binomial count data
</figcaption>
</figure>
<p>
A neat feature of the <em>countreg</em> package is that rootograms can be combined using the <code>c()</code> or <code>cbind()</code> methods, which makes plotting multiple rootograms much simpler than I showed above. For example, to compare the Poisson and negative binomial model fits to the negative binomial counts one could have used
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">autoplot</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">root.nb</span><span class="p">,</span><span class="w"> </span><span class="n">root2.nb</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/rootograms-plot-concatenated-rootograms-1.png" alt="Result of plotting two rootograms that were combined using cbind()" />
<figcaption>
Result of plotting two rootograms that were combined using <code>cbind()</code>
</figcaption>
</figure>
<p>
So, there we go; these are rootograms and they seem like a pretty useful tool for assessing fits of count models. I really recommend having a look at <span class="citation" data-cites="Kleiber2016-pt">Kleiber and Zeileis (2016)</span> as it contains much more discussion and illustration of the proposed rootograms than I could possibly include here. They also have a nice ecological example of data from an investigation into horseshoe crab mating, plus two other examples. Their paper will shortly appear in the journal <em>The American Statistician</em>, although at the time of writing I don't have citation details for that version of the paper.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Kleiber2016-pt">
<p>
Kleiber, C., and Zeileis, A. (2016). Visualizing count data regressions using rootograms. Available at: <a href="http://arxiv.org/abs/1605.01311">http://arxiv.org/abs/1605.01311</a>.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
I'm kind of glossing over the fact that a quadratic function of <em>x</em> is not really the true model here, which is a generalised beta response function. This kind of sets up a follow-up post using a GAM fit…<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
Harvesting more Canadian climate data
Gavin L. Simpson
2016-05-24T00:00:00-06:00
2016-05-24T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/05/24/harvesting-more-canadian-climate-data/
<p>
A <a href="/2015/01/14/harvesting-canadian-climate-data/">while back I wrote</a> some code to download climate data from the Government of Canada's historical climate/weather data website for one of our students. In May this year (2016) the Government of Canada changed their website a little: the API that responds to requests had changed URL and some of the GET parameters had also changed. In fixing those functions I also noted that the original code only downloaded hourly data, and not all useful weather variables are recorded hourly; precipitation, for example, is only available in the daily and monthly data formats. This post updates the earlier one, explaining what changed and how the code has been updated. As an added benefit, the functions can now handle downloading daily and monthly data files as well as the hourly files that the original could handle.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/screenshot-gc-climate-website.jpeg" title="Screenshot of Government of Canada's climate website" alt="Screenshot of Government of Canada's climate website" />
<figcaption>
Screenshot of Government of Canada's climate website
</figcaption>
</figure>
<p>
The <code>genURLS()</code> function now has an extra argument <code>timeframe</code> which allows you to select which type of data to download, defaulting to hourly data:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">genURLS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">,</span><span class="w"> </span><span class="n">timeframe</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"hourly"</span><span class="p">,</span><span class="w"> </span><span class="s2">"daily"</span><span class="p">,</span><span class="w"> </span><span class="s2">"monthly"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">years</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">nyears</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">years</span><span class="p">)</span><span class="w">
</span><span class="n">timeframe</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">match.arg</span><span class="p">(</span><span class="n">timeframe</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">all.equal</span><span class="p">(</span><span class="n">timeframe</span><span class="p">,</span><span class="w"> </span><span class="s2">"hourly"</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">years</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">years</span><span class="p">,</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w">
</span><span class="n">months</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nyears</span><span class="p">)</span><span class="w">
</span><span class="n">ids</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">nyears</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">all.equal</span><span class="p">(</span><span class="n">timeframe</span><span class="p">,</span><span class="w"> </span><span class="s2">"daily"</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">months</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="c1"># this is essentially arbitrary & ignored if daily</span><span class="w">
</span><span class="n">ids</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">nyears</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">years</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="c1"># again arbitrary, for monthly it just gives you all data</span><span class="w">
</span><span class="n">months</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="c1"># and this is also ignored</span><span class="w">
</span><span class="n">ids</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">id</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">timeframe</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">match</span><span class="p">(</span><span class="n">timeframe</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"hourly"</span><span class="p">,</span><span class="w"> </span><span class="s2">"daily"</span><span class="p">,</span><span class="w"> </span><span class="s2">"monthly"</span><span class="p">))</span><span class="w">
</span><span class="n">URLS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="s2">"http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID="</span><span class="p">,</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w">
</span><span class="s2">"&Year="</span><span class="p">,</span><span class="w"> </span><span class="n">years</span><span class="p">,</span><span class="w">
</span><span class="s2">"&Month="</span><span class="p">,</span><span class="w"> </span><span class="n">months</span><span class="p">,</span><span class="w">
</span><span class="s2">"&Day=14"</span><span class="p">,</span><span class="w">
</span><span class="s2">"&format=csv"</span><span class="p">,</span><span class="w">
</span><span class="s2">"&timeframe="</span><span class="p">,</span><span class="w"> </span><span class="n">timeframe</span><span class="p">,</span><span class="w">
</span><span class="s2">"&submit=%20Download+Data"</span><span class="c1">## need this stoopid thing as of 11-May-2016</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="nf">list</span><span class="p">(</span><span class="n">urls</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">URLS</span><span class="p">,</span><span class="w"> </span><span class="n">ids</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ids</span><span class="p">,</span><span class="w"> </span><span class="n">years</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">years</span><span class="p">,</span><span class="w"> </span><span class="n">months</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">months</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">URLS</span><span class="p">)))</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
If we wanted all the data for 2014 for the Regina RCS station then we could generate the URLs we'd need to visit as follows
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">regina</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">genURLS</span><span class="p">(</span><span class="m">28011</span><span class="p">,</span><span class="w"> </span><span class="m">2014</span><span class="p">,</span><span class="w"> </span><span class="m">2014</span><span class="p">)</span><span class="w">
</span><span class="nf">length</span><span class="p">(</span><span class="n">regina</span><span class="o">$</span><span class="n">urls</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">regina</span><span class="o">$</span><span class="n">urls</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 12
[1] "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID=28011&Year=2014&Month=1&Day=14&format=csv&timeframe=1&submit=%20Download+Data"
[2] "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID=28011&Year=2014&Month=2&Day=14&format=csv&timeframe=1&submit=%20Download+Data"
[3] "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID=28011&Year=2014&Month=3&Day=14&format=csv&timeframe=1&submit=%20Download+Data"
[4] "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID=28011&Year=2014&Month=4&Day=14&format=csv&timeframe=1&submit=%20Download+Data"
[5] "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID=28011&Year=2014&Month=5&Day=14&format=csv&timeframe=1&submit=%20Download+Data"
[6] "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID=28011&Year=2014&Month=6&Day=14&format=csv&timeframe=1&submit=%20Download+Data"</code></pre>
</figure>
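<p>
For the other two timeframes, far fewer URLs are needed. For example, judging from the <code>genURLS()</code> code above (I'm not showing the downloaded output here), requesting daily data yields one URL per year, each with <code>&timeframe=2</code> in the query string
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## daily data: one URL per year requested, rather than one per month
regina.daily <- genURLS(28011, 2013, 2014, timeframe = "daily")
length(regina.daily$urls)  # 2</code></pre>
</figure>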
<p>
The function that downloads and reads in the data is <code>getData()</code>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">getData</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">stations</span><span class="p">,</span><span class="w"> </span><span class="n">folder</span><span class="p">,</span><span class="w"> </span><span class="n">timeframe</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"hourly"</span><span class="p">,</span><span class="w"> </span><span class="s2">"daily"</span><span class="p">,</span><span class="w"> </span><span class="s2">"monthly"</span><span class="p">),</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">delete</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">timeframe</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">match.arg</span><span class="p">(</span><span class="n">timeframe</span><span class="p">)</span><span class="w">
</span><span class="c1">## form URLS</span><span class="w">
</span><span class="n">urls</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="nf">seq_len</span><span class="p">(</span><span class="n">NROW</span><span class="p">(</span><span class="n">stations</span><span class="p">)),</span><span class="w">
</span><span class="k">function</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">stations</span><span class="p">,</span><span class="w"> </span><span class="n">timeframe</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">genURLS</span><span class="p">(</span><span class="n">stations</span><span class="o">$</span><span class="n">StationID</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w">
</span><span class="n">stations</span><span class="o">$</span><span class="n">start</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w">
</span><span class="n">stations</span><span class="o">$</span><span class="n">end</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">timeframe</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">timeframe</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w"> </span><span class="n">stations</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stations</span><span class="p">,</span><span class="w"> </span><span class="n">timeframe</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">timeframe</span><span class="p">)</span><span class="w">
</span><span class="c1">## check the folder exists and try to create it if not</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="n">folder</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">warning</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"Directory:"</span><span class="p">,</span><span class="w"> </span><span class="n">folder</span><span class="p">,</span><span class="w">
</span><span class="s2">"doesn't exist. Will create it"</span><span class="p">))</span><span class="w">
</span><span class="n">fc</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">dir.create</span><span class="p">(</span><span class="n">folder</span><span class="p">))</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">inherits</span><span class="p">(</span><span class="n">fc</span><span class="p">,</span><span class="w"> </span><span class="s2">"try-error"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">stop</span><span class="p">(</span><span class="s2">"Failed to create directory '"</span><span class="p">,</span><span class="w"> </span><span class="n">folder</span><span class="p">,</span><span class="w">
</span><span class="s2">"'. Check path and permissions."</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## Extract the data from the URLs generation</span><span class="w">
</span><span class="n">URLS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="s1">'[['</span><span class="p">,</span><span class="w"> </span><span class="s2">"urls"</span><span class="p">))</span><span class="w">
</span><span class="n">sites</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="s1">'[['</span><span class="p">,</span><span class="w"> </span><span class="s2">"ids"</span><span class="p">))</span><span class="w">
</span><span class="n">years</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="s1">'[['</span><span class="p">,</span><span class="w"> </span><span class="s2">"years"</span><span class="p">))</span><span class="w">
</span><span class="n">months</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="s1">'[['</span><span class="p">,</span><span class="w"> </span><span class="s2">"months"</span><span class="p">))</span><span class="w">
</span><span class="c1">## filenames to use to save the data</span><span class="w">
</span><span class="n">fnames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste</span><span class="p">(</span><span class="n">sites</span><span class="p">,</span><span class="w"> </span><span class="n">years</span><span class="p">,</span><span class="w"> </span><span class="n">months</span><span class="p">,</span><span class="w"> </span><span class="s2">"data.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"-"</span><span class="p">)</span><span class="w">
</span><span class="n">fnames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">file.path</span><span class="p">(</span><span class="n">folder</span><span class="p">,</span><span class="w"> </span><span class="n">fnames</span><span class="p">)</span><span class="w">
</span><span class="n">nfiles</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">fnames</span><span class="p">)</span><span class="w">
</span><span class="c1">## set up a progress bar if being verbose</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">verbose</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">pb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">txtProgressBar</span><span class="p">(</span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nfiles</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="nf">on.exit</span><span class="p">(</span><span class="n">close</span><span class="p">(</span><span class="n">pb</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">out</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vector</span><span class="p">(</span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"list"</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nfiles</span><span class="p">)</span><span class="w">
</span><span class="n">hourlyNames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Date/Time"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Year"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Month"</span><span class="p">,</span><span class="s2">"Day"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Time"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Data Quality"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Temp Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Dew Point Temp (degC)"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Dew Point Temp Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Rel Hum (%)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Rel Hum Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Wind Dir (10s deg)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Wind Dir Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Wind Spd (km/h)"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Wind Spd Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Visibility (km)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Visibility Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Stn Press (kPa)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Stn Press Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Hmdx"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Hmdx Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Wind Chill"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Wind Chill Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Weather"</span><span class="p">)</span><span class="w">
</span><span class="n">dailyNames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Date/Time"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Year"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Month"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Day"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Data Quality"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Max Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Max Temp Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Min Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Min Temp Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Mean Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Mean Temp Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Heat Deg Days (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Heat Deg Days Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Cool Deg Days (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Cool Deg Days Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Total Rain (mm)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Total Rain Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Total Snow (cm)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Total Snow Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Total Precip (mm)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Total Precip Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Snow on Grnd (cm)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Snow on Grnd Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Dir of Max Gust (10s deg)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Dir of Max Gust Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Spd of Max Gust (10s deg)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Spd of Max Gust Flag"</span><span class="p">)</span><span class="w">
</span><span class="n">monthlyNames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Date/Time"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Year"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Month"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Mean Max Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Mean Max Temp Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Mean Min Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Mean Min Temp Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Mean Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Mean Temp Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Extr Max Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Extr Max Temp Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Extr Min Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Extr Min Temp Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Total Rain (mm)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Total Rain Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Total Snow (cm)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Total Snow Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Total Precip (mm)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Total Precip Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Snow Grnd Last Day (cm)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Snow Grnd Last Day Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Dir of Max Gust (10s deg)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Dir of Max Gust Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Spd of Max Gust (10s deg)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Spd of Max Gust Flag"</span><span class="p">)</span><span class="w">
</span><span class="n">cnames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">switch</span><span class="p">(</span><span class="n">timeframe</span><span class="p">,</span><span class="w"> </span><span class="n">hourly</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hourlyNames</span><span class="p">,</span><span class="w"> </span><span class="n">daily</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dailyNames</span><span class="p">,</span><span class="w"> </span><span class="n">monthly</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">monthlyNames</span><span class="p">)</span><span class="w">
</span><span class="n">TIMEFRAME</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">match</span><span class="p">(</span><span class="n">timeframe</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"hourly"</span><span class="p">,</span><span class="w"> </span><span class="s2">"daily"</span><span class="p">,</span><span class="w"> </span><span class="s2">"monthly"</span><span class="p">))</span><span class="w">
</span><span class="n">SKIP</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="m">25</span><span class="p">,</span><span class="w"> </span><span class="m">18</span><span class="p">)[</span><span class="n">TIMEFRAME</span><span class="p">]</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">nfiles</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">curfile</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fnames</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="c1">## Have we downloaded the file before?</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="n">curfile</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># No: download it</span><span class="w">
</span><span class="n">dload</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">download.file</span><span class="p">(</span><span class="n">URLS</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">destfile</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">curfile</span><span class="p">,</span><span class="w"> </span><span class="n">quiet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">inherits</span><span class="p">(</span><span class="n">dload</span><span class="p">,</span><span class="w"> </span><span class="s2">"try-error"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># If problem, store failed URL...</span><span class="w">
</span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">URLS</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">verbose</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">setTxtProgressBar</span><span class="p">(</span><span class="n">pb</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="c1"># update progress bar...</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">next</span><span class="w"> </span><span class="c1"># bail out of current iteration</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## Must have downloaded, try to read file</span><span class="w">
</span><span class="c1">## skip first SKIP rows of header stuff</span><span class="w">
</span><span class="c1">## encoding must be latin1 or will fail - may still be problems with character set</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">read.csv</span><span class="p">(</span><span class="n">curfile</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SKIP</span><span class="p">,</span><span class="w"> </span><span class="n">encoding</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"latin1"</span><span class="p">,</span><span class="w"> </span><span class="n">stringsAsFactors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">),</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1">## Did we have a problem reading the data?</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">inherits</span><span class="p">(</span><span class="n">cdata</span><span class="p">,</span><span class="w"> </span><span class="s2">"try-error"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># yes handle read problem</span><span class="w">
</span><span class="c1">## try to fix the problem with dodgy characters</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readLines</span><span class="p">(</span><span class="n">curfile</span><span class="p">)</span><span class="w"> </span><span class="c1"># read all lines in file</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">iconv</span><span class="p">(</span><span class="n">cdata</span><span class="p">,</span><span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"latin1"</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"UTF-8"</span><span class="p">)</span><span class="w">
</span><span class="n">writeLines</span><span class="p">(</span><span class="n">cdata</span><span class="p">,</span><span class="w"> </span><span class="n">curfile</span><span class="p">)</span><span class="w"> </span><span class="c1"># write the data back to the file</span><span class="w">
</span><span class="c1">## try to read the file again, if still an error, bail out</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">read.csv</span><span class="p">(</span><span class="n">curfile</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SKIP</span><span class="p">,</span><span class="w"> </span><span class="n">encoding</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"UTF-8"</span><span class="p">,</span><span class="w"> </span><span class="n">stringsAsFactors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">),</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">inherits</span><span class="p">(</span><span class="n">cdata</span><span class="p">,</span><span class="w"> </span><span class="s2">"try-error"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># yes, still!, handle read problem</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">delete</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">file.remove</span><span class="p">(</span><span class="n">curfile</span><span class="p">)</span><span class="w"> </span><span class="c1"># remove file if a problem & deleting</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">URLS</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="c1"># record failed URL...</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">verbose</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">setTxtProgressBar</span><span class="p">(</span><span class="n">pb</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="c1"># update progress bar...</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">next</span><span class="w"> </span><span class="c1"># bail out of current iteration</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## Must have (eventually) read file OK, add station data</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind.data.frame</span><span class="p">(</span><span class="n">StationID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">sites</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">NROW</span><span class="p">(</span><span class="n">cdata</span><span class="p">)),</span><span class="w">
</span><span class="n">cdata</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">cdata</span><span class="p">)[</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cnames</span><span class="w">
</span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cdata</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">verbose</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># Update the progress bar</span><span class="w">
</span><span class="n">setTxtProgressBar</span><span class="p">(</span><span class="n">pb</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">out</span><span class="w"> </span><span class="c1"># return</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
The main infelicity is that you have to supply <code>getData()</code> with a data frame containing the station IDs and the start and end years for the data you want to collect. This suited my needs, as we wanted to grab data from 10 stations with different start and end years as required to track station movements. It's not as convenient if you only want to grab the data for a single station, however.
</p>
<p>
<code>getData()</code> gains the same <code>timeframe</code> argument as <code>genURLS()</code>. In addition, to handle the, quite frankly, odd choice of characters used in the various flag columns, I now convert the file encoding from <code>latin1</code> to <code>UTF-8</code> using the <code>iconv()</code> function. Whether this works portably remains to be seen – I'm not that familiar with file encodings. If it doesn't work, an option would be to determine what the user's locale is and from that convert the encoding to the native one.
</p>
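<p>
If you wanted to experiment with that locale-based fallback, a minimal, untested sketch might look like the following; <code>curfile</code> is a placeholder for the path to one of the downloaded CSVs
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## sketch only: convert a downloaded file to the native charset
enc <- localeToCharset(Sys.getlocale("LC_CTYPE"))[1] # native charset, e.g. "UTF-8"
raw <- readLines(curfile, encoding = "latin1")       # read assuming latin1
writeLines(iconv(raw, from = "latin1", to = enc), curfile)</code></pre>
</figure>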
<p>
One thing you'll note quickly if you start downloading data using this function is that the web script the Government of Canada is using on their climate website will quite happily generate a fully-formed file containing no actual data (but with all the headers, hourly time stamps, etc) if you ask it for data outside the window of observations for a given station. There are no errors, just lots of mostly empty files, bar the header and labels.
</p>
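<p>
A cheap guard against these empty-but-valid files, should you want one, is to test whether every data column is empty after reading. Something like this hypothetical helper would do; the metadata column names here are assumptions based on the daily files
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## hypothetical helper: TRUE if every non-metadata column is all NA or ""
emptyFile <- function(df, metaCols = c("StationID", "Date/Time", "Year",
                                       "Month", "Day", "Time")) {
    dataCols <- setdiff(names(df), metaCols)
    all(vapply(df[dataCols],
               function(x) all(is.na(x) | x == ""),
               logical(1)))
}</code></pre>
</figure>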
<p>
One other thing to note is that <code>getData()</code> returns the downloaded data as a list, and no attempt is made to flatten the individual components to a single large data frame. That's because it allows for any failed data downloads (or reads) and records the failed URL instead of the data. This gives you a chance to manually check those URLs to see what the problem might be before re-running the job, which, because we saved all the CSVs, will run very quickly from that local cache.
</p>
<p>
The use of <code>data.frame</code>s internally is showing signs of being a bit of a bottleneck performance-wise; <code>rbind()</code>-ing many stations or files of data takes a long time. I plan on changing the code to use <code>tbl_df</code>s now that Hadley has moved that functionality to the <strong>tibble</strong> package. I am reliably informed that <code>bind_rows()</code> is much quicker.
</p>
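<p>
For what it's worth, a sketch of that swap, assuming you have <strong>dplyr</strong> installed, and remembering to first drop any failed URLs (stored as character vectors) from the list
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("dplyr")
ok  <- !vapply(out, is.character, logical(1)) # out is the list from getData()
met <- bind_rows(out[ok])                     # much faster than do.call("rbind", ...)</code></pre>
</figure>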
<p>
The eagle-eyed among you will notice the dreaded <code>stringsAsFactors = FALSE</code> in the definition of <code>getData()</code>. I'm beginning to see why people who work with messy data find the default <code>stringsAsFactors = TRUE</code> downright abhorrent!
</p>
<p>
To see <code>getData()</code> in action, we'll run a quick job, downloading the 2014 data for two stations
</p>
<ul>
<li>
Regina INTL A (51441)
</li>
<li>
Indian Head CDA (2925)
</li>
</ul>
<p>
First we create a data frame of station information
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">stations</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">StationID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">51441</span><span class="p">,</span><span class="w"> </span><span class="m">2925</span><span class="p">),</span><span class="w">
</span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">2014</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
</span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">2014</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span></code></pre>
</figure>
<p>
Then we pass this to <code>getData()</code> with the path to the folder we wish to cache downloaded CSVs in
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">met</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">getData</span><span class="p">(</span><span class="n">stations</span><span class="p">,</span><span class="w"> </span><span class="n">folder</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"./csv"</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning in getData(stations, folder = "./csv", verbose = FALSE):
Directory: ./csv doesn't exist. Will create it</code></pre>
</figure>
<p>
This will take a few minutes to run, even for just 24 files, as the site is not the quickest to respond to requests (or perhaps they are now throttling my workstation's IP?). Note I turned off the printing of the progress bar here, only because it doesn't play nicely with <strong>knitr</strong>'s capturing of the output. In real use, you'll want to leave the progress bar on (which it is by default) so you can see how long you have to wait until the job is done.
</p>
<p>
Once this has finished, we can quickly determine if there were any failures
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="nf">any</span><span class="p">(</span><span class="n">failed</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">met</span><span class="p">,</span><span class="w"> </span><span class="n">is.character</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] FALSE</code></pre>
</figure>
<p>
If any had failed, the <code>failed</code> logical vector could be used to index into <code>met</code> to extract the URLs that encountered problems, e.g.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">unlist</span><span class="p">(</span><span class="n">met</span><span class="p">[</span><span class="n">failed</span><span class="p">])</span></code></pre>
</figure>
<p>
If there were no problems, then the components of <code>met</code> can be bound into a data frame using <code>rbind()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">met</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="s2">"rbind"</span><span class="p">,</span><span class="w"> </span><span class="n">met</span><span class="p">)</span></code></pre>
</figure>
<p>
The data now looks like this
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">head</span><span class="p">(</span><span class="n">met</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> StationID Date/Time Year Month Day Time Data Quality Temp (degC)
1 51441 2014-01-01 00:00 2014 1 1 00:00 \u0087 -23.3
2 51441 2014-01-01 01:00 2014 1 1 01:00 \u0087 -23.1
3 51441 2014-01-01 02:00 2014 1 1 02:00 \u0087 -22.8
4 51441 2014-01-01 03:00 2014 1 1 03:00 \u0087 -23.3
5 51441 2014-01-01 04:00 2014 1 1 04:00 \u0087 -24.3
6 51441 2014-01-01 05:00 2014 1 1 05:00 \u0087 -24.3
Temp Flag Dew Point Temp (degC) Dew Point Temp Flag Rel Hum (%)
1 -26.3 77
2 -26.1 77
3 -25.8 77
4 -26.3 77
5 -27.1 78
6 -27.0 79
Rel Hum Flag Wind Dir (10s deg) Wind Dir Flag Wind Spd (km/h)
1 13 <NA> 22
2 12 <NA> 26
3 12 <NA> 22
4 13 <NA> 18
5 13 <NA> 14
6 9 <NA> 6
Wind Spd Flag Visibility (km) Visibility Flag Stn Press (kPa)
1 19.3 <NA> 95.38
2 24.1 <NA> 95.38
3 24.1 <NA> 95.39
4 24.1 <NA> 95.47
5 24.1 <NA> 95.56
6 24.1 <NA> 95.60
Stn Press Flag Hmdx Hmdx Flag Wind Chill Wind Chill Flag
1 NA NA -35 NA
2 NA NA -36 NA
3 NA NA -35 NA
4 NA NA -34 NA
5 NA NA -34 NA
6 NA NA -30 NA
Weather
1 Snow,Blowing Snow
2 Snow,Blowing Snow
3 Snow,Blowing Snow
4 Snow,Blowing Snow
5 Snow
6 <NA></code></pre>
</figure>
<p>
Yep, still a bit of a mess; some post-processing is required if you want tidy <code>names</code> etc. The column names are hardcoded but retain the messy names given to them by the Government of Canada's webmaster. Cleaning up afterwards remains advised.
</p>
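<p>
As a hedged illustration of the sort of post-processing I mean, the following would knock the worst of the mess out of the column names; it simply lower-cases them and swaps the awkward characters for underscores
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## sketch: tidy the messy column names in place
nms <- tolower(names(met))
nms <- gsub("[ /()%]+", "_", nms) # spaces, slashes, parens, % -> _
nms <- gsub("_+$", "", nms)       # drop any trailing underscores
names(met) <- nms</code></pre>
</figure>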
<p>
A final note: I could have run this over all the cores in my workstation, or even on all the computers in my small computer cluster, but I didn't, instead choosing to run on a single core overnight to get the data we needed. Please be a good netizen if you do use the functions I've discussed here, as other people will no doubt want to access the Government of Canada's website. Don't flood the site with requests!
</p>
<p>
If you have any suggestions for improvements or changes, let me know in the comments. The latest versions of the <code>genURLS()</code> and <code>getData()</code> functions can be found in this Github <a href="https://gist.github.com/gavinsimpson/8c13e3c5f905fd67cf85">gist</a>.
</p>
A new default plot for multivariate dispersions
Gavin L. Simpson
2016-04-17T00:00:00-06:00
2016-04-17T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/04/17/new-plot-default-for-betadisper/
<p>
This weekend, prompted by a pull request from Michael Friendly, I finally got round to improving the <code>plot</code> method for <code>betadisper()</code> in the <strong>vegan</strong> package. <code>betadisper()</code> is an implementation of Marti Anderson's <span class="smallcaps">Permdisp</span> method, a multivariate analogue of Levene's test for homogeneity of variances. In improving the default plot and allowing customisation of plot features, I was reminded of how much I dislike programming plot functions that use base graphics. But don't worry, this isn't going to degenerate into a <strong>ggplot</strong> love-in nor a <a href="http://varianceexplained.org/r/why-I-use-ggplot2/">David Robinson-esque dig</a> at <a href="http://simplystatistics.org/2016/02/11/why-i-dont-use-ggplot2/">Jeff Leek</a>.
</p>
<p>
The original <code>plot</code> method for <code>betadisper()</code> hardcoded all the linetypes, colours, etc. for features on the plot. I didn't mind this one bit; ordination plots are difficult to programme, and, to get anything half-way publishable, the user will usually need to build a plot up from component parts using the low-level tools we provide. Also, it's kind of a theme in <strong>vegan</strong> to provide a useful, but not necessarily pretty, default plot for our <code>plot</code> methods, whilst allowing for all manner of customisation via lower-level methods like <code>points()</code> and <code>lines()</code>, plus custom tools such as <code>ordiellipse()</code> and <code>ordiarrows()</code>.
</p>
<p>
However, in practice it seems users aren't always satisfied with this situation and expect default plots to be, well, <em>more</em>.
</p>
<p>
In its original incarnation, <code>plot.betadisper()</code> showed data points and group centroids embedded in a principal coordinates-derived Euclidean space, with convex hulls enclosing each group's data points and line segments joining data points with their respective centroid. Centroids were in red, segments blue, and hulls black, all of which were hard-coded. More egregiously, the plot didn't provide any indication of which group was which. I was OK with this as the principal coordinates plot was only really meant as a visualisation of what the method did; other plots and analyses that we provided in <strong>vegan</strong> were needed to assess significance of differences in dispersions etc.
</p>
<p>
There was nothing stopping me, however, from providing a more featureful version with full user control over the various aspects of the plot. Nothing, that is, except a deep reluctance to write – in the first place – and then subsequently maintain a function with a gabillion tortuously named arguments to differentiate the half dozen settings of <code>cex</code> <em>et al.</em> for different features.
</p>
<p>
There's a real trade-off between flexibility and complexity in <code>plot</code> methods like this. The situation is much easier to manage with lower-level functions to draw the individual features of the plot; invariably each lower-level tool requires a smaller subset of parameters, and if you code your function well, you can usually achieve all you need by passing <code>...</code> on to the low-level base graphics functions your function uses. You can't do this with a <code>plot</code> method that combines several lower-level features into a single plot; if you want to allow the user to independently control the colour of three separate plot features, you're going to need three different variations on the argument <code>col</code>. Multiply that by all the parameters you want to allow the user to tweak, and you have the recipe for a mess. Either that, or you need to accept lists of parameters for each feature, which aren't exactly intuitive for casual users.
</p>
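<p>
To make the contrast concrete, here is a sketch of the low-level idiom described above; <code>addCentroids()</code> is a hypothetical helper, but the pattern of forwarding <code>...</code> untouched to a single base graphics call is the real point
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## hypothetical low-level helper: one feature, one graphics call, so
## `...` (col, pch, cex, and friends) can be passed straight through
addCentroids <- function(x, choices = c(1, 2), ...) {
    points(x$centroids[, choices, drop = FALSE], ...)
}</code></pre>
</figure>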
<p>
With the new <code>plot.betadisper()</code> method I took a compromise position, allowing some additional flexibility whilst limiting the argument bloat that is an unfortunate side effect of high-level base graphics <code>plot</code> methods.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## you'll need the development version of vegan from github for this</span><span class="w">
</span><span class="c1">## devtools::install_github("vegandevs/vegan")</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"vegan"</span><span class="p">)</span><span class="w">
</span><span class="n">args</span><span class="p">(</span><span class="n">vegan</span><span class="o">:::</span><span class="n">plot.betadisper</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">function (x, axes = c(1, 2), cex = 0.7, pch = seq_len(ng), col = NULL,
lty = "solid", lwd = 1, hull = TRUE, ellipse = FALSE, ellipse.type = c("sd",
"se"), ellipse.conf = NULL, segments = TRUE, seg.col = "grey",
seg.lty = lty, seg.lwd = lwd, label = TRUE, label.cex = 1,
ylab, xlab, main, sub, ...)
NULL</code></pre>
</figure>
<p>
Michael Friendly <a href="https://github.com/vegandevs/vegan/pull/165">supplied code</a> to allow some of the original plotting parameters to take vectors, one per group, to facilitate their differentiation. I extended this to allow a couple more standard parameters to be set by the user. Rather than have separate settings for convex hulls and confidence ellipses, both use the same general parameters. Only the line segments between data points and their centroid get any special treatment, mainly because they add quite a lot of components to the plot and being able to style them to sit in the background is quite useful.
</p>
<p>
We'll look at the new plot using the main example in <code>?betadisper</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">data</span><span class="p">(</span><span class="n">varespec</span><span class="p">)</span><span class="w"> </span><span class="c1"># load example data </span><span class="w">
</span><span class="n">dis</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vegdist</span><span class="p">(</span><span class="n">varespec</span><span class="p">)</span><span class="w"> </span><span class="c1"># Bray-Curtis distances between samples</span><span class="w">
</span><span class="c1">## First 16 sites grazed, remaining 8 sites ungrazed</span><span class="w">
</span><span class="n">groups</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">16</span><span class="p">),</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">8</span><span class="p">)),</span><span class="w"> </span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"grazed"</span><span class="p">,</span><span class="s2">"ungrazed"</span><span class="p">))</span><span class="w">
</span><span class="n">mod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">betadisper</span><span class="p">(</span><span class="n">dis</span><span class="p">,</span><span class="w"> </span><span class="n">groups</span><span class="p">)</span><span class="w"> </span><span class="c1"># Calculate multivariate dispersions</span></code></pre>
</figure>
<p>
Given <code>mod</code>, the <code>plot</code> method produces a labelled plot with convex hulls and line segments
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/new-plot-default-for-betadisper-default-plot-1.png" alt="The new default plot produced by plot.betadisper()" />
<figcaption>
The new default plot produced by <code>plot.betadisper()</code>
</figcaption>
</figure>
<p>
Also at the <a href="https://github.com/vegandevs/vegan/issues/166">suggestion</a> of Michael Friendly, I added code to draw confidence ellipses, of which there are several flavours
</p>
<ul>
<li>
standard deviation ellipses
</li>
<li>
standard error ellipses
</li>
</ul>
<p>
with the default being to draw a 1 standard deviation ellipse (<code>ellipse.conf</code> controls how many standard deviations or standard errors are drawn, or which 1 - α confidence ellipse is drawn).
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">hull</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">ellipse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/new-plot-default-for-betadisper-plot-with-confidence-intervals-1.png" alt="An alternate plot produced by plot.betadisper() showing 1 standard deviation ellipses about the group medians." />
<figcaption>
An alternate plot produced by <code>plot.betadisper()</code> showing 1 standard deviation ellipses about the group medians.
</figcaption>
</figure>
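<p>
So, for example, a 95% confidence ellipse on the standard error scale can be requested using the arguments shown in the <code>args()</code> listing earlier; something like
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">plot(mod, hull = FALSE, ellipse = TRUE,
     ellipse.type = "se", ellipse.conf = 0.95)</code></pre>
</figure>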
<p>
As a default plot, the new version is a lot nicer and affords the user a reasonable level of flexibility to customise the plot without the number of arguments exploding uncontrollably. The code used to produce this is now a good deal more complex, and because I grafted it on to the existing code it probably isn't as clean or efficient as it could be.
</p>
<p>
The new function also reaffirms my dislike of providing high-level plot functions for a package that uses base graphics. As a means for producing plots, I like base graphics for certain things. However, I'm also comfortable building plots up from low-level parts and can easily write code to quickly produce the plot I want. Clearly, from the emails and questions I receive, not all the users of <code>betadisper()</code> are so able or inclined. Providing a reasonable level of customisation to a higher-level plot using base graphics is an exercise in tediousness and inelegance. It doesn't look <em>nice</em> to add dozens of arguments just to enable the user to tweak a dozen tiny features of the plot. I also find it demotivating writing code like this and the accompanying documentation.
</p>
<p>
In this regard, <strong>ggplot</strong> is a much better system for producing customisable higher-level plots. All of the code for handling grouping, colours, line types, etc. is built into aesthetics and geoms, and a theme or customised palette or scale (such as the increasingly popular one supplied by the <strong>viridis</strong> package) allows a concise and principled way of changing the look and feel of a plot that transfers across <em>all</em> plots created using <strong>ggplot</strong>. If you want to customise <code>plot.betadisper</code>'s output, you need to learn the half dozen particular arguments that I chose to implement. Yet once learned, are these skills useful elsewhere? If you're lucky, you can expect some semblance of consistency across a package, but beyond that, the user ends up having to learn the particulars of the plotting functions in each of the packages they end up using.
</p>
<p>
This is wasted effort and a considerable obstacle to overcome as a new R user. It's taken me a while – largely because on its own <strong>ggplot</strong> lacks features needed for every-day use by an academic – to realise this, but I'm glad I have. If anything, whilst I am pleased with the changes made to <code>plot.betadisper()</code>, my resolve to spend more time working on <strong>ggvegan</strong> over the summer has strengthened as a direct result of writing this base graphics code.
</p>
<p>
I never expected to find myself writing that…
</p>
LOESS revisited
Gavin L. Simpson
2016-04-10T00:00:00-06:00
2016-04-10T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/04/10/loess-revisited/
<p>
It's fair to say I have gotten a bee<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> in my bonnet about how palaeolimnologists handle time. For a group of people for whom time is everything, we sure do a poor job (in general) of dealing with it when it comes time to analyse our data. In many instances, "poor job" means making no attempt at all to account for the special nature of the time series. LOESS comes in for particular criticism because it is widely used by palaeolimnologists despite not being particularly suited to the task. Why this is so is perhaps down to its promotion in influential books, papers, and software. I am far from innocent in this regard, having taught LOESS and its use for many years on the now-defunct ECRC Numerical Course. Here I want to look at further problems with our use of LOESS, and will argue that we need to consign it to the trash can for all but exploratory analyses. I will begin the case for the prosecution with one of my own transgressions.
</p>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
an entire hive is perhaps more apt!<a href="#fnref1" class="footnote-back">ā©</a>
</p>
</li>
</ol>
</section>
<p>
Itās fair to say I have gotten a bee<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> in my bonnet about how palaeolimnologists handle time. For a group of people for whom time is everything, we sure do a poor job (in general) of dealing with it in when it comes time to analyse our data. In many instances, āpoor jobā means making no attempt at all to account for the special nature of the time series. LOESS comes in for particular criticism because it is widely used by palaeolimnologists despite not being particularly suited to the task. Why this is so is perhaps due to itās promotion in influential books, papers, and software. I am far from innocent in this regard having taught LOESS and itās use for many years on the now-defunct ECRC Numerical Course. Here I want to look at further problems with our use of LOESS, and will argue that we need to resign it to the trash can for all but exploratory analyses. I will begin the case for the prosecution with one of my own transgressions.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/dper-concensus-reconstruction.png" alt="Consensus reconstruction based upon the reconstructed pH values for the fossil samples in the Round Loch of Glenhead (RLGH3) core of all three reconstruction methods; one-component weighted averaging partial least squares model (WAPLS(1)), maximum likelihood (ML) and modern analogue technique (MAT)). The consensus reconstruction has been generated using a LOESS smoother fitted to the inferred pH values as a function of sample age with a span of 0.1. Reproduced from Figure 19.3 from Simpson and Hall (2012)." />
<figcaption>
Consensus reconstruction based upon the reconstructed pH values for the fossil samples in the Round Loch of Glenhead (RLGH3) core from all three reconstruction methods: one-component weighted averaging partial least squares (WAPLS(1)), maximum likelihood (ML), and the modern analogue technique (MAT). The consensus reconstruction has been generated using a LOESS smoother fitted to the inferred pH values as a function of sample age with a span of 0.1. Reproduced from Figure 19.3 from <span class="citation" data-cites="Simpson2012-zi">Simpson and Hall (2012)</span>.
</figcaption>
</figure>
<p>
The figure above comes from one of the chapters I wrote in the infamous Numerical Methods book in the Developments in Paleoenvironmental Research series <span class="citation" data-cites="Simpson2012-zi">(Simpson and Hall, 2012)</span>. The aim here was to show the common pattern in reconstructed pH using three different calibration methods. In my defence, this was intended as a diagnostic plot, but this may not have been clear in the text. I'd certainly be embarrassed if anyone took this usage to be any indication of how to go about hypothesis "testing" on the trend in reconstructed pH.
</p>
<p>
There are two things wrong with this plot/usage
</p>
<ol type="1">
<li>
The usual problem: no justification for the span used (0.1)
</li>
<li>
Failure to account for between-method variation
</li>
</ol>
<p>
If you are doing an exploratory analysis, the choice of span is somewhat arbitrary; it doesn't really matter what you use, and you might use several spans to get a feeling for potential features that may be present in the data. However, if you are planning on using the trends or features identified in this exploratory analysis to support some idea or hypothesis, then you're going to get into a world of trouble.
</p>
<p>
First, there's the potential for over-fitting. This is actually quite high with palaeo data, as all but the most skeletal of sequences will have some amount of autocorrelation; <a href="/2012/07/24/whats-wrong-with-loess-for-palaeo-data/">something I've covered before</a>.
</p>
<p>
Second, just because you get a particular "fit" using this span, it doesn't mean the identified features in the smoother are significant. Can they be distinguished from the noisy background? Answering this question requires estimates of the uncertainty in the fitted function and calculation of the derivatives of the fitted smooth curve. The first derivative of the fitted smooth is equivalent to the slope (or coefficient) of a simple linear regression. In this model, we assess whether the estimated slope is consistent with the null hypothesis of a trend of <strong>0</strong> using a <em>t</em> statistic, which is the value of the slope estimate divided by its uncertainty (the standard error). Conceptually, we can think of this as forming a 100(1 - α)% confidence interval and asking if 0 (the null hypothesis slope value) is included within this interval.
</p>
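<p>
If the linear regression analogy helps, here is a toy version with simulated data; the <em>t</em> statistic is just the estimate divided by its standard error, and <code>confint()</code> gives the interval we check 0 against
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">set.seed(1)
x <- 1:50
y <- 0.05 * x + rnorm(50)          # noisy linear trend
m <- lm(y ~ x)
coef(summary(m))["x", "t value"]   # slope estimate / standard error
confint(m, "x", level = 0.95)      # does this interval contain 0?</code></pre>
</figure>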
<p>
The equivalent for smoothers and splines is to compute the first derivative of the fitted smooth. Doing this analytically is often not straightforward, but we can use the method of finite differences to <a href="/2014/05/15/identifying-periods-of-change-with-gams/">approximate the first derivative of the smoother</a>. Using <a href="/2014/05/15/identifying-periods-of-change-with-gams/">standard errors of the derivative</a> or <a href="/2014/06/16/simultaneous-confidence-intervals-for-derivatives/">posterior simulation</a> we can compute confidence intervals on the derivatives and thus determine where along the curve there is sufficient evidence to reject the null hypothesis of no trend.
</p>
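<p>
A bare-bones sketch of the finite-difference step, reusing the toy data from the chunk above but refitted with a penalised spline, might be
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("mgcv")
m   <- gam(y ~ s(x), method = "REML")
eps <- 1e-4                                 # small shift in x
f0  <- predict(m, data.frame(x = x))
f1  <- predict(m, data.frame(x = x + eps))
deriv1 <- (f1 - f0) / eps                   # approximate first derivative
## confidence intervals on deriv1 need the posterior of the model,
## via predict(..., type = "lpmatrix"); see the linked posts</code></pre>
</figure>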
<p>
The linked posts explain this process and illustrate it using generalised additive models. The key point to remember, though, is that the model fitted to the data, even a LOESS one, is <em>uncertain</em>; it contains a degree of uncertainty because we've estimated things from the sample of data we happened to collect. As a result, it is inappropriate to simply interpret a fitted trend as is, without also considering the uncertainty in the estimation of the trend.
</p>
<p>
The other important thing that often gets overlooked is the <em>bias-variance trade-off</em>. If you fit a wiggly trend as compared to a smooth trend, all things being equal, the wiggly one will have lower bias and higher variance, and the smooth one higher bias and lower variance. Here, by <em>variance</em> we mean <em>uncertainty</em>; change the data a bit and high-variance fits will change a lot, hence the high uncertainty. With LOESS smoothers, low span values fit potentially high-variance, low-bias models. Invariably these will be over-fitted, and highly uncertain, unless there is a lot of data from which to estimate such a wiggly trend and you've properly accounted for the stochastic properties of the data, such as any autocorrelation.
</p>
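<p>
The trade-off is easy to see with two LOESS fits to the same toy data from above; the small span chases the noise while the large one smooths through it
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">loWiggly <- loess(y ~ x, span = 0.25) # low bias, high variance
loSmooth <- loess(y ~ x, span = 0.75) # higher bias, low variance
plot(x, y)
lines(x, fitted(loWiggly), col = "red")
lines(x, fitted(loSmooth), col = "blue")</code></pre>
</figure>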
<p>
The other problem with the consensus reconstruction in the above figure is the failure to account for the between-method variance and the correlation between fitted values derived using the same calibration method. Such problems are commonly handled with a mixed effects model, but as we only have three "subjects", that isn't an option here. Ideally then, we'd fit three separate trends, one per method, plus a separate mean for each method. Then we could compare this model with one that had a separate mean per method but just a single common trend. The key point to remember here is that the residuals should not contain much or any trace of a trend, nor of which method was used. In the figure above this is clearly not the case, and as a result it is difficult to do formal inference on the fitted smoother.
</p>
<p>
A more recent example of the latter point is <span class="citation" data-cites="Hobbs2016-mr">Hobbs et al. (2016)</span>. Below, I reproduce figures 6 and 9 from the paper <span class="citation" data-cites="Hobbs2016-mr">(Hobbs et al., 2016)</span>
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/hobbs-et-al-figure-6.png" alt="DCA axis 1 scores for all 19 lakes with diatom paleoecological records. LOESS smooth curve for each park area shows the general trend of diatom community turnover through time. Shaded bars represent the timing of significant shifts in the diatom assemblages (details in the supporting information). Reproduced from Figure 6 from Hobbs et al. (2016)." />
<figcaption>
DCA axis 1 scores for all 19 lakes with diatom paleoecological records. LOESS smooth curve for each park area shows the general trend of diatom community turnover through time. Shaded bars represent the timing of significant shifts in the diatom assemblages (details in the supporting information). Reproduced from Figure 6 from <span class="citation" data-cites="Hobbs2016-mr">Hobbs et al. (2016)</span>.
</figcaption>
</figure>
<p>
Here the problems of failing to account for core-specific trends are worse than in my earlier example. In Figure 6, <span class="citation" data-cites="Hobbs2016-mr">Hobbs et al. (2016)</span> show <acronym title="Detrended Correspondence Analysis">DCA</acronym> axis 1 scores for cores from parks around the Great Lakes, grouped at the park level such that each panel includes data from at least three different cores. The first problem here is that throughout the authors use LOESS but never state how they determined the span used in the figures. The second issue is that the reader can't unpack the site-specific trends because the data for each site aren't differentiated by plotting symbols or colour. Most importantly, however, we see clear evidence that the LOESS trend is different to some or all of the trends, or even the data, especially in the VOYA and SLBE panels. This is not so much showing a consensus as rewriting history entirely – the fitted trend in some places doesn't even go anywhere near the data! This is an ever-present problem with this kind of analysis.
</p>
<p>
Worse still is Figure 9 from the same paper <span class="citation" data-cites="Hobbs2016-mr">(Hobbs et al., 2016)</span>, shown below
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/hobbs-et-al-figure-9.png" alt="Sediment Ī“15N from all cores standardized as z scores. Loess smooth curve in red. Figure 9 from Hobbs et al. (2016)." />
<figcaption>
Sediment Ī“<sup>15</sup>N from all cores standardized as z scores. Loess smooth curve in red. Figure 9 from <span class="citation" data-cites="Hobbs2016-mr">Hobbs et al.Ā (2016)</span>.
</figcaption>
</figure>
<p>
This figure shows an impressive number of δ<sup>15</sup>N values of bulk organic matter from many cores across the study region. Whilst it is clear, if you look at the detail, that there are many more lower δ<sup>15</sup>N values around the turn of the 20<sup>th</sup> century than before, the individual trends in δ<sup>15</sup>N are all obfuscated by the presentation. It is not clear what the LOESS smoother is showing at all; as it is a scatterplot smoother, it is showing pattern in the data points irrespective of grouping at the core level. As such we can't expect it to be representative of a common trend at all, which is what the authors surely hoped it would be!
</p>
<p>
The z-score standardisation (centring and standardising each core to have zero mean and unit variance) used here also complicates the interpretation; the axis is no longer in δ<sup>15</sup>N values (‰) but in standard deviation units from each core's mean. By giving each core the same variance we actually gloss over differences in variance which might have ecological or environmental significance. It would be better to model these features explicitly.
</p>
<h2 id="a-solution">
A solution?
</h2>
<p>
It's all well and good being critical of my own work or that of others, but unless that critique comes with suggestions for ways to do better in the future, as a field we can't progress. So, what could be done to provide a better analysis in both these cases? Two things in particular spring to mind
</p>
<ol type="1">
<li>
fit an explicit model that includes terms mapped to the features of the data, and
</li>
<li>
properly estimate the degree of smoothness in the data/trend
</li>
</ol>
<p>
Here on this blog I've discussed ways to handle point 2<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>, and I have some additional thoughts based on new types of smoothers and ideas from spatial statistics that happen to fit in with the GAM approach and spline bases. These ideas form the basis of a paper I'm writing at the moment.
</p>
<p>
Point 1 could be handled in a variety of ways:
</p>
<ul>
<li>
<p>
Fit a stochastic time series model using either maximum likelihood methods or Bayesian estimation. Such models include state-space formulations of the classic ARIMA-type models, and they can account for site-specific effects, underlying latent trends that we have noisy observations of, and the irregular sampling and change of support<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a> inherent to most sediment core records.
</p>
</li>
<li>
<p>
For the consensus reconstruction example, a GAM with three separate trends, or a common trend plus three separate departure trends, would allow the explicit modelling of the features of interest. This is most easily achieved using <code>by</code> variable smooths in the <strong>mgcv</strong> package using <code>gam()</code>. If fitting a common trend and site-specific departures from this common trend, the site-specific departures need to be modelled using penalties on the first derivative (usually penalties are on the second derivative) to penalise departure from a flat function, which represents no departure from the common trend (see the sketch after this list).
</p>
</li>
<li>
<p>
For the Figure 6 example from <span class="citation" data-cites="Hobbs2016-mr">Hobbs et al. (2016)</span>, there are enough cores to potentially model them as random effects, again as site-specific trends or as a common trend plus site-specific departures. The random effect splines are an efficient way of fitting many trends, and can be fitted using the factor-smooth interaction basis (<code>s(time, fac, bs = "fs")</code> using <code>gam()</code> in <strong>mgcv</strong> for smooths of <code>time</code> for each level of factor <code>fac</code>; also sketched below) or via tensor product smooths combining a marginal smooth for <code>time</code> and a marginal random effect spline for each level of <code>fac</code>.
</p>
</li>
<li>
<p>
For the Figure 9 example, there are certainly enough cores to warrant a random effect spline approach as mentioned above.
</p>
</li>
</ul>
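<p>
To give a flavour of the <strong>mgcv</strong> formulations mentioned in the list above, here is a hedged sketch; <code>dat</code> is a hypothetical data frame with columns <code>value</code>, <code>time</code>, and a factor <code>core</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("mgcv")
## (i) common trend plus per-core difference smooths; m = 1 puts the
## penalty on the first derivative of the departures, as described above
m1 <- gam(value ~ core + s(time) + s(time, by = core, m = 1),
          data = dat, method = "REML")
## (ii) common trend plus random-effect-like core-specific smooths
## via the factor-smooth interaction basis
m2 <- gam(value ~ s(time) + s(time, core, bs = "fs"),
          data = dat, method = "REML")</code></pre>
</figure>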
<p>
This blog post is already long enough and I don't have time to go into specific details of fitting random effect splines, by-variable splines, or splines based on ideas from kriging here. In the next few months I'll write up posts on these methods as both areas are being developed into manuscripts; the random effect spline method is a collaboration with Eric Pedersen, David Miller, and Noam Ross.
</p>
<p>
In the consensus reconstruction and both of the <span class="citation" data-cites="Hobbs2016-mr">Hobbs et al. (2016)</span> examples, a strong argument can be made for modelling a common trend plus site-specific departures, because in both cases interest lies in trying to identify the common trend, and detailed site- or method-specific trends are of secondary concern.
</p>
<h2 id="whither-loess">
Whither LOESS?
</h2>
<p>
Where does this leave LOESS? I think it is clear that LOESS is perfectly acceptable as an <em>exploratory</em> method only. It makes few assumptions about the data, and because the user needs to specify a span/bandwidth parameter it allows for interactive investigation of a range of potential temporal trends of varying smoothness. As a more formal method for fitting models with which one can actually answer scientific questions, LOESS is far less useful. This isn't the fault of LOESS; it was designed as a scatterplot smoother, not for fitting multivariate time series models. The issue is rather our reliance on LOESS without understanding or acknowledging its deficiencies for actual model fitting.
</p>
<p>
The problem of arbitrary choices of span parameter in LOESS can be worked around with a cross-validation procedure suited to handling temporally autocorrelated data. But the multivariate time series issues I've discussed in detail here are less easily solved. It's not that they can't be solved; the original GAM software used LOESS smooths as part of the formal GAM procedure. But this software doesn't make it easy to fit common trend plus site-specific difference trends, as would be required for both the examples discussed above. The <code>gam()</code> function from <strong>mgcv</strong> does allow this to be done with relative ease, hence this approach is something I've been exploring. The Bayesian approaches are probably our best long-term solution to modelling palaeoecological data because of their flexibility. But that flexibility comes at a price: complexity.
</p>
<p>
And that brings me to my final point. As a field, palaeolimnology really needs to take training in quantitative methods more seriously, in particular modern methods such as the GAMs that I've found most useful, and Bayesian techniques in general. Where young palaeolimnologists get any training, it is most often in the traditional methods that were adopted from a time before we had real computing power available to us and before Statistics, the science, had developed methods to really handle the sorts of data we were generating. We are currently going through a revolution in the development of methods for use with multivariate ecological data and complex time series data. Palaeolimnologists risk being left behind here, and this worries me. A lot. I mainly worry because at best we are paying lip service to the deficiencies in the field in terms of our quantitative prowess. And it is beginning to show in the quality of science we do and the ways we try to answer important ecological and environmental questions.
</p>
<p>
I find this troubling indeed…
</p>
<h2 id="references" class="unnumbered">
References
</h2>
<div id="refs" class="references">
<div id="ref-Hobbs2016-mr">
<p>
Hobbs, W. O., Lafrancois, B. M., Stottlemyer, R., Toczydlowski, D., Engstrom, D. R., Edlund, M. B., et al. (2016). Nitrogen deposition to lakes in national parks of the western Great Lakes region: Isotopic signatures, watershed retention, and algal shifts. <em>Global Biogeochemical Cycles</em>, 2015GB005228. doi:<a href="https://doi.org/10.1002/2015GB005228">10.1002/2015GB005228</a>.
</p>
</div>
<div id="ref-Simpson2012-zi">
<p>
Simpson, G. L., and Hall, R. I. (2012). "Human impacts: Applications of numerical methods to evaluate surface-water acidification and eutrophication," in <em>Tracking environmental change using lake sediments</em>, Developments in Paleoenvironmental Research. (Springer Netherlands), 579–614. doi:<a href="https://doi.org/10.1007/978-94-007-2745-8_19">10.1007/978-94-007-2745-8_19</a>.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
an entire hive is perhaps more apt!<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
<li id="fn2">
<p>
<a href="/2011/06/12/additive-modelling-and-the-hadcrut3v-global-mean-temperature-series/">here</a>, <a href="/2011/07/21/smoothing-temporally-correlated-data/">here</a>, and <a href="/2016/03/25/additive-modeling-global-temperature-series-revisited/">here</a> for example<a href="#fnref2" class="footnote-back">ā©</a>
</p>
</li>
<li id="fn3">
<p>
If you think about what we record in our sediment samples, it is clear that this sequence is a highly modified version of the real per-unit-time sedimentation that occurred in the lake. Hence we wish to make inference on something we haven't actually observed directly. We can model the unobserved sequence as a latent trend polluted by noise. Because of compaction and bioturbation etc., each sediment slice represents a different amount of time. In other words, each observation is supported by contributions from one or more unit-time observations from the unobserved latent trend. Samples from 100 years ago might be supported by 4 years of observations from the latent process, but near the top of the core a single year from the latent process might be represented in each of the observations. This problem/feature is known as change of support.<a href="#fnref3" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
Soap-film smoothers & lake bathymetries
Gavin L. Simpson
2016-03-27T00:00:00-06:00
2016-03-27T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/03/27/soap-film-smoothers/
<p>
A number of years ago, whilst I was still working at <a href="http://www.ensis.org.uk/">ENSIS</a>, the consultancy arm of the <a href="http://www.ensis.org.uk/">ECRC</a> at <a href="http://www.ucl.ac.uk">UCL</a>, I worked on a project for the (then) Countryside Council for Wales (CCW; now part of <a href="http://naturalresources.wales">Natural Resources Wales</a>). I don't recall why they were doing this project, but we were tasked with producing a standardised set of bathymetric maps for Welsh lakes. The brief called for the bathymetries to be provided in standard GIS formats. Either CCW's project manager or the project lead at ENSIS had proposed to use <a href="https://en.wikipedia.org/wiki/Inverse_distance_weighting">inverse distance weighting</a> (IDW) to smooth the point bathymetric measurements. This probably stemmed from the person who initiated our bathymetric programme at ENSIS being a GIS wizard, schooled in the ways of ArcGIS. My involvement was mainly data processing of the IDW results. I was, however, at the time also somewhat familiar with the problem of <em>finite area smoothing</em><a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> and had read a paper of Simon Wood's on his then new soap-film smoother <span class="citation" data-cites="Wood2008-gy">(Wood et al., 2008)</span>. So, as well as writing scripts to process and present the IDW-based bathymetry data in the report, I snuck a task into the work programme that allowed me to investigate using soap-film smoothers for modelling lake bathymetric data. The timing was never great to write up this method (two children and a move to Canada have occurred since the end of this project), so I've not done anything with the idea. Until now…
</p>
<div id="refs" class="references">
<div id="ref-Wood2008-gy">
<p>
Wood, S. N., Bravington, M. V., and Hedley, S. L. (2008). Soap film smoothing. <em>Journal of the Royal Statistical Society. Series B, Statistical methodology</em> 70, 931ā955. doi:<a href="https://doi.org/10.1111/j.1467-9868.2008.00665.x">10.1111/j.1467-9868.2008.00665.x</a>.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
smoothing over a domain with known boundaries, like a lake<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
<p>
In this post, I want to introduce the concept of finite area smoothing and illustrate the use of soap-film smoothers in modelling lake bathymetric data.
</p>
<h2 id="finite-area-smoothing">
Finite area smoothing
</h2>
<p>
Often, we seek to model a response over a well-defined region with a known boundary. This problem is known as <em>finite area smoothing</em>, or, as Ramsay put it, <em>smoothing over difficult regions</em> <span class="citation" data-cites="Ramsay2002-mv">(2002)</span>. Why this problem is more difficult than it sounds is well illustrated by the test function introduced by <span class="citation" data-cites="Ramsay2002-mv">Ramsay (2002)</span>, a version of which is shown below<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span><span class="w">
</span><span class="n">fsb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fs.boundary</span><span class="p">()</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">300</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">150</span><span class="w">
</span><span class="n">xm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m</span><span class="p">)</span><span class="w">
</span><span class="n">yn</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">xx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">xm</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">yy</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">yn</span><span class="p">,</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">tru</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">fs.test</span><span class="p">(</span><span class="n">xx</span><span class="p">,</span><span class="w"> </span><span class="n">yy</span><span class="p">),</span><span class="w"> </span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="c1">## truth</span><span class="w">
</span><span class="n">truth</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xx</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">yy</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.vector</span><span class="p">(</span><span class="n">tru</span><span class="p">))</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"viridis"</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">truth</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_raster</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_contour</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">z</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">),</span><span class="w"> </span><span class="n">binwidth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"white"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">fsb</span><span class="p">),</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_viridis</span><span class="p">(</span><span class="n">na.value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NA</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"top"</span><span class="p">,</span><span class="w"> </span><span class="n">legend.key.width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unit</span><span class="p">(</span><span class="m">2.5</span><span class="p">,</span><span class="w"> </span><span class="s2">"cm"</span><span class="p">))</span><span class="w">
</span><span class="n">p</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/soap-film-smoothers-fs-boundary-figure-1.png" alt="Ramsay's test function" />
<figcaption>
Ramsay's test function
</figcaption>
</figure>
<p>
The domain of the test function is a rotated U shape. Each stem of the U has quite different values of the response, achieved by smoothly varying the response along the U itself. Between the two stems is a barrier in the spatial domain. Smoothing across this barrier would bleed information from one side to the other, which would lead to poorly predicted values. One solution to the problem of smoothing inside domains such as the one shown is to consider only distances between points within the domain, not distances over some bounding box of the problem. In other words, we shouldn't assume points on either side of the barrier in the test function are similar just because they are close in the <em>y</em> coordinate.
</p>
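<p>
To see why Euclidean proximity is misleading here, we can evaluate the test function at two points that face each other across the barrier. This is a minimal sketch using mgcv's built-in <code>fs.test()</code>; the coordinates are simply ones I have assumed fall inside each arm of the U.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("mgcv")
## two points a short Euclidean distance apart, but on opposite arms of the U
f_upper <- fs.test(2.5,  0.6) # inside the upper arm
f_lower <- fs.test(2.5, -0.6) # inside the lower arm
## the straight-line distance between the points is small...
sqrt((2.5 - 2.5)^2 + (0.6 - (-0.6))^2)
## ...yet the true values of the test function differ markedly
c(upper = f_upper, lower = f_lower)</code></pre>
</figure>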
<h2 id="soap-film-smoothers">
Soap-film smoothers
</h2>
<p>
Bubble artists can do some amazing things with a few props and copious amounts of soapy solution. If you've ever seen a bubble artist perform, you'll never look at the little bottles of bubbles that kids use to blow simple round bubbles in the same way again. Whilst soap-film smoothers aren't quite as amazing as the soapy wonders produced by bubble artists, how they work is directly related to one form of bubble art<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a>.
</p>
<p>
If we start from the simple kids-toy version for blowing round bubbles, then you'll know that there is a small loop within which an exceedingly thin film of soapy liquid is contained. Blowing through the loop deforms the soapy film, and if you blow gently, eventually you can deform the film enough that it detaches from the loop and forms a perfect, iridescent, soapy ball of fun<a href="#fn4" class="footnote-ref" id="fnref4"><sup>4</sup></a>. Bubble artists employ more complex loops, but the principle remains the same and, in essence, this is <em>exactly</em> how soap-film smoothers work.
</p>
<p>
Return for a moment to Ramsay's test function shown above. The loop is formed by the boundary of the domain. Imagine a soapy film suspended within this loop, and further imagine that we can somehow blow over the region to deform the film in such a way as to move the film towards the data. In the test function above, we'd need to "blow" on the film so that it deformed towards us in the upper stem of the U, and away from us in the lower stem (assuming that we're mapping the data values to the z coordinate). Quite a lot of complexity underlies exactly <em>how</em> the soap-film smoother achieves this, but the general principle is exceedingly simple.
</p>
<p>
Soap-film smoothers comprise two separate types of smoother: one for the boundary and one for the film itself. The boundary smoother is often a cyclic spline, in order to have the ends of the spline join nicely at the "end points" of the boundary. If the value of the response at the boundary is known, such as lake depth being zero at the margin of the lake, then the boundary can be fixed at these values without needing a spline to model values on the boundary. If the response is not known at the boundary, it can be estimated using the boundary spline.
</p>
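<p>
To make the two boundary options concrete, here is a minimal sketch of how a soap-film boundary is specified in <strong>mgcv</strong>; the square loop and the names <code>bx</code>, <code>by</code>, <code>bnd_fixed</code>, and <code>bnd_free</code> are purely illustrative.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("mgcv")
## an illustrative closed boundary loop (first and last points coincide)
bx <- c(0, 1, 1, 0, 0)
by <- c(0, 0, 1, 1, 0)
## known boundary: the response is fixed at 0 on the boundary via `f`
bnd_fixed <- list(list(x = bx, y = by, f = rep(0, length(bx))))
## unknown boundary: omit `f` and the boundary values are estimated instead
bnd_free <- list(list(x = bx, y = by))</code></pre>
</figure>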
<h2 id="lake-bathymetric-data">
Lake bathymetric data
</h2>
<p>
What do soap films have to do with lake bathymetric data? Basically, the problem of modelling depth soundings is exactly the same as the one illustrated by Ramsay's test function. We have a well-defined boundary<a href="#fn5" class="footnote-ref" id="fnref5"><sup>5</sup></a>, and all but the most simple lakes have shoreline features that we don't want to smooth across, such as peninsulas<a href="#fn6" class="footnote-ref" id="fnref6"><sup>6</sup></a>.
</p>
<p>
The figure below shows lake depth soundings from the <a href="https://en.wikipedia.org/wiki/Cosmeston_Lakes_Country_Park">Cosmeston Lakes</a>, two now-flooded former quarries joined by a narrow channel.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"rgdal"</span><span class="p">)</span><span class="w">
</span><span class="c1">## Update this if I can post the Cosmeston data</span><span class="w">
</span><span class="n">dataDIR</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"/home/gavin/work/projects/ccw/data/CCW_Final_Data/42721_Cosmeston_Park/."</span><span class="w">
</span><span class="n">outline</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readOGR</span><span class="p">(</span><span class="n">dataDIR</span><span class="p">,</span><span class="w"> </span><span class="s2">"42721_Cosmeston_Lake_lake_polyline"</span><span class="p">)</span><span class="w">
</span><span class="n">depth</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readOGR</span><span class="p">(</span><span class="n">dataDIR</span><span class="p">,</span><span class="w"> </span><span class="s2">"d17_42721_xyz"</span><span class="p">)</span><span class="w">
</span><span class="n">foutline</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fortify</span><span class="p">(</span><span class="n">outline</span><span class="p">)</span><span class="w">
</span><span class="n">fdepth</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">depth</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">foutline</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lat</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fdepth</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_y</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">depth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_fixed</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Northing"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Easting"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_color_viridis</span><span class="p">()</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/soap-film-smoothers-show-comeston-lakes-baty-data-1.png" alt="Cosmeston Lakes depth sounding data" />
<figcaption>
Cosmeston Lakes depth sounding data
</figcaption>
</figure>
<p>
This example is very similar to that of Ramsay's test function. We don't want to smooth across the narrow peninsula because there is no reason to presume the bed topography is the same on either side.
</p>
<h2 id="additive-models-for-lake-bathymetry-data">
Additive models for lake bathymetry data
</h2>
<p>
If we weren't worried about the boundary, we could use a thin plate regression spline (TPRS) smoother to model how depth varies spatially. The TPRS basis is perfect for this as the <code>x</code> and <code>y</code> data are in the same units. Hence a simple GAM would seem OK, if we weren't worried about those pesky boundaries.
</p>
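<p>
As an aside (a sketch only, not something these data need), if the coordinates were in different units the isotropic TPRS would be a poor choice; a scale-invariant tensor product smooth is the usual alternative. The object name <code>m_te</code> is hypothetical.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## only relevant if os_x and os_y were measured in different units
m_te <- gam(-depth ~ te(os_x, os_y, k = c(10, 10)),
            data = depth, method = "REML")</code></pre>
</figure>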
<p>
The wrong thing to do, then, would be the following, which ignores the lake boundary information of zero depths.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span><span class="w">
</span><span class="n">crds</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coordinates</span><span class="p">(</span><span class="n">outline</span><span class="p">)[[</span><span class="m">1</span><span class="p">]][[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="n">tprs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="o">-</span><span class="n">depth</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">os_y</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">60</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">depth</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">tprs</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
-depth ~ s(os_x, os_y, k = 60)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.35075 0.07813 -68.49 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(os_x,os_y) 41.45 51.06 19.08 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.787 Deviance explained = 82%
-REML = 491.88 Scale est. = 1.6175 n = 265</code></pre>
</figure>
<p>
The fitted smoother uses about 40 degrees of freedom and explains about 80% of the variance in the observed depths. To visualise the fitted surface, I create a data set of x and y coordinates over the bounding box of the spatial data. At this stage I'm not going to remove any of the prediction points that lie outside the lake, as I want to show what the TPRS smoother is doing. The code basically
</p>
<ul>
<li>
sets up a 2.5 meter-resolution grid in the x and y directions
</li>
<li>
predicts from the model at each location
</li>
<li>
creates a temporary version of the predictions, setting all depths > 0 to <code>NA</code>, which removes some distracting behaviour far from the support of the observations.
</li>
</ul>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">grid.x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">tprs</span><span class="o">$</span><span class="n">var.summary</span><span class="p">,</span><span class="w">
</span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">crds</span><span class="p">[,</span><span class="m">1</span><span class="p">])),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">crds</span><span class="p">[,</span><span class="m">1</span><span class="p">])),</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2.5</span><span class="p">))</span><span class="w">
</span><span class="n">grid.y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">tprs</span><span class="o">$</span><span class="n">var.summary</span><span class="p">,</span><span class="w">
</span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">os_y</span><span class="p">,</span><span class="w"> </span><span class="n">crds</span><span class="p">[,</span><span class="m">2</span><span class="p">])),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">os_y</span><span class="p">,</span><span class="w"> </span><span class="n">crds</span><span class="p">[,</span><span class="m">2</span><span class="p">])),</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2.5</span><span class="p">))</span><span class="w">
</span><span class="n">pdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">tprs</span><span class="o">$</span><span class="n">var.summary</span><span class="p">,</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">os_x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">grid.x</span><span class="p">,</span><span class="w"> </span><span class="n">os_y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">grid.y</span><span class="p">))</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">pdata</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"os_x"</span><span class="p">,</span><span class="s2">"os_y"</span><span class="p">)</span><span class="w">
</span><span class="c1">##predictions</span><span class="w">
</span><span class="n">pdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdata</span><span class="p">,</span><span class="w"> </span><span class="n">Depth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">tprs</span><span class="p">,</span><span class="w"> </span><span class="n">pdata</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"response"</span><span class="p">))</span><span class="w">
</span><span class="n">tmp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pdata</span><span class="w"> </span><span class="c1"># temporary version...</span><span class="w">
</span><span class="n">take</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">tmp</span><span class="p">,</span><span class="w"> </span><span class="n">Depth</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="c1"># getting rid of > 0 depth points</span><span class="w">
</span><span class="n">tmp</span><span class="o">$</span><span class="n">Depth</span><span class="p">[</span><span class="n">take</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="kc">NA</span></code></pre>
</figure>
<p>
The TPRS fitted surface is plotted with the observed data using
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">foutline</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lat</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_raster</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tmp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_y</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Depth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fdepth</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_y</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_fixed</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Northing"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Easting"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_viridis</span><span class="p">(</span><span class="n">na.value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NA</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/soap-film-smoothers-plot-tprs-model-fit-1.png" alt="Predicted depths over the bounding box of the observations from the TPRS smoother GAM." />
<figcaption>
Predicted depths over the bounding box of the observations from the TPRS smoother GAM.
</figcaption>
</figure>
<p>
I've purposely done a poor visualisation job<a href="#fn7" class="footnote-ref" id="fnref7"><sup>7</sup></a> in the above figure as I wanted to show how the TPRS smoother bleeds information across the peninsula. Ignore the predictions off into the top left & bottom right: concentrate on the peninsula. The TPRS spline is smoothing <code>depth</code> across this region, exactly what we don't want. It's almost as if the peninsula isn't there.
</p>
<p>
Next we'll fit the soap-film smoother version. I'll take this one a bit slower as we have some work to do to set up the boundary and knot locations that the smoother needs.
</p>
<p>
For lake bathymetries we have two set-up jobs to complete
</p>
<ol type="1">
<li>
create a boundary object, with known value of <code>0</code>
</li>
<li>
choose the number of knots and their locations over the domain of interest
</li>
</ol>
<p>
The second is, in my experience, most easily achieved by using the <em>list</em> form of the allowed boundary specifications<a href="#fn8" class="footnote-ref" id="fnref8"><sup>8</sup></a>. The list form for the boundary is a list within a list. Each sublist has at least <strong>two</strong> elements containing the x and y coordinates of the boundary polygon. A component <code>f</code> may also be included, which sets the boundary condition at each location; here we set this to <code>0</code> to indicate that the depth tends to <code>0</code> at the lake shore. In the code below I create this from the <code>coordinates()</code> object created earlier.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">bound</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">crds</span><span class="p">[,</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">crds</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">f</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">crds</span><span class="p">))))</span></code></pre>
</figure>
<p>
Choosing the number and location of knots is trickier, especially if you are trying to automate this for a large number of lakes. The key requirement is that any knots are contained entirely <em>within</em> the lake boundary. <strong>mgcv</strong> provides the <code>inSide()</code> function to facilitate this. Unfortunately, <code>inSide()</code> doesn't provide <em>exactly</em> the same check for being inside the boundary as the one used by the soap-film smooth constructor called when you fit the model. The procedure I outline below is the one I've found most useful to date, but I make no guarantee that it is optimal nor that it will work for your data problem<a href="#fn9" class="footnote-ref" id="fnref9"><sup>9</sup></a>.
</p>
<p>
Here I choose to create a 10 by 10 regular grid of locations over the bounding box of the coordinates. From this grid I retain those points that are contained within the lake boundary.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="n">gx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">crds</span><span class="p">[,</span><span class="m">1</span><span class="p">]),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">crds</span><span class="p">[,</span><span class="m">1</span><span class="p">]),</span><span class="w"> </span><span class="n">len</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">N</span><span class="p">)</span><span class="w">
</span><span class="n">gy</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">crds</span><span class="p">[,</span><span class="m">2</span><span class="p">]),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">crds</span><span class="p">[,</span><span class="m">2</span><span class="p">]),</span><span class="w"> </span><span class="n">len</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">N</span><span class="p">)</span><span class="w">
</span><span class="n">gp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">gx</span><span class="p">,</span><span class="w"> </span><span class="n">gy</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">gp</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"x"</span><span class="p">,</span><span class="s2">"y"</span><span class="p">)</span><span class="w">
</span><span class="n">knots</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gp</span><span class="p">[</span><span class="n">with</span><span class="p">(</span><span class="n">gp</span><span class="p">,</span><span class="w"> </span><span class="n">inSide</span><span class="p">(</span><span class="n">bound</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)),</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">knots</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"os_x"</span><span class="p">,</span><span class="w"> </span><span class="s2">"os_y"</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">bound</span><span class="p">[[</span><span class="m">1</span><span class="p">]])</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"os_x"</span><span class="p">,</span><span class="w"> </span><span class="s2">"os_y"</span><span class="p">,</span><span class="w"> </span><span class="s2">"f"</span><span class="p">)</span></code></pre>
</figure>
<p>
The last two lines set the boundary and knot names to match the variable names in the depth data used to fit the model.
</p>
<p>
The choice of 10 for the sides of the grid is useful here as that puts enough points within the lake for the knots of the smoother, but doesn't require any nudging of the grid to get the selected points to fall nicely within the boundary. In other examples, I've needed to tailor the number of points in the grid and shift it by a few meters to get as many of the regular grid points as possible to fall inside the boundary. You may even find that you need to locate the knots individually. Using <code>locator()</code> after plotting the lake outline is an expedient, but entirely manual, way to do this if you have to; see the sketch below.
</p>
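<p>
For the fully manual route, something like the following works in an interactive session. This is a sketch only: <code>locator()</code> needs a graphics device and a patient human, and the name <code>knots_manual</code> is hypothetical (you would pass it to the <code>knots</code> argument of <code>gam()</code> in place of the grid-based knots).
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## draw the outline, then click once per knot; right-click (or Esc) to finish
plot(crds[, 1], crds[, 2], type = "l", asp = 1,
     xlab = "Easting", ylab = "Northing")
picked <- locator(type = "p")
knots_manual <- data.frame(os_x = picked$x, os_y = picked$y)
## sanity check: every hand-picked knot must lie inside the boundary
stopifnot(all(with(knots_manual, inSide(bound, os_x, os_y))))</code></pre>
</figure>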
<p>
What the grid-based selection process looks like is shown in the figure below
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/soap-film-smoothers-plot-soap-film-set-up-1.png" alt="Illustration of the knot selection procedure. The large circles are the locations of the sparse regular grid of points over the bounding box of the data. The filled red circles are those grid points that are found inside the lake boundary and thus chosen as knots for the soap-film smoother. The small black dots are the locations of the observed depth data." />
<figcaption>
Illustration of the knot selection procedure. The large circles are the locations of the sparse regular grid of points over the bounding box of the data. The filled red circles are those grid points that are found inside the lake boundary and thus chosen as knots for the soap-film smoother. The small black dots are the locations of the observed depth data.
</figcaption>
</figure>
<p>
Fitting the soap-film model is quite similar to any other GAM you may have fitted with <strong>mgcv</strong>. The main exception is that you have to pass something to the <code>xt</code> argument of <code>s()</code>. If you delve into some of the more complex smoothers that have become available in <strong>mgcv</strong> in recent releases, you'll find yourself using <code>xt</code> a lot, as it is the way to pass extra information to the basis constructor functions.
</p>
<p>
For soap-film smoothers you must pass <code>xt</code> a list with component <code>bnd</code> set to an appropriate boundary object (here <code>bound</code>, as created earlier). The knots that were created earlier need to be passed to the <code>knots</code> argument. The full call to <code>gam()</code> is shown below; the soap-film basis is specified using <code>bs = "so"</code>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="o">-</span><span class="n">depth</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">os_y</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"so"</span><span class="p">,</span><span class="w"> </span><span class="n">xt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">bnd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">bound</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">depth</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span></code></pre>
</figure>
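<p>
If the soap constructor errors because a knot falls on or outside the boundary (the <code>inSide()</code> mismatch mentioned earlier), one pragmatic fallback, a heuristic of my own rather than anything built into <strong>mgcv</strong>, is to pull the knots in slightly towards their centroid and refit.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## shrink the knot grid towards its centroid by a small factor
shrink_knots <- function(kn, by = 0.98) {
    ctr <- colMeans(kn)
    data.frame(os_x = ctr[1] + by * (kn$os_x - ctr[1]),
               os_y = ctr[2] + by * (kn$os_y - ctr[2]))
}
## try the fit as-is; on failure, retry with the shrunken knots
m2 <- tryCatch(
    gam(-depth ~ s(os_x, os_y, bs = "so", xt = list(bnd = bound)),
        data = depth, method = "REML", knots = knots),
    error = function(e)
        gam(-depth ~ s(os_x, os_y, bs = "so", xt = list(bnd = bound)),
            data = depth, method = "REML", knots = shrink_knots(knots)))</code></pre>
</figure>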
<p>
The soap-film smoother explains just over 75% of the variance in the data, using just under 30 degrees of freedom. It doesn't explain quite as much variance as the TPRS model I looked at earlier, but it is substantially simpler in terms of degrees of freedom (~30 vs ~40 respectively).
</p>
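<p>
A quick programmatic check of that complexity/fit trade-off, as a sketch using the two model objects from above:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## total estimated degrees of freedom and deviance explained, side by side
data.frame(model    = c("TPRS", "Soap-film"),
           edf      = c(sum(tprs$edf), sum(m2$edf)),
           dev.expl = c(summary(tprs)$dev.expl, summary(m2)$dev.expl))</code></pre>
</figure>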
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m2</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
-depth ~ s(os_x, os_y, bs = "so", xt = list(bnd = bound))
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.275 0.204 -16.05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(os_x,os_y) 27.27 38 19.95 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.742 Deviance explained = 76.8%
-REML = 501.32 Scale est. = 1.9607 n = 265</code></pre>
</figure>
<p>
Soap-film GAMs come with their own <code>plot()</code> method
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">lims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">crds</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">range</span><span class="p">)</span><span class="w">
</span><span class="n">ylim</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lims</span><span class="p">[,</span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">xlim</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lims</span><span class="p">[,</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">m2</span><span class="p">,</span><span class="w"> </span><span class="n">asp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ylim</span><span class="p">,</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xlim</span><span class="p">,</span><span class="w"> </span><span class="n">se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">scheme</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/soap-film-smoothers-soap-film-plot-method-1.png" alt="Contour plot of the fitted soap-film spline produced using plot.gam() with scheme = 2." />
<figcaption>
Contour plot of the fitted soap-film spline produced using <code>plot.gam()</code> with <code>scheme = 2</code>.
</figcaption>
</figure>
<p>
Notice how the contours of the fitted soap-film surface run parallel to the peninsular shoreline, just as we'd expect from studying lakes. We'll return to this momentarily.
</p>
<p>
As we aren't in the business of drawing pictures<a href="#fn10" class="footnote-ref" id="fnref10"><sup>10</sup></a>
</p>
<blockquote class="twitter-tweet" data-lang="en" align="center">
<p lang="en" dir="ltr">
If you want to draw pictures, base graphics is better than ggplot2. But most people don't want to draw pictures with <a
href="https://twitter.com/hashtag/rstats?src=hash">#rstats</a>
</p>
– Hadley Wickham (@hadleywickham) <a href="https://twitter.com/hadleywickham/status/712336453317963776">March 22, 2016</a>
</blockquote>
<p>
I should plot the fitted model using <code>ggplot()</code>. Note that the <code>predict()</code> step here is slow; I could probably speed it up a lot by removing all the points that lie outside the lake boundary (see below), because we already know those points will be <code>NA</code> and hence dropped from any plotting or subsequent analysis.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pdata2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdata</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">Depth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m2</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdata</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">foutline</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lat</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_raster</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdata2</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_y</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Depth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fdepth</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_y</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_fixed</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Northing"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Easting"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_viridis</span><span class="p">(</span><span class="n">na.value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NA</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/soap-film-smoothers-ggplot-soap-film-1.png" alt="The fitted surface achieved using a soap-film smoother" />
<figcaption>
The fitted surface achieved using a soap-film smoother
</figcaption>
</figure>
<p>
The first thing to notice is that the <code>predict()</code> method automatically sets points outside the boundary to <code>NA</code> for soap-film smoother models: you have to do this manually with the other types of smoother.
</p>
<p>
The main improvement in the soap-film model is the performance of the fitted depth surface around the peninsula. Notice how, on the right of the peninsula, the depth lessens towards the shoreline, and on the left the depth increases from 0 away from the peninsula. Importantly, however, the deeper points on the right are not leaking information across the peninsula.
</p>
<p>
We could have achieved a better fit with the TPRS model by including the boundary coordinates in the <code>depth</code> data, with depths of <code>0</code>. This would have improved the performance around the edge of the lake, but it wouldn't have had the same effect as the soap-film smoother around the peninsula. Why so? Well, in the soap-film model we set the values of the boundary to be zero, and the soap film smooths from the data points to those known values but won't smooth across the boundary of the domain. The TPRS model, however, would treat the 0 depth values differently: in simple terms, it will smooth through the values, not to them. Hence the spline will get pulled towards zero somewhat, but it will still be "averaging" the depth data from a local region around the peninsula, information which includes the deeper data we don't want to leak.
</p>
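<p>
To make that concrete, here is roughly what the augmented TPRS fit would look like; a sketch only, assuming <code>fdepth</code> contains columns <code>os_x</code>, <code>os_y</code>, and <code>depth</code> as used in the plots above, and with the names <code>shore</code>, <code>aug</code>, and <code>tprs_bnd</code> invented for illustration.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## append the shoreline as zero-depth pseudo-observations and refit
shore <- data.frame(os_x = crds[, 1], os_y = crds[, 2], depth = 0)
aug <- rbind(fdepth[, c("os_x", "os_y", "depth")], shore)
tprs_bnd <- gam(-depth ~ s(os_x, os_y, k = 60), data = aug, method = "REML")</code></pre>
</figure>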
<p>
To help compare the two surfaces, I do a little more data munging to remove TPRS points outside the lake boundary and combine them with the soap-film data.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">inlake</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pdata</span><span class="p">,</span><span class="w"> </span><span class="n">inSide</span><span class="p">(</span><span class="n">bound</span><span class="p">,</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">os_y</span><span class="p">))</span><span class="w">
</span><span class="n">pdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pdata</span><span class="p">[</span><span class="n">inlake</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="n">pdata2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">rbind</span><span class="p">(</span><span class="n">pdata</span><span class="p">,</span><span class="w"> </span><span class="n">pdata2</span><span class="p">),</span><span class="w">
</span><span class="n">Model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"TPRS"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Soap-film"</span><span class="p">),</span><span class="w">
</span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">pdata</span><span class="p">),</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">pdata2</span><span class="p">))))</span><span class="w">
</span><span class="c1">## let's drop the NAs from the Soap-film too...</span><span class="w">
</span><span class="n">take</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pdata2</span><span class="p">,</span><span class="w"> </span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">Depth</span><span class="p">))</span><span class="w">
</span><span class="n">pdata2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pdata2</span><span class="p">[</span><span class="n">take</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="n">poutline</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">rbind</span><span class="p">(</span><span class="n">foutline</span><span class="p">,</span><span class="w"> </span><span class="n">foutline</span><span class="p">),</span><span class="w">
</span><span class="n">Model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"TPRS"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Soap-film"</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">foutline</span><span class="p">)))</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">poutline</span><span class="p">)[</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"os_x"</span><span class="p">,</span><span class="w"> </span><span class="s2">"os_y"</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">poutline</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_raster</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdata2</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_y</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Depth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fdepth</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_y</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_fixed</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Northing"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Easting"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_viridis</span><span class="p">(</span><span class="n">na.value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NA</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Model</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"top"</span><span class="p">,</span><span class="w"> </span><span class="n">legend.key.width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unit</span><span class="p">(</span><span class="m">2.5</span><span class="p">,</span><span class="w"> </span><span class="s2">"cm"</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/soap-film-smoothers-combined-plot-1.png" alt="Comparison of fitted depth surfaces for the soap-film and TPRS smoother models" />
<figcaption>
Comparison of fitted depth surfaces for the soap-film and TPRS smoother models
</figcaption>
</figure>
<p>
The effect is subtle in these plots, but the differences between the two models are clear. Most importantly, the leakage of information across the peninsula, clearly visible in the TPRS model, is removed in the soap-film version.
</p>
<p>
Soap-film smoothers are not the only way to approach finite area smoothing. David Miller did his PhD with Simon Wood and developed the generalised distance spline approach to the finite area smoothing problem <span class="citation" data-cites="Miller2014-kb">(Miller and Wood, 2014)</span>, and Ramsay introduced his FELSPLINE method <span class="citation" data-cites="Ramsay2002-mv">(Ramsay, 2002)</span>. I've not had a chance to investigate David's generalised distance spline method yet, but if I do, I'll no doubt write a post comparing the results with the soap-film method.
</p>
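<p>
If you want to experiment with finite area smoothers yourself, Ramsay's horseshoe test function ships with <strong>mgcv</strong> (see <code>?fs.test</code>), which gives you a domain with a known true surface to play with. Below is a minimal sketch of fitting a soap-film smooth to that test function; the knot grid, <code>k</code>, and object names are illustrative choices of mine, not tuned values.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("mgcv")

## Ramsay's horseshoe domain as a single boundary loop
bnd <- list(fs.boundary())

## simulate noisy observations of the test function inside the domain
set.seed(1)
n <- 600
x <- runif(n) * 5 - 1
y <- runif(n) * 2 - 1
inside <- inSide(bnd, x = x, y = y)    # drop points outside the boundary
x <- x[inside]; y <- y[inside]
z <- fs.test(x, y, b = 1) + rnorm(length(x)) * 0.3

## interior knots on a coarse grid (illustrative, not tuned)
knots <- data.frame(x = rep(seq(-0.5, 3, by = 0.5), 4),
                    y = rep(c(-0.6, -0.3, 0.3, 0.6), rep(8, 4)))

## soap-film smooth: boundary goes in 'xt', interior knots in 'knots'
m_soap <- gam(z ~ s(x, y, k = 30, bs = "so", xt = list(bnd = bnd)),
              knots = knots)
plot(m_soap)</code></pre>
</figure>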
<h2 id="references" class="unnumbered">
References
</h2>
<div id="refs" class="references">
<div id="ref-Miller2014-kb">
<p>
Miller, D. L., and Wood, S. N. (2014). Finite area smoothing with generalized distance splines. <em>Environmental and Ecological Statistics</em> 21, 715–731. doi:<a href="https://doi.org/10.1007/s10651-014-0277-4">10.1007/s10651-014-0277-4</a>.
</p>
</div>
<div id="ref-Ramsay2002-mv">
<p>
Ramsay, T. (2002). Spline smoothing over difficult regions. <em>Journal of the Royal Statistical Society. Series B, Statistical Methodology</em> 64, 307–319. doi:<a href="https://doi.org/10.1111/1467-9868.00339">10.1111/1467-9868.00339</a>.
</p>
</div>
<div id="ref-Wood2008-gy">
<p>
Wood, S. N., Bravington, M. V., and Hedley, S. L. (2008). Soap film smoothing. <em>Journal of the Royal Statistical Society. Series B, Statistical Methodology</em> 70, 931–955. doi:<a href="https://doi.org/10.1111/j.1467-9868.2008.00665.x">10.1111/j.1467-9868.2008.00665.x</a>.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
smoothing over a domain with known boundaries, like a lake<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
<li id="fn2">
<p>
See the example in <code>?fs.test</code> after loading package <strong>mgcv</strong><a href="#fnref2" class="footnote-back">↩</a>
</p>
</li>
<li id="fn3">
<p>
and soap-film smooths are pretty damn cool all the same!<a href="#fnref3" class="footnote-back">↩</a>
</p>
</li>
<li id="fn4">
<p>
or a soapy, sticky mess depending upon your point of view…<a href="#fnref4" class="footnote-back">↩</a>
</p>
</li>
<li id="fn5">
<p>
ignoring the fact that lake levels often rise and fall through the year or over years.<a href="#fnref5" class="footnote-back">↩</a>
</p>
</li>
<li id="fn6">
<p>
because topography<a href="#fnref6" class="footnote-back">↩</a>
</p>
</li>
<li id="fn7">
<p>
I should have removed all prediction points <em>outside</em> the lake as these are very far from the support of the data.<a href="#fnref7" class="footnote-back">↩</a>
</p>
</li>
<li id="fn8">
<p>
The other form is a list of data frames, each data frame being a separate loop.<a href="#fnref8" class="footnote-back">↩</a>
</p>
</li>
<li id="fn9">
<p>
It is probably worth trying a range of knots and varying their locations if you are taking this very seriously.<a href="#fnref9" class="footnote-back">↩</a>
</p>
</li>
<li id="fn10">
<p>
Sorry Hadley, I couldn't resist.<a href="#fnref10" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
Additive modelling global temperature time series: revisited
Gavin L. Simpson
2016-03-25T00:00:00-06:00
2016-03-25T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/03/25/additive-modeling-global-temperature-series-revisited/
<p>
Quite some time ago, back in 2011, I wrote a <a href="/2011/06/12/additive-modelling-and-the-hadcrut3v-global-mean-temperature-series/">post</a> that used an additive model to fit a smooth trend to the then-current Hadley Centre/CRU global temperature time series data set. Since then the media and scientific papers have been full of reports of record warm temperatures in the past couple of years, of (imagined) controversies regarding data changes made to suit the hypothesis of human-induced global warming, and the brouhaha over whether global warming had stalled; the great <a href="https://en.wikipedia.org/wiki/Global_warming_hiatus">global warming hiatus or pause</a>. So it seemed like a good time to revisit that analysis and update it using the latest HadCRUT data.
</p>
<p>
A further motivation was my reading <span class="citation" data-cites="Cahill2015-tt">Cahill et al. (2015)</span>, in which the authors use a Bayesian change point model for global temperatures. This model is essentially piecewise linear but with smooth transitions between the piecewise linear components. I don't immediately see where in their Bayesian model the smooth transitions come from, but that's what they show. My gut reaction was: why piecewise linear with smooth transitions? Why not smooth everywhere? And that's what the additive model I show here assumes.
</p>
<p>
First, I grab the data <span class="citation" data-cites="Morice2012-wk">(Morice et al., 2012)</span> from the Hadley Centre's website and load it into R
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"curl"</span><span class="p">)</span><span class="w">
</span><span class="n">tmpf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="n">curl_download</span><span class="p">(</span><span class="s2">"http://www.metoffice.gov.uk/hadobs/hadcrut4/data/current/time_series/HadCRUT.4.4.0.0.annual_ns_avg.txt"</span><span class="p">,</span><span class="w"> </span><span class="n">tmpf</span><span class="p">)</span><span class="w">
</span><span class="n">gtemp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.table</span><span class="p">(</span><span class="n">tmpf</span><span class="p">,</span><span class="w"> </span><span class="n">colClasses</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="s2">"numeric"</span><span class="p">,</span><span class="w"> </span><span class="m">12</span><span class="p">))[,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="c1"># only want some of the variables</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">gtemp</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Year"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Temperature"</span><span class="p">)</span></code></pre>
</figure>
<p>
The values in <code>Temperature</code> are anomalies relative to 1961–1990, in degrees C.
</p>
<p>
The model I fitted in the last post was
</p>
<p>
<span class="math display">\[ y = \beta_0 + f(\text{Year}) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 \boldsymbol{\Lambda}) \]</span>
</p>
<p>
where we have a smooth function of <code>Year</code> as the trend, and allow for possibly correlated residuals via the correlation matrix <span class="math inline">\(\boldsymbol{\Lambda}\)</span>.
</p>
<p>
The data set contains a partial set of observations for 2016, but seeing as that year is (at the time of writing) incomplete, I delete that observation.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">gtemp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">gtemp</span><span class="p">,</span><span class="w"> </span><span class="m">-1</span><span class="p">)</span><span class="w"> </span><span class="c1"># -1 drops the last row</span></code></pre>
</figure>
<p>
The data are shown below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">p1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">gtemp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Temperature</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w">
</span><span class="n">p1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_line</span><span class="p">()</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-plot-temperature-data-1.png" alt="HadCRUT4 global mean temperature anomaly" />
<figcaption>
HadCRUT4 global mean temperature anomaly
</figcaption>
</figure>
<p>
The model described above can be fitted using the <code>gamm()</code> function in the <strong>mgcv</strong> package. There are other options that allow one to use <code>gam()</code>, or even <code>bam()</code> in the same package, which are simpler, but I want to keep this post consistent with the one from a few years ago, so <code>gamm()</code> it is. Recall that <code>gamm()</code> represents the additive model as a mixed effects model via the well-known equivalence between random effects and splines, and fits the model using <code>lme()</code>. This allows for correlation structures in the residuals. Previously we saw that an AR(1) process in the residuals was the best fitting of the models tried, so we start with that and then try a model with AR(2) errors.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: nlme</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">This is mgcv 1.8-12. For overview type 'help("mgcv-package")'.</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Year</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gtemp</span><span class="p">,</span><span class="w"> </span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">m2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Year</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gtemp</span><span class="p">,</span><span class="w"> </span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">))</span></code></pre>
</figure>
<p>
A generalised likelihood ratio test suggests little support for the more complex AR(2) errors model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">anova</span><span class="p">(</span><span class="n">m1</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m2</span><span class="o">$</span><span class="n">lme</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> Model df AIC BIC logLik Test L.Ratio p-value
m1$lme 1 5 -277.7465 -262.1866 143.8733
m2$lme 2 6 -278.2519 -259.5799 145.1259 1 vs 2 2.50538 0.1135</code></pre>
</figure>
<p>
The AR(1) has successfully modelled most of the residual correlation
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ACF</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">acf</span><span class="p">(</span><span class="n">resid</span><span class="p">(</span><span class="n">m1</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"normalized"</span><span class="p">),</span><span class="w"> </span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">ACF</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="nf">unclass</span><span class="p">(</span><span class="n">ACF</span><span class="p">)[</span><span class="nf">c</span><span class="p">(</span><span class="s2">"acf"</span><span class="p">,</span><span class="w"> </span><span class="s2">"lag"</span><span class="p">)]),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"ACF"</span><span class="p">,</span><span class="s2">"Lag"</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">ACF</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Lag</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ACF</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_hline</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">yintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_segment</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">xend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Lag</span><span class="p">,</span><span class="w"> </span><span class="n">yend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-plot-acf-1.png" alt="Autocorrelation function of residuals from the additive model with AR(1) errors" />
<figcaption>
Autocorrelation function of residuals from the additive model with AR(1) errors
</figcaption>
</figure>
<p>
Before drawing the fitted trend, I want to put a simultaneous confidence interval around the estimate. <strong>mgcv</strong> makes this very easy to do via <em>posterior simulation</em>. To simulate from the fitted model, I have written a <code>simulate.gamm()</code> method for the <code>simulate()</code> generic that ships with R. The code below downloads the Gist containing the <code>simulate.gamm()</code> code and then uses it to simulate from the model at 200 locations over the time period of the observations. I've written about posterior simulation from GAMs before, so if the code below or the general idea isn't clear, I suggest you check out the <a href="/2011/06/12/additive-modelling-and-the-hadcrut3v-global-mean-temperature-series/">earlier post</a>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">tmpf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="n">curl_download</span><span class="p">(</span><span class="s2">"https://gist.githubusercontent.com/gavinsimpson/d23ae67e653d5bfff652/raw/25fd719c3ab699e48927e286934045622d33b3bf/simulate.gamm.R"</span><span class="p">,</span><span class="w"> </span><span class="n">tmpf</span><span class="p">)</span><span class="w">
</span><span class="n">source</span><span class="p">(</span><span class="n">tmpf</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">gtemp</span><span class="p">,</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">Year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">Year</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Year</span><span class="p">),</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)))</span><span class="w">
</span><span class="n">sims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">nsim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">newd</span><span class="p">)</span><span class="w">
</span><span class="n">ci</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">sims</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">,</span><span class="w"> </span><span class="n">quantile</span><span class="p">,</span><span class="w"> </span><span class="n">probs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">))</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">newd</span><span class="p">,</span><span class="w">
</span><span class="n">fitted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m1</span><span class="o">$</span><span class="n">gam</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">newd</span><span class="p">),</span><span class="w">
</span><span class="n">lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ci</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w">
</span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ci</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="p">])</span></code></pre>
</figure>
<p>
Having arranged the fitted values and the upper and lower simultaneous confidence intervals tidily, they can be added easily to the existing plot of the data
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lower</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upper</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="p">),</span><span class="w">
</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"grey"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-plot-fitted-trend-1.png" alt="Estimated trend in global mean temperature plus 95% simultaneous confidence interval" />
<figcaption>
Estimated trend in global mean temperature plus 95% simultaneous confidence interval
</figcaption>
</figure>
<p>
Whilst the simultaneous confidence interval shows the uncertainty in the fitted trend, it isn't as clear about what form this uncertainty takes; for example, periods where there is little change or large uncertainty are often characterised by a wide range of functional forms, not just flat, smooth functions. To get a sense of the uncertainty in the <em>shapes</em> of the simulated trends we can plot some of the draws from the posterior distribution of the model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">50</span><span class="w">
</span><span class="n">sims2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">sims</span><span class="p">[,</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="m">10000</span><span class="p">,</span><span class="w"> </span><span class="n">S</span><span class="p">)]),</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="s2">"sim"</span><span class="p">,</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">S</span><span class="p">)))</span><span class="w">
</span><span class="n">sims2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">stack</span><span class="p">(</span><span class="n">sims2</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Temperature"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Simulation"</span><span class="p">))</span><span class="w">
</span><span class="n">sims2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">sims2</span><span class="p">,</span><span class="w"> </span><span class="n">Year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">newd</span><span class="o">$</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">S</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">sims2</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Temperature</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Simulation</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-plot-simulations-1.png" alt="50 random simulated trends drawn from the posterior distribution of the fitted model" />
<figcaption>
50 random simulated trends drawn from the posterior distribution of the fitted model
</figcaption>
</figure>
<p>
If you look closely at the period 1850–1900, you'll notice a wide range of trends through this period, each of which is consistent with the fitted model but illustrates the uncertainty in the estimates of the spline coefficients. An additional factor is that these splines have a global amount of smoothness; once the smoothness parameter(s) are estimated, the smoothness allowance this affords is spread evenly over the fitted function. <em>Adaptive</em> splines would solve this problem as they in effect allow you to spread the smoothness allowance unevenly, using it sparingly where there is no smooth variation in the data and applying it liberally where there is.
</p>
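<p>
As an aside, <strong>mgcv</strong> does provide an adaptive basis via <code>bs = "ad"</code>. Below is a minimal sketch of refitting the trend with it; note that this uses <code>gam()</code> and hence ignores the AR(1) structure in the residuals, so it is only a rough illustration, and the basis dimension <code>k</code> and the object name are arbitrary choices of mine.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## adaptive smooth: the wiggliness penalty is itself allowed to vary
## smoothly along Year; no residual correlation structure here
m_ad <- gam(Temperature ~ s(Year, k = 30, bs = "ad"),
            data = gtemp, method = "REML")
plot(m_ad, shade = TRUE)</code></pre>
</figure>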
<p>
An instructive visualisation for the period of the purported pause or hiatus in global warming is to look at the shapes of the posterior simulations and the slopes of the trends for each year. I first look at the posterior simulations:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">sims2</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Temperature</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Simulation</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xlim</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1995</span><span class="p">,</span><span class="w"> </span><span class="m">2015</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylim</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0.75</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning: Removed 8750 rows containing missing values (geom_path).</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-plot-simulations-pause-period-1.png" alt="50 random simulated trends drawn from the posterior distribution of the fitted model: 1995–2015" />
<figcaption>
50 random simulated trends drawn from the posterior distribution of the fitted model: 1995–2015
</figcaption>
</figcaption>
</figure>
<p>
Whilst the plot only shows 50 of the 10,000 posterior draws, it's pretty clear that, in these data at least, there is little or no support for the pause hypothesis; most of the posterior simulations are linearly increasing over the period of interest. Only one or two show a marked shallowing of the slope of the simulated trend through the period.
</p>
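<p>
A rough way to put a number on that impression, using the <code>sims2</code> object created above, is to compute the least-squares slope of each plotted draw over 1995–2015 and count how many are negative. This is only a crude summary of the posterior, of course.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## average slope of each simulated trend over the "pause" period
pause <- subset(sims2, Year >= 1995 & Year <= 2015)
slopes <- sapply(split(pause, pause$Simulation),
                 function(d) coef(lm(Temperature ~ Year, data = d))[2])
sum(slopes < 0)   # number of draws with a negative average slope</code></pre>
</figure>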
<p>
The first derivatives of the fitted trend can be used to determine where temperatures are increasing or decreasing. Using the standard error of the derivative or posterior simulation we can also say where the confidence interval on the derivative doesn't include 0, suggesting statistically significant change in temperature.
</p>
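<p>
The essential idea can be sketched in a few lines (a minimal sketch, assuming <code>m1</code> and <code>newd</code> from above; the object names are mine): approximate the derivative of the trend by finite differences of the linear predictor matrix, then push posterior draws of the model coefficients through that matrix.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("MASS") # for mvrnorm()
## evaluate the spline at Year and Year + eps, then difference
eps <- 1e-5
X0 <- predict(m1$gam, newdata = newd, type = "lpmatrix")
X1 <- predict(m1$gam, newdata = transform(newd, Year = Year + eps),
              type = "lpmatrix")
Xp <- (X1 - X0) / eps                  # maps coefficients to f'(Year)
## posterior simulation: draw coefficient vectors, map each to a derivative
set.seed(1)
betas <- mvrnorm(1000, coef(m1$gam), vcov(m1$gam))
deriv_sims <- Xp %*% t(betas)          # 200 x 1000 simulated derivatives</code></pre>
</figure>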
<p>
The code below uses some functions I wrote to compute the first derivatives of GAM(M) model terms via posterior simulation. I've <a href="/2014/06/16/simultaneous-confidence-intervals-for-derivatives/">written about</a> this method before, so I suggest you check out that post if any of this isn't clear.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">tmpf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="n">curl_download</span><span class="p">(</span><span class="s2">"https://gist.githubusercontent.com/gavinsimpson/ca18c9c789ef5237dbc6/raw/295fc5cf7366c831ab166efaee42093a80622fa8/derivSimulCI.R"</span><span class="p">,</span><span class="w"> </span><span class="n">tmpf</span><span class="p">)</span><span class="w">
</span><span class="n">source</span><span class="p">(</span><span class="n">tmpf</span><span class="p">)</span><span class="w">
</span><span class="n">fd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">derivSimulCI</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: MASS</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">CI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">fd</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">simulations</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">quantile</span><span class="p">,</span><span class="w"> </span><span class="n">probs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">))</span><span class="w">
</span><span class="n">sigD</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">signifD</span><span class="p">(</span><span class="n">fd</span><span class="p">[[</span><span class="s2">"Year"</span><span class="p">]]</span><span class="o">$</span><span class="n">deriv</span><span class="p">,</span><span class="w"> </span><span class="n">fd</span><span class="p">[[</span><span class="s2">"Year"</span><span class="p">]]</span><span class="o">$</span><span class="n">deriv</span><span class="p">,</span><span class="w"> </span><span class="n">CI</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="n">CI</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w">
</span><span class="n">eval</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">newd</span><span class="p">,</span><span class="w">
</span><span class="n">derivative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fd</span><span class="p">[[</span><span class="s2">"Year"</span><span class="p">]]</span><span class="o">$</span><span class="n">deriv</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="c1"># computed first derivative</span><span class="w">
</span><span class="n">fdUpper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">CI</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="c1"># upper CI on first deriv</span><span class="w">
</span><span class="n">fdLower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">CI</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="c1"># lower CI on first deriv</span><span class="w">
</span><span class="n">increasing</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sigD</span><span class="o">$</span><span class="n">incr</span><span class="p">,</span><span class="w"> </span><span class="c1"># where is curve increasing?</span><span class="w">
</span><span class="n">decreasing</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sigD</span><span class="o">$</span><span class="n">decr</span><span class="p">)</span><span class="w"> </span><span class="c1"># ... or decreasing?</span></code></pre>
</figure>
<p>
A <strong>ggplot2</strong> version of the derivatives is produced using the code below. The two additional <code>geom_line()</code> calls add thick lines over sections of the derivative plot to illustrate those points where zero is <em>not</em> contained within the confidence interval of the first derivative.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">derivative</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fdUpper</span><span class="p">,</span><span class="w"> </span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fdLower</span><span class="p">),</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"grey"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">increasing</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">decreasing</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ylab</span><span class="p">(</span><span class="nf">expression</span><span class="p">(</span><span class="n">italic</span><span class="p">(</span><span class="n">hat</span><span class="p">(</span><span class="n">f</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="s2">"'"</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="n">Year</span><span class="p">)))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Year"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning: Removed 74 rows containing missing values (geom_path).</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning: Removed 190 rows containing missing values (geom_path).</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-plot-derivatives-1.png" alt="First derivative of the fitted trend plus 95% simultaneous confidence interval" />
<figcaption>
First derivative of the fitted trend plus 95% simultaneous confidence interval
</figcaption>
</figure>
<p>
Looking at this plot, despite the large (and expected) uncertainty in the derivative of the fitted trend towards the end of the observation period, the first derivatives of at least 95% of the 10,000 posterior simulations are all bounded well above zero. I'll take a closer look at this now, plotting kernel density estimates of the posterior distribution of first derivatives evaluated at each year for the period of interest.
</p>
<p>
First I generate another 10,000 simulations from the posterior of the fitted model, this time for each year in the interval 1998–2015. Then I do a little processing to get the derivatives into a format suitable for plotting with <strong>ggplot</strong>, and finally create kernel density estimate plots faceted by <code>Year</code>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">123</span><span class="p">)</span><span class="w">
</span><span class="n">nsim</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10000</span><span class="w">
</span><span class="n">pauseD</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">derivSimulCI</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nsim</span><span class="p">,</span><span class="w">
</span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">Year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1998</span><span class="p">,</span><span class="w"> </span><span class="m">2015</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)))</span><span class="w">
</span><span class="n">annSlopes</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">stack</span><span class="p">(</span><span class="n">setNames</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">pauseD</span><span class="o">$</span><span class="n">Year</span><span class="o">$</span><span class="n">simulations</span><span class="p">),</span><span class="w">
</span><span class="n">paste0</span><span class="p">(</span><span class="s2">"sim"</span><span class="p">,</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">nsim</span><span class="p">)))),</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="s2">"Derivative"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Simulations"</span><span class="p">))</span><span class="w">
</span><span class="n">annSlopes</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">annSlopes</span><span class="p">,</span><span class="w"> </span><span class="n">Year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="m">1998</span><span class="p">,</span><span class="w"> </span><span class="m">2015</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nsim</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">annSlopes</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Derivative</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"density"</span><span class="p">,</span><span class="w"> </span><span class="n">trim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">Year</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-derivatives-per-year-1.png" alt="Kernel density estimates of the first derivative of posterior simulations from the fitted trend model for selected years" />
<figcaption>
Kernel density estimates of the first derivative of posterior simulations from the fitted trend model for selected years
</figcaption>
</figure>
<p>
We can also look at the smallest derivative for each year over all of the 10,000 posterior simulations
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">minD</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">aggregate</span><span class="p">(</span><span class="n">Derivative</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">annSlopes</span><span class="p">,</span><span class="w"> </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">min</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">minD</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Derivative</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-min-sim-derivative-1.png" alt="Dotplot showing the minimum first derivative over 10,000 posterior simulations from the fitted additive model" />
<figcaption>
Dotplot showing the minimum first derivative over 10,000 posterior simulations from the fitted additive model
</figcaption>
</figure>
<p>
Only 4 of the 18 years have even a single simulation with a derivative less than 0. We can also plot all the kernel density estimates on the same plot to see if there is much variation between years (there doesn't appear to be much going on, judging from the previous figures).
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"viridis"</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">annSlopes</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Derivative</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"density"</span><span class="p">,</span><span class="w"> </span><span class="n">trim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">scale_color_viridis</span><span class="p">(</span><span class="n">option</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"magma"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"top"</span><span class="p">,</span><span class="w"> </span><span class="n">legend.key.width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unit</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="s2">"cm"</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-derivatives-single-panel-1.png" alt="Kernel density estimates of the first derivative of posterior simulations from the fitted trend model for selected years. The colour of each density estimate differentiates individual years" />
<figcaption>
Kernel density estimates of the first derivative of posterior simulations from the fitted trend model for selected years. The colour of each density estimate differentiates individual years
</figcaption>
</figure>
<p>
As anticipated, there's very little between-year shift in the slopes of the trends simulated from the posterior distribution of the model.
</p>
<p>
Returning to <span class="citation" data-cites="Cahill2015-tt">Cahill et al. (2015)</span> for a moment: the fitted trend from their Bayesian change point model is very similar to the fitted spline. There are some differences in the early part of the series; where their model has a single piecewise linear function through 1850–1900, the additive model suggests a small decrease in global temperatures leading up to 1900. Thereafter the models are very similar, with the exception that the smooth transitions between periods of increase are somewhat longer with the additive model than in the model of <span class="citation" data-cites="Cahill2015-tt">Cahill et al. (2015)</span>.
</p>
<h2 id="references" class="unnumbered">
References
</h2>
<div id="refs" class="references">
<div id="ref-Cahill2015-tt">
<p>
Cahill, N., Rahmstorf, S., and Parnell, A. C. (2015). Change points of global temperature. <em>Environmental research letters: ERL [Web site]</em> 10, 084002. doi:<a href="https://doi.org/10.1088/1748-9326/10/8/084002">10.1088/1748-9326/10/8/084002</a>.
</p>
</div>
<div id="ref-Morice2012-wk">
<p>
Morice, C. P., Kennedy, J. J., Rayner, N. A., and Jones, P. D. (2012). Quantifying uncertainties in global and regional temperature change using an ensemble of observational estimates: The HadCRUT4 data set. <em>J. Geophys. Res.</em> 117, D08101.
</p>
</div>
</div>
Better use of transfer functions?
Gavin L. Simpson
2015-12-16T00:00:00-06:00
2015-12-16T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/12/16/agu-transfer-functions/
<p>
Transfer functions have had a bit of a hard time of late following Steve Juggins' <span class="citation" data-cites="Juggins2013-dc">(2013)</span> convincing demonstration that 1) secondary gradients can influence your model, and 2) that variation down-core in a secondary variable can induce a signal in the thing being reconstructed. This was followed up by further comment on diatom-TP reconstructions <span class="citation" data-cites="Juggins2013-gf">(Juggins et al., 2013)</span>, and, not to be left out, chironomid transfer functions have come in for some heat, if the last (that I went to) IPS meeting was any indication. In a session at the 2015 Fall Meeting of the AGU, my interest was piqued by <a href="http://www.earth.northwestern.edu/~yarrow/">Yarrow Axford</a>'s talk using chironomid temperature reconstructions, but not for the reasons you might be thinking.
</p>
<p>
<a href="https://agu.confex.com/agu/fm15/meetingapp.cgi/Paper/70321">Yarrowās talk</a> covered her work on temperature reconstructions from lakes around Greenland. For some reasons that she didnāt go into the ice core records arenāt the ultimate decider of temperature trends in Greenland over the Holocene. Other temperature records are needed to better characterise variations in temperature over the last 10,000 years. Which is where the chironomids come inā¦
</p>
<p>
For those of you now expecting a rant about the abuse and misuse of transfer functions, well, sorry to disappoint. What interested me about Yarrow's talk was that she addressed upfront the potential for issues with transfer function reconstructions. This acceptance of the problems from people using transfer functions is something new to me<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> and it is a welcome development indeed.
</p>
<p>
Yarrow decided that she would only trust chironomid temperature reconstructions if they met three criteria:
</p>
<ol type="1">
<li>
that the core contains several species sensitive to temperature change, at warm and cold temperatures, and that the record wasn't dominated by aggregated taxa like <em>Tanytarsus</em> with consequently broad or no temperature sensitivity,
</li>
<li>
that, using independent training sets, the temperature reconstructions yielded the same general trend, and
</li>
<li>
that the potential for change in secondary gradients to be contained in the sediment record was minimal.
</li>
</ol>
<p>
Let's be clear: any additional thought that people put into assessing the quality of reconstructions is to be applauded. I'm not convinced by the merit or utility of all of Yarrow's rules, but this sort of thinking is refreshing.
</p>
<p>
Rule 1 seems odd to me. Perhaps this is because I don't know much about chironomids? It seems self-evident that reconstructions from assemblages dominated by non-sensitive taxa aren't to be trusted or might be subject to lots of noise or influence from secondary sources. It is also difficult to operationalise this rule; how high a proportion of the assemblage do we allow for these non-sensitive taxa before we worry about the reconstruction? I suspect this could be informed by some simulations similar to those used in Steve's Sick Science paper if someone wanted to do it.
</p>
<p>
Rule 2, for me, is the weakest. Yarrow used a NE North American chironomid-temperature data set as the main training set because of the geographical location of her lake sites, but used a separate training set of <a href="http://www.southampton.ac.uk/geography/about/staff/pgl.page">Pete Langdon</a>'s from Iceland as the independent data set. This Iceland data set used different taxonomic decisions and groupings, the idea being that if similar reconstructions were produced using it, we can have more confidence in the reconstruction. The problem with all this, however, is that reconstructions generated by independent training sets aren't independent, because they obviously use the same core assemblage data.
</p>
<p>
Transfer functions are largely just fancy filters of assemblage data. To generalise broadly: if the species composition changes we'll see a change in the reconstructed values, and the magnitude of this change in the reconstruction is determined by whether or not the species that are changing in abundance are important indicators, in the training set, for the variable of interest. This is where the real elephant in the transfer function room lives; no matter how carefully you build your training set, you are always at the mercy of whatever signals your lake recorded in the sediments. I'm getting ahead of myself, however.
</p>
<p>
As far as all this pertains to Yarrow's Rule 2, we must be careful not to think of these different reconstructions as being <strong>independent</strong>. We have only one record of compositional change, so we can't generate radically different reconstructions unless, that is, the training sets contain radically different species-environment relationships. I find it hard to believe that any training set from comparable environments will embed radically different species-environment relationships; organisms like chironomids just don't seem built that way.
</p>
<p>
So where does that leave Rule 2? I would say that if the reconstructions produced are qualitatively different (different trends, implications, …), that should set the alarm bells ringing. There's clearly something in the reconstruction that is sensitive to the sorts of taxonomic aggregations that differentiate the training sets.
</p>
<p>
But what if the reconstructions are qualitatively similar? I'm far from convinced that this should give any assurance that the reconstruction is any more reliable than before. It could just as easily be that any secondary gradients induce trends in the reconstructions in the same way in both training sets.
</p>
<p>
Which brings me to Yarrow's Rule 3. Just as we minimise, to the best of our ability, the secondary gradients in training sets, minimising the potential for secondary influences in the core record is just as important. Yarrow did this in her research by choosing lakes in catchments with no vegetation or soil to speak of; from the photo she showed of one of her sites, she nailed this one!
</p>
<p>
Development of soils and vegetation in catchments has profound effects on the lake ecosystem, especially on the forms and sources of nutrients and other compounds entering the lake. Such effects have logical consequences for the lake biota. Now, while the initial development of soil processes and vegetation in the Arctic at the end of the last glacial and the start of the Holocene was clearly temperature driven, if you are interested in temperature variation throughout the Holocene there are lots of things that might affect nutrient inputs from catchments, or modify the in-lake environment, that are <em>not</em> driven by temperature. In those circumstances, if your interest is in neoglacial cooling, the Medieval Warm Period, etc., interference from these secondary gradients can be a real problem.
</p>
<p>
What really impressed me about Yarrow's use of the transfer functions was that clearly a lot of thought had gone into site selection and how best to guard against the inherent problems in the methods. Perhaps I've been away from jobbing palaeolimnologists for too long<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a> but, quibbles about Rules 1 and 2 aside, this is welcome and long overdue attention that we need more of.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Juggins2013-dc">
<p>
Juggins, S. (2013). Quantitative reconstructions in palaeolimnology: New paradigm or sick science? <em>Quaternary science reviews</em> 64, 20ā32. doi:<a href="https://doi.org/10.1016/j.quascirev.2012.12.014">10.1016/j.quascirev.2012.12.014</a>.
</p>
</div>
<div id="ref-Juggins2013-gf">
<p>
Juggins, S., John Anderson, N., Ramstack Hobbs, J. M., and Heathcote, A. J. (2013). Reconstructing epilimnetic total phosphorus using diatoms: Statistical and ecological constraints. <em>Journal of paleolimnology</em> 49, 373ā390. doi:<a href="https://doi.org/10.1007/s10933-013-9678-x">10.1007/s10933-013-9678-x</a>.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
Having not been at the recent IPS meeting in Lanzhou, I'm even further removed from the application of transfer functions these days. I was aware that there had been some movement on both sides to identify ways forward for people wanting to implement or create reconstructions, however.<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
<li id="fn2">
<p>
I've only been away from the ECRC for coming on three years!<a href="#fnref2" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
AGU Fall Meeting 2015
Gavin L. Simpson
2015-12-14T00:00:00-06:00
2015-12-14T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/12/14/agu-2015-poster/
<p>
My poster, <em>Rapid ecological change in lake ecosystems</em> (GC13G-1236), in the session <em>Sedimentary records of threshold change</em> (GC13G, Moscone South Poster Hall, 1340–1800, Monday 14th December), describes some of my recent research into methods to analyse palaeoenvironmental time series from sediment cores. Using data from a varved lake, Baldeggersee, Switzerland, I use location-scale generalised additive models to simultaneously model the mean (trend) and the variance of a time series of diatom counts. Wavelets were used to investigate further variation in species dynamics during the well-documented history of eutrophication at the lake.
</p>
<p>
Both of these techniques may be applied to data from less ideal situations, where observations are irregularly sampled in time and subject to varying sample intervals and degrees of time averaging.
</p>
<p>
A PDF of my poster can be downloaded from <a href="https://doi.org/10.6084/m9.figshare.2008245">Figshare</a>.
</p>
<div style="margin-left: auto; margin-right: auto; width: 700px; height: 601px;">
<p>
<iframe src="https://widgets.figshare.com/articles/2008245/embed?show_title=1" frameborder="0" width="700" height="601">
</iframe>
</p>
</div>
Are some seasons warming more than others?
Gavin L. Simpson
2015-11-23T00:00:00-06:00
2015-11-23T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/11/23/are-some-seasons-warming-more-than-others/
<p>
I ended the <a href="/2015/11/21/climate-change-and-spline-interactions/">last post</a> with some pretty plots of air temperature change within and between years in the <a href="http://www.metoffice.gov.uk/hadobs/hadcet/">Central England Temperature series</a>. The elephant in the room<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> at the end of that post was <em>is the change in the within-year (seasonal) effect over time statistically significant?</em> This is the question I'll try to answer, or at least show how to answer, now.
</p>
<p>
The model I fitted in the last post was
</p>
<p>
\[ y = \beta_0 + f(x_1, x_2) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2) \]
</p>
<p>
and allowed, as we saw, for the within-year spline/effect to vary smoothly with the trend or between-year effect. Answering our scientific question requires that we determine whether the spline interaction model (above) fits the data significantly better than the additive model
</p>
<p>
\[ y = \beta_0 + f_{\mathrm{seasonal}}(x_1) + f_{\mathrm{trend}}(x_2) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2) \]
</p>
<p>
which has a fixed seasonal effect.
</p>
<p>
The model we ended up with was the spline interaction with an AR(7) in the residuals. To catch you up, the chunk below loads the CET data and fits the model we were left with at the end of the <a href="/2015/11/21/climate-change-and-spline-interactions/">previous post</a>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: nlme
This is mgcv 1.8-9. For overview type 'help("mgcv-package")'.</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: methods</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">source</span><span class="p">(</span><span class="n">con</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">url</span><span class="p">(</span><span class="s2">"http://bit.ly/loadCET"</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"libcurl"</span><span class="p">))</span><span class="w">
</span><span class="n">close</span><span class="p">(</span><span class="n">con</span><span class="p">)</span><span class="w">
</span><span class="n">cet</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">loadCET</span><span class="p">()</span><span class="w">
</span><span class="c1">## need a list with gamm default for verbose output</span><span class="w">
</span><span class="n">ctrl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">niterEM</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">optimMethod</span><span class="o">=</span><span class="s2">"L-BFGS-B"</span><span class="p">,</span><span class="w"> </span><span class="n">maxIter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">msMaxIter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="c1">## knots - see previous post</span><span class="w">
</span><span class="n">knots</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">nMonth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">),</span><span class="w"> </span><span class="m">12.5</span><span class="p">))</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">te</span><span class="p">(</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"cr"</span><span class="p">,</span><span class="s2">"cc"</span><span class="p">),</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="m">12</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">,</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">,</span><span class="w">
</span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">7</span><span class="p">))</span></code></pre>
</figure>
<p>
To answer our question, we want to fit the following two pseudo-code models and compare them using a likelihood ratio test
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">x2</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">foo</span><span class="p">)</span><span class="w">
</span><span class="n">m0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x2</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">foo</span><span class="p">)</span><span class="w">
</span><span class="n">anova</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">m0</span><span class="p">)</span></code></pre>
</figure>
<p>
As is often the case in the real world, things aren't quite so simple; there are several issues we need to take care of if we are going to really be testing nested models and the smooth terms that we're interested in. Specifically, we need to
</p>
<ol type="1">
<li>
ensure that the models really are nested models,
</li>
<li>
fit using maximum likelihood (<code>method = "ML"</code>), not residual maximum likelihood (<code>method = "REML"</code>), because the two models have different <em>fixed</em> effects, and
</li>
<li>
fit the same AR(7) process in the residuals in both models.
</li>
</ol>
<p>
To compare additive models we really want to ensure that the fixed effects parts are properly nested and appropriate for an ANOVA-like decomposition into <em>main</em> effects and <em>interactions</em>. <strong>mgcv</strong> provides a very simple way to achieve this via a tensor product interaction smooth and the <code>ti()</code> function. <code>ti()</code> smooths are created in the same way as the <code>te()</code> smooth we encountered in the last post, but unlike <code>te()</code>, <code>ti()</code> smooths do <em>not</em> incorporate the main effects of the terms involved in the smooth. It is therefore assumed that you have included the main effect smooths in the model formula yourself.
</p>
<p>
Hence we can now fit models like
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x2</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ti</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">x2</span><span class="p">)</span></code></pre>
</figure>
<p>
and be certain that the <code>s(x1)</code> and <code>s(x2)</code> terms in each model are equivalent. Note that you can use <code>s()</code> or <code>ti()</code> for these main effects components; if you have a single variable involved in a <code>ti()</code> term you get the main effect, as the sketch below shows. I'm going to use <code>s()</code> in the code below, because I had better experience fitting the <code>gamm()</code> models we're using with <code>s()</code> rather than <code>ti()</code> main effects.
</p>
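<p>
To make that concrete, here is a minimal sketch of the two equivalent ways of writing the decomposed model, using hypothetical covariates <code>x1</code> and <code>x2</code>:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## a ti() smooth of a single variable is just that variable's main-effect
## smooth, so these two decompositions are equivalent
y ~ ti(x1) + ti(x2) + ti(x1, x2)
y ~ s(x1) + s(x2) + ti(x1, x2)</code></pre>
</figure>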
<p>
Fitting with maximum likelihood instead of residual maximum likelihood is just a simple matter of using <code>method = "ML"</code> in the <code>gamm()</code> call.
</p>
<p>
The last thing we need to fix before we proceed is making sure that the main effects model and the main effects plus interaction model both incorporate the same AR(7) process that we fitted originally and which we refitted here earlier as <code>m</code>. To achieve this, we need to supply the AR coefficients to <code>corARMA()</code> when fitting our decomposed models, and indicate that <code>gamm()</code> (well, the underlying <code>lme()</code> code) shouldn't try to estimate any of the parameters for the AR(7) process.
</p>
<p>
We can access the AR coefficients of <code>m</code> through the <code>intervals()</code> extractor function and a little bit of digging. In the chunk below I store the AR(7) coefficients in the object <code>phi</code>. Now, when fitting the <code>gamm()</code> models, we have to pass <code>value = phi, fixed = TRUE</code> to the <code>corARMA()</code> part of the model call to have it use the supplied coefficients instead of estimating a new set.
</p>
<p>
We are now ready to fit our two models to test whether the interaction smooth is required
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">phi</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unname</span><span class="p">(</span><span class="n">intervals</span><span class="p">(</span><span class="n">m</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">which</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"var-cov"</span><span class="p">)</span><span class="o">$</span><span class="n">corStruct</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="n">m1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cr"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cc"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ti</span><span class="p">(</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"cr"</span><span class="p">,</span><span class="s2">"cc"</span><span class="p">),</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">12</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ML"</span><span class="p">,</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">,</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">,</span><span class="w">
</span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">fixed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">7</span><span class="p">))</span><span class="w">
</span><span class="n">m0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cr"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cc"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ML"</span><span class="p">,</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">,</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">,</span><span class="w">
</span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">fixed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">7</span><span class="p">))</span></code></pre>
</figure>
<p>
The <code>anova()</code> method is used to compare the fitted models
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">anova</span><span class="p">(</span><span class="n">m0</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m1</span><span class="o">$</span><span class="n">lme</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> Model df AIC BIC logLik Test L.Ratio p-value
m0$lme 1 5 14750.9 14782.70 -7370.449
m1$lme 2 7 14706.0 14750.52 -7346.001 1 vs 2 48.89479 <.0001</code></pre>
</figure>
<p>
There is clear support for <code>m1</code>, the model that allows the seasonal smooth to vary as a smooth function of the trend, over the model with purely additive effects.
</p>
<p>
What does our model say about the change in monthly temperature over the past century? Below I simply predict the temperature for each month in 1914 and 2014 and then compute the difference between years.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">cet</span><span class="p">,</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">Year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1914</span><span class="p">,</span><span class="w"> </span><span class="m">2014</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">),</span><span class="w">
</span><span class="n">nMonth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)))</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="o">$</span><span class="n">gam</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdat</span><span class="p">)</span><span class="w">
</span><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">fitted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">fYear</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.factor</span><span class="p">(</span><span class="n">Year</span><span class="p">))</span><span class="w">
</span><span class="n">dif</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">Month</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w">
</span><span class="n">Difference</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="p">[</span><span class="n">Year</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">2014</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">fitted</span><span class="p">[</span><span class="n">Year</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1914</span><span class="p">]))</span></code></pre>
</figure>
<p>
A plot of the temperature differences<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a> is shown below, produced by the following code
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">dif</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Difference</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Month</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">expression</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">difference</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">degree</span><span class="o">*</span><span class="n">C</span><span class="p">),</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Month"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># minimal theme</span><span class="w">
</span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="c1"># tweak where the x-axis ticks are</span><span class="w">
</span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">month.abb</span><span class="p">,</span><span class="w"> </span><span class="c1"># & with what labels</span><span class="w">
</span><span class="n">minor_breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1.2</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.1</span><span class="p">),</span><span class="w">
</span><span class="n">minor_breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/are-some-seasons-warming-more-than-others-plot-1-1.png" alt="Difference in monthly temperature predictions between 1914 and 2014" />
<figcaption>
Difference in monthly temperature predictions between 1914 and 2014
</figcaption>
</figure>
<p>
Most months have seen at least a ~0.5°C increase in mean temperature between 1914 and 2014, with October and November both experiencing over a degree of warming over the period.
</p>
<p>
Before I finish, it is instructive to look at what the <code>ti()</code> term in the decomposed model looks like and represents
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">layout</span><span class="p">(</span><span class="n">matrix</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w">
</span><span class="n">op</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">par</span><span class="p">(</span><span class="n">mar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">m1</span><span class="o">$</span><span class="n">gam</span><span class="p">,</span><span class="w"> </span><span class="n">pers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">op</span><span class="p">)</span><span class="w">
</span><span class="n">layout</span><span class="p">(</span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/are-some-seasons-warming-more-than-others-plot-smooths-1.png" alt="Smooths for the spline interaction model including a tensor product interaction smooth" />
<figcaption>
Smooths for the spline interaction model including a tensor product interaction smooth
</figcaption>
</figure>
<p>
The first two terms are the overall trend and seasonal cycle respectively. The third term, shown as a perspective plot, is the tensor product interaction term. This term reflects the amount by which the fitted temperature is adjusted from the overall trend and seasonal cycle for any combination of month and year.
</p>
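<p>
If you want those adjustments as numbers rather than as a picture, the contribution of each smooth to the fitted values can be extracted via <code>predict()</code> with <code>type = "terms"</code>. A minimal sketch, reusing the <code>pdat</code> prediction data from earlier; the exact column label for the interaction term is an assumption, so check <code>colnames(trms)</code> first:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## per-term contributions to the fitted values; one column per model term
trms <- predict(m1$gam, newdata = pdat, type = "terms")
colnames(trms)                    # check the exact term labels
## the ti() column is the adjustment over and above trend + seasonal cycle
adj <- trms[, "ti(Year,nMonth)"]</code></pre>
</figure>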
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
well, one of the elephants; I also wasn't happy with the AR(7) for the residuals<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
<li id="fn2">
<p>
If I was being more thorough, I could use the prediction matrix feature of <code>gam()</code> models to put approximate confidence intervals on these differences, as sketched below.<a href="#fnref2" class="footnote-back">↩</a>
</p>
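<p>
A minimal sketch of that approach, assuming the <code>m</code> and <code>pdat</code> objects from the main text:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Xp maps the model coefficients to predictions at the rows of pdat
Xp <- predict(m$gam, newdata = pdat, type = "lpmatrix")
## differencing the rows for 2014 and 1914 gives differences of fitted values
Xdif <- Xp[pdat$Year == 2014, ] - Xp[pdat$Year == 1914, ]
diffs <- drop(Xdif %*% coef(m$gam))                  # the monthly differences
se <- sqrt(rowSums((Xdif %*% vcov(m$gam)) * Xdif))   # their standard errors
upr <- diffs + qnorm(0.975) * se                     # approximate 95% interval
lwr <- diffs - qnorm(0.975) * se</code></pre>
</figure>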
</li>
</ol>
</section>
Climate change and spline interactions
Gavin L. Simpson
2015-11-21T00:00:00-06:00
2015-11-21T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/11/21/climate-change-and-spline-interactions/
<p>
In a series of irregular posts<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> I've looked at how additive models can be used to fit non-linear models to time series. Up to now I've looked at models that included a single non-linear trend, as well as a model that included a within-year (or seasonal) part and a trend part. In this trend <em>plus</em> season model it is important to note that the two terms are purely additive; no matter which January you are predicting for in a long time series, the seasonal effect for that month will always be the same. The trend part might shift this seasonal contribution up or down a bit, but all Januarys are the same. In this post I want to introduce a different type of spline interaction model that will allow us to relax this additivity assumption and fit a model that allows the seasonal part of the model to change in time along with the trend.
</p>
<p>
As with previous posts, I'll be using the Central England Temperature time series as an example. The data require a bit of processing to get them into a format useful for modelling, so I've written a <a href="https://gist.github.com/gavinsimpson/526ae3e1b02d333d85e4">little function</a>, <code>loadCET()</code>, that downloads the data and processes it for you. To load the function into R, run the following
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">source</span><span class="p">(</span><span class="n">con</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">url</span><span class="p">(</span><span class="s2">"http://bit.ly/loadCET"</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"libcurl"</span><span class="p">))</span><span class="w">
</span><span class="n">close</span><span class="p">(</span><span class="n">con</span><span class="p">)</span><span class="w">
</span><span class="n">cet</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">loadCET</span><span class="p">()</span></code></pre>
</figure>
<p>
We also need a couple of packages for model fitting and plotting
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: nlme
This is mgcv 1.8-9. For overview type 'help("mgcv-package")'.</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: methods</code></pre>
</figure>
<p>
OK, let's begin…
</p>
<p>
As previously, if we think about a time series where observations were made on a number of occasions within any given year over a number of years, we may want to model the following features of the data
</p>
<ol type="1">
<li>
any trend or long term change in the level of the time series, and
</li>
<li>
any seasonal or within-year variation, and
</li>
<li>
any variation in, or interaction between, the trend and seasonal features of the data.
</li>
</ol>
<p>
In a <a href="/2014/05/09/modelling-seasonal-data-with-gam/">previous post</a> I tackled features <em>1</em> and <em>2</em>, but it is feature <em>3</em> that is of interest now. Our model for features <em>1</em> and <em>2</em> was
</p>
<p>
\[ y = \beta_0 + f_{\mathrm{seasonal}}(x_1) + f_{\mathrm{trend}}(x_2) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2) \]
</p>
<p>
where \(\beta_0\) is the intercept, \(f_{\mathrm{seasonal}}\) and \(f_{\mathrm{trend}}\) are smooth functions for the seasonal and trend features we're interested in, and \(x_1\) and \(x_2\) are two covariates providing some form of time indicator for the within-year and between-year times.
</p>
<p>
To allow for an interaction between \(f_{\mathrm{seasonal}}\) and \(f_{\mathrm{trend}}\) we will need to fit the following model instead
</p>
<p>
\[ y = \beta_0 + f(x_1, x_2) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2) \]
</p>
<p>
Notice now that \(f(x_1, x_2)\) is a smooth function of our two time variables, and for simplicity's sake let's say that the within-year variable will just be the numeric month indicator (1, 2, …, 12) and the between-year variable will be the calendar year of the observation. In previous posts I've used a derived time variable instead of calendar year for the trend, but doing that here is largely redundant; the data seem well modelled even if we don't allow for a trend within-year, and doing some useful or interesting things with the model once fitted is much simplified if we just use observation year for the trend.
</p>
<p>
In pseudo <strong>mgcv</strong> code we are going to fit the following model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">mod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">te</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">x2</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">foo</span><span class="p">)</span></code></pre>
</figure>
<p>
The <code>te()</code> represents a tensor product smooth of the indicated variables. We won't be using <code>s()</code> because our two time variables are unrelated, and we want to allow for more variation in one of the variables than the other; multivariate <code>s()</code> smooths are isotropic, so they're good for things like spatial coordinates but not for things measured in different units or having more variation in one variable than the other. I'm not going to go into the detail of tensor product smooths; that's covered in Simon Wood's <a href="https://www.crcpress.com/Generalized-Additive-Models-An-Introduction-with-R/Wood/9781584884743">rather excellent book</a>.
</p>
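<p>
To make that distinction concrete, here is a minimal sketch, using the same hypothetical <code>y</code>, <code>x1</code>, <code>x2</code>, and data frame <code>foo</code> as in the pseudo-code above, contrasting the isotropic and tensor product forms; only the second is appropriate when the covariates are on unrelated scales
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("mgcv")
## isotropic smooth: one wiggliness penalty shared by both covariates;
## only sensible if x1 and x2 are on the same scale (e.g. spatial coordinates)
m_iso <- gam(y ~ s(x1, x2), data = foo)
## tensor product smooth: separate marginal bases and penalties,
## so each covariate can have its own amount of wiggliness
m_te  <- gam(y ~ te(x1, x2), data = foo)</code></pre>
</figure>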
<p>
Another detail that we need to consider is knot placement. Previously I used a cyclic spline for the within-year term and allowed <code>gam()</code> to select the knots for the spline from the data. This meant that the boundary knots were at months 1 and 12. That worked OK when I was modelling daily data, with the within-year term in Julian day: the boundary knots were at days 1 and 366 and it didn't matter much whether December 31<sup>st</sup> was <em>exactly</em> the same as January 1<sup>st</sup>. But with monthly data like this it is a bit of a problem; we don't expect December and January to be <em>exactly</em> the same. This problem was <a href="http://www.fromthebottomoftheheap.net/2014/05/09/modelling-seasonal-data-with-gam/#comment-1964880067">anticipated</a> in the comments of the previous post by a reader, and I sort of dismissed it. Well, I was wrong, and it took me until I set about interrogating the model that I'll fit shortly to realise it.
</p>
<p>
What we need to do is place boundary knots just beyond the data, such that the distance between December and January is the same as the distance between any other month. Placing boundary knots at (0.5, 12.5) achieves this. We then have 10 more interior knots to play with (assuming 12 knots overall, which is what I specify for <code>k</code> below), so I just place those, spread evenly between 1 and 12 (the inner <code>seq()</code> call).
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">knots</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">nMonth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">),</span><span class="w"> </span><span class="m">12.5</span><span class="p">))</span></code></pre>
</figure>
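<p>
As a quick sanity check (nothing below depends on this), we can confirm that the knot vector has the 12 values needed to match <code>k = 12</code> and that the boundary knots sit half a month beyond the data
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">length(knots$nMonth)  # 12, matching k = 12 for the cyclic smooth
range(knots$nMonth)   # 0.5 12.5, the boundary knots</code></pre>
</figure>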
<p>
Having dealt with those details, we can fit some models; here I fit models with the same fixed effects parts (the spline interaction) but with differing stochastic trend models in the residuals.
</p>
<p>
To assist our selection of the stochastic model in the residuals, we fit a naive model that assumes independence of observations
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">te</span><span class="p">(</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"cr"</span><span class="p">,</span><span class="s2">"cc"</span><span class="p">),</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="m">12</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span></code></pre>
</figure>
<p>
Plotting the autocorrelation function (ACF) of the normalized residuals from the <code>$lme</code> part of this model fit, we can start to think about plausible models for the residuals. Remember though that we are going to nest this within year, so we're only going to be able to do anything about the first 12 lags, even though I'll still show the default number of lags.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">plot(acf(resid(m0$lme, type = "normalized")))</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/climate-change-and-spline-interactions-plot-naive-acf-1.png" alt="ACF for model m0 a naive additive model assuming conditional independence of observations fitted to the CET time series" />
<figcaption>
ACF for model <code>m0</code>, a naive additive model assuming conditional independence of observations, fitted to the CET time series
</figcaption>
</figure>
<p>
In the ACF we see lingering correlations out to lags 7 or 8, and then longer-range correlations at lags beyond a year. These latter correlations are the between-year temporal signal that we aren't capturing perfectly with the temporal trend component of the model fit. We're going to ignore these, for now at least; I may return to look at them in a future post.
</p>
<p>
From the ACF (and a bit of fiddling, err… EDA) it looks like AR terms are needed to model this residual autocorrelation. Hence the stochastic trend models are AR(<em>p</em>), for <em>p</em> in {1, 2, …, 8}. The ARMA is nested within year, as previously; with the switch to using calendar year for the trend term, I would anticipate stronger within-year autocorrelation in the residuals, or possibly a more complex structure, than observed in earlier fits<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>.
</p>
<p>
If you want to fit all the models, great; I'll get to you in a moment, just don't look at the value of <code>p</code> in the chunk below! If you just want to skip ahead, fit the following model and then move right along to the <a href="#nextsection">next section</a>, thus saving yourself in the region of 10 minutes (on a fast-as-hell Xeon workstation) of thumb twiddling
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ctrl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">niterEM</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">optimMethod</span><span class="o">=</span><span class="s2">"L-BFGS-B"</span><span class="p">,</span><span class="w"> </span><span class="n">maxIter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">msMaxIter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">te</span><span class="p">(</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"cr"</span><span class="p">,</span><span class="s2">"cc"</span><span class="p">),</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="m">12</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">,</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">,</span><span class="w">
</span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">7</span><span class="p">))</span></code></pre>
</figure>
<p>
For those of you in for the long haul, hereās a loop<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a> that will fit the models with varying AR terms for us
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ctrl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">niterEM</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">optimMethod</span><span class="o">=</span><span class="s2">"L-BFGS-B"</span><span class="p">,</span><span class="w"> </span><span class="n">maxIter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">msMaxIter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">8</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">te</span><span class="p">(</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"cr"</span><span class="p">,</span><span class="s2">"cc"</span><span class="p">),</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="m">12</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">,</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">,</span><span class="w">
</span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">))</span><span class="w">
</span><span class="n">assign</span><span class="p">(</span><span class="n">paste0</span><span class="p">(</span><span class="s2">"m"</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">),</span><span class="w"> </span><span class="n">m</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
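<p>
If you would rather not litter the workspace with <code>assign()</code>, here is an equivalent sketch that collects the fits in a list instead (the list name <code>mods</code> is mine); it fits exactly the same eight models
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## store the AR(p) fits in a named list, p = 1, ..., 8
mods <- lapply(1:8, function(p) {
    gamm(Temperature ~ te(Year, nMonth, bs = c("cr","cc"), k = c(10,12)),
         data = cet, method = "REML", control = ctrl, knots = knots,
         correlation = corARMA(form = ~ 1 | Year, p = p))
})
names(mods) <- paste0("m", 1:8)</code></pre>
</figure>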
<p>
A generalised likelihood ratio test can be used to assess which correlation structure fits best
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">anova</span><span class="p">(</span><span class="n">m1</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m2</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m3</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m4</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m5</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m6</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m7</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m8</span><span class="o">$</span><span class="n">lme</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> Model df AIC BIC logLik Test L.Ratio p-value
m1$lme 1 6 14849.98 14888.13 -7418.988
m2$lme 2 7 14836.78 14881.29 -7411.389 1 vs 2 15.197206 0.0001
m3$lme 3 8 14810.73 14861.60 -7397.365 2 vs 3 28.047345 <.0001
m4$lme 4 9 14784.63 14841.86 -7383.314 3 vs 4 28.101617 <.0001
m5$lme 5 10 14778.35 14841.95 -7379.177 4 vs 5 8.275739 0.0040
m6$lme 6 11 14776.49 14846.44 -7377.244 5 vs 6 3.865917 0.0493
m7$lme 7 12 14762.45 14838.77 -7369.227 6 vs 7 16.032363 0.0001
m8$lme 8 13 14764.33 14847.01 -7369.167 7 vs 8 0.119909 0.7291</code></pre>
</figure>
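<p>
The <code>anova()</code> output already tabulates AIC and BIC, but if you collected the fits in a list as sketched above you could also extract an information criterion directly, for example
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## AIC for each candidate correlation structure; lower is better
sapply(mods, function(fit) AIC(fit$lme))</code></pre>
</figure>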
<p>
Lo and behold, the AR(7) turns out to have the best fit as assessed by a range of metrics. If we now look at the ACF of the normalized residuals for this model we see that all the within-year autocorrelation has been accounted for, leaving a little bit of correlation at lags just longer than a year.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">acf</span><span class="p">(</span><span class="n">resid</span><span class="p">(</span><span class="n">m7</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"normalized"</span><span class="p">)))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/climate-change-and-spline-interactions-plot-best-acf-1.png" alt="ACF for model m7 an additive model with an AR(7) process in the residuals fitted to the CET time series" />
<figcaption>
ACF for model <code>m7</code>, an additive model with an AR(7) process in the residuals, fitted to the CET time series
</figcaption>
</figure>
<p>
At this stage we can probably proceed without too much worry ā although an AR(7) is quite a complex model to fit, so we should remain a little cautious.
</p>
<p>
Before we move on, to bring us up to speed with the people that jumped ahead, copy <code>m7</code> into object <code>m</code> so the code in the next section works for you too.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">m7</span></code></pre>
</figure>
<h2 id="nextsection">
Interrogating the fitted model
</h2>
<p>
Iām going to cut to the chase and look at the fitted model and use it to ask some questions about how temperature has changed both within and between years over the last 100 years. In part 2 of this post Iāll look at doing inference on the fitted model, but for now Iāll skip that.
</p>
<p>
First, letās visualise the fitted spline; this requires a 3D plot so it gets somewhat tricky to really see whatās going on, but here goes
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">m</span><span class="o">$</span><span class="n">gam</span><span class="p">,</span><span class="w"> </span><span class="n">pers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/climate-change-and-spline-interactions-plot-gam-1.png" alt="Fitted bivariate spline" />
<figcaption>
Fitted bivariate spline
</figcaption>
</figure>
<p>
This is quite a useful visualisation as it illustrates how the model represents longer-term trends, seasonal cycles, and how these vary in relation to one another. Viewed one way, we have estimates of trends over years for each month. Alternatively, we could see the model as giving an estimate of the seasonal cycle for each year. Each year can have a different seasonal cycle and each month a different trend. If there were no interaction, there would be no change in the seasonal pattern over time; equivalently, all months would have the same trend over years. This figure also sucks; it's 3D but static, and the scale of the trend, and of any change in the seasonal cycle over time, is swamped by the magnitude of the seasonal cycle itself.
</p>
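<p>
One way around the static-3D problem, though I make no claims about how pretty the result is, is to draw the same surface as a contour plot with <strong>mgcv</strong>'s <code>vis.gam()</code>; a minimal sketch
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## contour view of the fitted Year-by-Month surface
vis.gam(m$gam, view = c("Year", "nMonth"), plot.type = "contour",
        color = "topo", too.far = 0.1)  # too.far blanks grid cells far from data</code></pre>
</figure>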
<h3 id="predict-monthly-temperature-for-the-years-1914-and-2014">
Predict monthly temperature for the years 1914 and 2014
</h3>
<p>
In the first illustrative use of the fitted model, I'll predict within-year temperatures for two years, 1914 and 2014, to look at how different the seasonal cycle is after 100 years<a href="#fn4" class="footnote-ref" id="fnref4"><sup>4</sup></a> of climate change (time). The first step is to produce the values of the covariates that we want to predict at. In the snippet below I generate 100 <code>1914</code>s followed by 100 <code>2014</code>s for <code>Year</code>, and within these years we have 100 evenly-spaced values on the interval (1,12) for <code>nMonth</code>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">cet</span><span class="p">,</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">Year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1914</span><span class="p">,</span><span class="w"> </span><span class="m">2014</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">),</span><span class="w">
</span><span class="n">nMonth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">),</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)))</span></code></pre>
</figure>
<p>
Next, the <code>predict()</code> method generates predicted values for the new data pairs, with standard errors for each predicted value
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="o">$</span><span class="n">gam</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">crit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">qt</span><span class="p">(</span><span class="m">0.975</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df.residual</span><span class="p">(</span><span class="n">m</span><span class="o">$</span><span class="n">gam</span><span class="p">))</span><span class="w"> </span><span class="c1"># ~95% interval critical t</span><span class="w">
</span><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">fitted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">se.fit</span><span class="p">,</span><span class="w"> </span><span class="n">fYear</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.factor</span><span class="p">(</span><span class="n">Year</span><span class="p">))</span><span class="w">
</span><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w">
</span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w">
</span><span class="n">lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">))</span></code></pre>
</figure>
<p>
The first <code>transform()</code> adds <code>fitted</code>, <code>se</code>, and <code>fYear</code> variables to <code>pdat</code> for the predictions, their standard errors, and a factor for <code>Year</code> that Iāll use in plotting shortly. The second <code>transform()</code> call adds <code>upper</code> and <code>lower</code> variables containing the upper and lower <em>pointwise</em> confidence bounds, here for an approximate 95% interval.
</p>
<p>
A plot, using the <strong>ggplot2</strong> package, of the predicted monthly temperatures for 1914 and 2014 is created in the next chunk. It's a little involved, as I wanted to modify a few things and change the name of the legend to make it look nice; I've commented the lines to indicate what they do
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fYear</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lower</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upper</span><span class="p">,</span><span class="w">
</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fYear</span><span class="p">),</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># confidence band</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fYear</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># predicted temperatures</span><span class="w">
</span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># minimal theme</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"top"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># push legend to the top</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">expression</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="p">(</span><span class="n">degree</span><span class="o">*</span><span class="n">C</span><span class="p">)),</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Year"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># correct legend name</span><span class="w">
</span><span class="n">scale_colour_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Year"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="c1"># tweak where the x-axis ticks are</span><span class="w">
</span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">month.abb</span><span class="p">,</span><span class="w"> </span><span class="c1"># & with what labels</span><span class="w">
</span><span class="n">minor_breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w">
</span><span class="n">p1</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/climate-change-and-spline-interactions-predict-plot-1-1.png" alt="Predicted monthly temperature for 1914 and 2014" />
<figcaption>
Predicted monthly temperature for 1914 and 2014
</figcaption>
</figure>
<p>
Looking at the plot, most of the action appears in the autumn and winter months.
</p>
<h3 id="predict-trends-for-each-month-19142014">
Predict trends for each month, 1914ā2014
</h3>
<p>
The second use of the fitted model will be to predict trends in temperature for each month over the period 1914–2014. For this we need a different set of new values to predict at than before; here I repeat the values 1914–2014 twelve times each and the sequence 1, 2, …, 12 a total of 101 times, once per year of the period of interest.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pdat2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">cet</span><span class="p">,</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">Year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1914</span><span class="o">:</span><span class="m">2014</span><span class="p">,</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">),</span><span class="w">
</span><span class="n">nMonth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">101</span><span class="p">)))</span></code></pre>
</figure>
<p>
Next we repeat the earlier steps to predict from the model and set up an object for plotting with <code>ggplot()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pred2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="o">$</span><span class="n">gam</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdat2</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1">## add predictions & SEs to the new data ready for plotting</span><span class="w">
</span><span class="n">pdat2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdat2</span><span class="p">,</span><span class="w">
</span><span class="n">fitted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred2</span><span class="o">$</span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="c1"># predicted values</span><span class="w">
</span><span class="n">se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred2</span><span class="o">$</span><span class="n">se.fit</span><span class="p">,</span><span class="w"> </span><span class="c1"># standard errors</span><span class="w">
</span><span class="n">fMonth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="nb">month.abb</span><span class="p">[</span><span class="n">nMonth</span><span class="p">],</span><span class="w"> </span><span class="c1"># month as a factor</span><span class="w">
</span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">month.abb</span><span class="p">))</span><span class="w">
</span><span class="n">pdat2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdat2</span><span class="p">,</span><span class="w">
</span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w"> </span><span class="c1"># upper and...</span><span class="w">
</span><span class="n">lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">))</span><span class="w"> </span><span class="c1"># lower confidence bounds</span></code></pre>
</figure>
<p>
The first plot weāll produce using these data is a plot of the trends faceted by <code>fMonth</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">pdat2</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fMonth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fMonth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># draw trend lines</span><span class="w">
</span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># minimal theme</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"none"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># no legend</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">expression</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="p">(</span><span class="n">degree</span><span class="o">*</span><span class="n">C</span><span class="p">)),</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">fMonth</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># facet on month</span><span class="w">
</span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">17</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">minor_breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="c1"># nicer ticks</span><span class="w">
</span><span class="n">p2</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/climate-change-and-spline-interactions-predict-plot-2-1.png" alt="Predicted trends in monthly temperature, 1914ā2014." />
<figcaption>
Predicted trends in monthly temperature, 1914ā2014.
</figcaption>
</figure>
<p>
The impression that most of the action is in the autumn and winter is again very apparent.
</p>
<h3 id="predict-trends-for-each-month-19142014-by-quarter">
Predict trends for each month, 1914ā2014, by quarter
</h3>
<p>
Another visualisation of the same predictions is to group the data by quarter/season. For that we set up a variable <code>Quarter</code> in the <code>pdat2</code> data frame and assign particular months to the seasons.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pdat2</span><span class="o">$</span><span class="n">Quarter</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="kc">NA</span><span class="w">
</span><span class="n">pdat2</span><span class="o">$</span><span class="n">Quarter</span><span class="p">[</span><span class="n">pdat2</span><span class="o">$</span><span class="n">nMonth</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">12</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">)]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"Winter"</span><span class="w">
</span><span class="n">pdat2</span><span class="o">$</span><span class="n">Quarter</span><span class="p">[</span><span class="n">pdat2</span><span class="o">$</span><span class="n">nMonth</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="m">3</span><span class="o">:</span><span class="m">5</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"Spring"</span><span class="w">
</span><span class="n">pdat2</span><span class="o">$</span><span class="n">Quarter</span><span class="p">[</span><span class="n">pdat2</span><span class="o">$</span><span class="n">nMonth</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="m">6</span><span class="o">:</span><span class="m">8</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"Summer"</span><span class="w">
</span><span class="n">pdat2</span><span class="o">$</span><span class="n">Quarter</span><span class="p">[</span><span class="n">pdat2</span><span class="o">$</span><span class="n">nMonth</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="m">9</span><span class="o">:</span><span class="m">11</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"Autumn"</span><span class="w">
</span><span class="n">pdat2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdat2</span><span class="p">,</span><span class="w">
</span><span class="n">Quarter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">Quarter</span><span class="p">,</span><span class="w">
</span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Spring"</span><span class="p">,</span><span class="s2">"Summer"</span><span class="p">,</span><span class="s2">"Autumn"</span><span class="p">,</span><span class="s2">"Winter"</span><span class="p">)))</span></code></pre>
</figure>
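<p>
A more compact way to achieve the same mapping, assuming the month-to-season assignment above, is to index a lookup vector (which I've called <code>seasons</code>) with <code>nMonth</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## season for months 1 through 12; indexing by nMonth does the assignment
seasons <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
             "Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter")
pdat2 <- transform(pdat2,
                   Quarter = factor(seasons[nMonth],
                                    levels = c("Spring", "Summer", "Autumn", "Winter")))</code></pre>
</figure>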
<p>
Then we facet on <code>Quarter</code>; as we now need a legend to help identify the months, we also do a little fiddling to get a nice name for it
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">pdat2</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fMonth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fMonth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># draw trend lines</span><span class="w">
</span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># minimal theme</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"top"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># legend on top</span><span class="w">
</span><span class="n">scale_fill_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Month"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># nicer legend title</span><span class="w">
</span><span class="n">scale_colour_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Month"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">expression</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="p">(</span><span class="n">degree</span><span class="o">*</span><span class="n">C</span><span class="p">)),</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_grid</span><span class="p">(</span><span class="n">Quarter</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">scales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free_y"</span><span class="p">)</span><span class="w"> </span><span class="c1"># facet by Quarter</span><span class="w">
</span><span class="n">p3</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/climate-change-and-spline-interactions-predict-plot-3-1.png" alt="Predicted trends in monthly temperature, 1914ā2014, by quarter." />
<figcaption>
Predicted trends in monthly temperature, 1914ā2014, by quarter.
</figcaption>
</figure>
<h2 id="summary">
Summary
</h2>
<p>
In this post I've looked at how we can fit smooth models with smooth interactions between two variables. This allows the smooth effect of one variable to vary as a smooth function of the second variable. This approach can be extended to additional variables as needed.
</p>
<p>
One of the things I'm not very happy with is the rather complex AR process in the model residuals. The AR(7) mopped up all the within-year residual autocorrelation, but it appears that there is a trade-off here between fitting a more complex seasonal smooth and fitting a more complex within-year AR process.
</p>
<p>
An important aspect that I havenāt covered in this post is whether the interaction model is an improvement in fit over a purely additive model of a trend in temperature with the same seasonal cycle superimposed. Iāll look at how we can do that in part 2.
</p>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
<a href="/2011/06/12/additive-modelling-and-the-hadcrut3v-global-mean-temperature-series/">here</a>, <a href="/2011/07/21/smoothing-temporally-correlated-data/">here</a>, and <a href="/2014/05/09/modelling-seasonal-data-with-gam/">here</a><a href="#fnref1" class="footnote-back">ā©</a>
</p>
</li>
<li id="fn2">
<p>
Note that this code assumes that samples are provided in the data in their time order <em>within</em> year. This is the case here, but if it isnāt, you could do <code>form = ~ nMonth | Year</code> to tell <code>gamm()</code> about the correct ordering.<a href="#fnref2" class="footnote-back">ā©</a>
</p>
</li>
<li id="fn3">
<p>
Iām just being lazy; I could fit these models in parallel with the <strong>parallel</strong> package, but Iām caching this code chunk so, mehā¦<a href="#fnref3" class="footnote-back">ā©</a>
</p>
</li>
<li id="fn4">
<p>
Yes, yes, yes, I know itās 101 yearsā¦<a href="#fnref4" class="footnote-back">ā©</a>
</p>
</li>
</ol>
</section>
User-friendly scaling
Gavin L. Simpson
2015-10-08T00:00:00-06:00
2015-10-08T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/10/08/user-friendly-scaling/
<p>
Back in the mists of time, whilst programming early versions of Canoco, Cajo ter Braak decided to allow users to specify how species and site ordination scores were scaled relative to one another via a simple numeric coding system. This was fine for the DOS-based software that Canoco was at the time; you entered <code>2</code> when prompted and you got <em>species</em> scaling, <code>-1</code> got you <em>site</em> or <em>sample</em> scaling <strong>and</strong> Hillās scaling or correlation-based scores depending on whether your ordination was a linear or unimodal method. This system persisted; even in the Windows era of Canoco these numeric codes can be found lurking in the <code>.con</code> files that describe the analysis performed. This use of numeric codes for scaling types was so pervasive that it was logical for Jari Oksanen to include the same system when the first <code>cca()</code> and <code>rda()</code> functions were written and in doing so Jari perpetuated one of the most frustrating things Iāve ever had to deal with as a user and teacher of ordination methods. But, as of last week, my frustration is no moreā¦
</p>
<p>
ā¦because we released a patch update to the CRAN version of <strong>vegan</strong>. Normally we donāt introduce new functionality in patch releases but the change I made to the way users can request ordination scores was pretty trivial and maintained backwards compatibility.
</p>
<p>
Previously, different scalings could be requested using the <code>scaling</code> argument. <code>scaling</code> is an argument of the <code>scores()</code> function; any function using <code>scores()</code> would either have <code>scaling</code> as a formal argument too, or would pass <code>scaling</code> on to <code>scores()</code> internally. Until now, the different scalings were specified, as per DOS-era Canoco, as numeric values. Now, <code>scores()</code> accepts either those same old numeric values or a character string for <code>scaling</code> coupled with a second logical argument. <strong>Vegan</strong> accepts the following character values to select the type of scaling:
</p>
<ul>
<li>
<p>
<code>"sites"</code>, which gives site-focussed scaling, equivalent to numeric value <code>1</code>
</p>
</li>
<li>
<p>
<code>"species"</code> (the default), which gives species- (variable-) focused scaling, equivalent to numeric value <code>2</code>
</p>
</li>
<li>
<p>
<code>"symmetric"</code>, which gives a so-called symmetric scaling, and is equivalent to numeric value <code>3</code>.
</p>
</li>
</ul>
<p>
To get negative versions of these values, the <code>correlation</code> or <code>hill</code> argument should be set to <code>TRUE</code> as follows
</p>
<ul>
<li>
<p>
<code>correlation</code> (default <code>FALSE</code>) for correlation-like scores for PCA/RDA/CAPSCALE models, or
</p>
</li>
<li>
<p>
<code>hill</code> (default <code>FALSE</code>) for Hillās scaling for CA/CCA models
</p>
</li>
</ul>
<p>
Whilst this requires the setting of two different arguments, itās certainly a lot easier to remember these two arguments than what the numerical codes mean.
</p>
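<p>
For example, the following pairs of calls are equivalent ways of requesting the same scores; here <code>ord</code> stands for a fitted <code>rda()</code> ordination, like the PCA in the example below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">scores(ord, scaling = 2)                              # old numeric code
scores(ord, scaling = "species")                      # new character equivalent
scores(ord, scaling = -2)                             # old negative code
scores(ord, scaling = "species", correlation = TRUE)  # new equivalent</code></pre>
</figure>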
<h3 id="obligatory-dutch-dune-meadows-example">
Obligatory Dutch dune meadows example
</h3>
<p>
Hereās a quick example of the new usage showing a PCA of the classic Dutch dune meadow data set.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"vegan"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: permute
Loading required package: lattice
This is vegan 2.3-1</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">data</span><span class="p">(</span><span class="n">dune</span><span class="p">)</span><span class="w">
</span><span class="n">ord</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rda</span><span class="p">(</span><span class="n">dune</span><span class="p">)</span><span class="w"> </span><span class="c1"># fit the PCA</span><span class="w">
</span><span class="n">layout</span><span class="p">(</span><span class="n">matrix</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">ord</span><span class="p">,</span><span class="w"> </span><span class="n">scaling</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"species"</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">ord</span><span class="p">,</span><span class="w"> </span><span class="n">scaling</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"species"</span><span class="p">,</span><span class="w"> </span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">layout</span><span class="p">(</span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/user-friendly-scaling-example-1.png" alt="PCA of the Dutch dune meadow data set. Both biplots are drawn using species scaling, but the one on the right standardizes the species scores." />
<figcaption>
PCA of the Dutch dune meadow data set. Both biplots are drawn using species scaling, but the one on the right standardizes the species scores.
</figcaption>
</figure>
<p>
The two biplots are based on the same underlying ordination and both focus the scaling on best representing the relationships between species (<code>scaling = "species"</code>), but the biplot on the right uses correlation-like scores. This has the effect of making the species have equal representation on the plot without doing the PCA with standardized species data (all species having unit variance).
</p>
ESA's publishing deal with Wiley
Gavin L. Simpson
2015-08-11T00:00:00-06:00
2015-08-11T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/08/11/notes-from-esa-council-on-wiley-deal/
<p>
One of the big announcements about the society made by ESA in the run up to the annual meeting in Baltimore this week was the news that ESA has chosen to partner with John Wiley & Sons as publisher of the society journals. At the time of the announcement few details about the deal or the process by which this decision was made were available. I was attending the ESA Council as the incoming Chair of the Paleoecology Section where some further details were provided and members of Council were able to ask questions about the deal. These are my notes from that meeting.
</p>
<h2 id="headlines">
Headlines
</h2>
<ul>
<li>
Deal brings financial stability for the society
</li>
<li>
Wiley will pay ESA a guaranteed royalty payment, annually
</li>
<li>
The deal allows for profit-sharing should Wiley increase subscriptions/income beyond some point (the specifics were not given)
</li>
<li>
All ESA members will receive electronic access to the society journals, and membership dues will <em>not</em> increase (beyond the usual annual 2% increase)
</li>
<li>
ESA members will have a print-on-demand option for those wanting hard copies
</li>
<li>
<em>Frontiers</em> will continue to be produced as a hard copy for all members
</li>
<li>
No hybrid Open Access option for papers (for now, Wiley & ESA will look at this going forward)
</li>
<li>
Page charges remain
</li>
<li>
<em>Ecosphere</em> charges are expected to remain the same, and it remains fully open access
</li>
<li>
Ecological Archives is likely to move to a new home; ESA and Wiley are currently discussing options, though moving the existing archives to Figshare is the main option being explored
</li>
<li>
Somewhere in Ithaca there is a single computer running DOS(!) that performs a critical part of the current journal publishing platform used by ESAā¦
</li>
<li>
In contrast, Wileyās publishing platform, whatever you might think of Scholar One or ReadCube, is light years better than EcoTrackā¦
</li>
</ul>
<h2 id="some-detail">
Some detail
</h2>
<p>
Council were expected to vote to approve, or not, the budget as presented by the VP Finance. The slides presented to Council to facilitate this included financial details of the Wiley deal. I asked whether these were for public consumption and had it confirmed that the numbers were public. The payment to ESA from Wiley in the 2015ā16 budget is <strong>$1,350,357</strong>. This number includes
</p>
<ul>
<li>
The royalty payment
</li>
<li>
An amount to cover some of ESAās costs with the journals (details of what this involved & the amount were either lacking or I didnāt catch them)
</li>
</ul>
<p>
This number is only half what the income will be each year as ESAās financial year runs JulyāJune and hence the 2015ā16 budget includes half a year of ESA self-publishing and half a year with Wiley publishing. I confirmed that in 2016ā17 the payment from Wiley will be <strong>$2,700,714</strong>, and that income from subscriptions and page charges will drop to zero at the same time.
</p>
<p>
I didnāt fully get down in my notes how the expenses/costs due to publications would change in 2016ā17; in the current year the picture is complicated because there are significant costs associated with migrating the journals to Wileyās platform. Therefore, I donāt know exactly what the anticipated āprofitā will be going forward. What is, I think, indicative is that the senior ESA staff and academics were clearly anticipating significant improvements in the āprofitā generated by the Societyās journals that can be directed towards activities the Society does on behalf of its members and its support for ecology.
</p>
<p>
There are still many details about the deal and the process that are not clear or not covered in the Council meeting. What was abundantly clear was that the people present that were involved in the Publications Transition Committee, the senior ESA staff, the President, were <strong>all</strong> clearly acting in the best interests of the Society when they set out to investigate options for the Societyās journals <em>and</em> when they made the decision to choose Wiley as publisher. From what I saw, the Committee has certainly secured the good financial stability of the Society for the immediate future.
</p>
The new Tri-agency open access policy
Gavin L. Simpson
2015-07-10T00:00:00-06:00
2015-07-10T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/07/10/tri-agency-open-access-policy/
<p>
Earlier this year the triumvirate of Canadian science funding bodies, the Natural Sciences and Engineering Research Council (NSERC), the Canadian Institutes of Health Research (CIHR) and the Social Sciences and Humanities Research Council of Canada (SSHRC) (collectively referred to as the Tri-Agencies), announced their new policy of open access to research publications. This followed a period of consultation, begun in the fall of 2013, with the science communities funded by the Tri-Agencies. The policy came into effect on May 1st this year (2015) and applies to all Tri-Agency-funded grants awarded from May 1st 2015 onward. As part of its awareness programme for the policy, the Tri-Agencies have been holding webinars to explain the new policy and allow for questions from researchers. In the main the Tri-Agency policy is pretty clear, judging by the questions from academics during the webinar session that I attended recently, but we can conclude one or both of two things: i) academics donāt read things unless they absolutely must, and ii) academics have some interesting views about open access, what it means for them, and what they consider as being good practice or complying with the new rules. I was asked after tweeting about this to summarise my notes from the webinar and on the Tri-Agency policy on open access in general.
</p>
<p>
In the main, the Tri-Agency <a href="http://www.science.gc.ca/default.asp?lang=En&n=F6765465-1">policy on open access</a> is pretty simple and, to be fair to the Tri-Agencies, they have done a good job of providing information and <a href="http://www.science.gc.ca/default.asp?lang=En&n=A30EBB24-1">FAQs</a> that would probably answer most questions a general academic might have regarding the policy. That is if people would just read what the Tri-Agencies have written and stop trying to nit-pick or special-case their particular question. That said, there are some areas where the policy is somewhat ambiguous and others where it is just plain missing.
</p>
<p>
But first, a summary of what the policy actually requires:
</p>
<ul>
<li>
Within 12 months of publication the peer-reviewed, final author-version of the manuscript <em>must</em> be freely available from the journal website or an approved institutional or disciplinary repository.
</li>
<li>
The policy applies to peer-reviewed research publications arising from Tri-Agency-funded grants awarded May 1st 2015 or later.
</li>
</ul>
<p>
The overriding issue here is ensuring that Canadians, be they members of the public, employees in industry, government officials, or academics, have access to the research outputs that the Canadian public have funded through their taxes. This is morally right; knowledge shouldnāt be locked away for the privileged few. But more than that it is the right thing to do economically and educationally for Canada. Industry canāt capitalise fully on the research paid for by Canadians if it is locked away behind exorbitant paywalls for example.
</p>
<p>
When the Tri-Agency speaks of research publications they exclusively mean peer-reviewed journal articles. So the policy doesnāt apply to research reports, monographs, book chapters, teaching materials etc. Just peer-reviewed papers. To be compliant, if you are in receipt of <em>new</em> research grant funding from one of the Tri-Agencies awarded May 1st 2015 onward, you are required to make freely available any peer-reviewed journal publications within 12 months of publication.
</p>
<p>
You can be compliant by following one of two routes
</p>
<ol type="1">
<li>
Deposit, within 12 months of publication, the final peer-reviewed (but not typeset) version of your manuscript in an approved repository. Iāll come back to what an approved repository is later, but ideally youād deposit the paper in your institutional repository, often run by your institutionās library, or a discipline-specific repository. This is the Green Open Access route. Or
</li>
<li>
Publish in an open access journal or a so-called hybrid journal that allows open access. This is the Gold Open Access route and provides immediate open and free access from the date of publication, but may require the payment of an article processing charge (APC). Note that not all journals charge APCs and that not all journals that do charge levy similarly-high APCs.
</li>
</ol>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/books-patrick-gothe-unsplash.jpg" alt="Institutional or disciplinary repositories are ideal places to deposit Green OA (Route 1) papers. (Source: Patrick Gƶthe, CC-0, Unsplash)" />
<figcaption>
Institutional or disciplinary repositories are ideal places to deposit Green OA (Route 1) papers. <br /> (Source: Patrick Gƶthe, CC-0, <a href="https://unsplash.com/photos/xiTFENI0dMY/download">Unsplash</a>)
</figcaption>
</figure>
<p>
Route 1 allows you to continue to publish in your traditional journals and doesnāt incur any additional cost to the researcher, but you do need to ensure that the journal you want to publish in doesnāt have an embargo longer than 12 months and that your institutionās repository meets the requirements of the publisher. Here Iām thinking of Elsevier and its regressive policies requiring various licensing terms or separate agreements. Your library staff can help you with this so go talk to them <em>before</em> you start writing your next paper as you may have to send it to a different journal to be compliant with the new policy. (You shouldnāt be publishing with Elsevier anyway, but that is a different story.)
</p>
<p>
Route 2 is my preferred option but despite what I and other OA activists will tell you about the majority of open access journals not charging APCs, in practice, the Gold route is going to cost you some money. How much money depends on where you want to publish; it will be upwards of US$3000 if you want to publish in traditional subscription journals that offer an open access option for example. Far cheaper options exist such as PLOS One, Scientific Reports, The PeerJ, to name but three. Critical points to note here though are
</p>
<ul>
<li>
You remain in charge of where you publish; journal choice is yours and yours alone (except for the embargo period!),
</li>
<li>
You donāt have to pay for Open Access; the Tri-Agencies <strong>do not</strong> require this of you, and
</li>
<li>
There is no <strong>additional</strong> funding from the Tri-Agencies to support the new policy
</li>
</ul>
<p>
As such, you remain largely in charge of where you can submit your papers for publication and you can choose route 1 (self-deposit) if you donāt want to or canāt afford to pay the APCs from your grant. You <em>are</em> allowed to pay APCs from your grant as an allowable expense, but youāll need to decide whether to pay them or use the money for something else.
</p>
<p>
What does <em>within 12 months of publication</em> mean? For the Route 1 option, the policy requires you to deposit the paper within 12 months of the actual in-print publication date. The clock doesnāt start running, according to the Tri-Agencies, until your paper has been included in an issue. That means that online early publication doesnāt count towards the 12 month period.
</p>
<p>
What constitutes an approved repository? This is one area where the policy is necessarily unclear if you start to wander off piste. Your best bet is to use your Institutional Repository ā this supports your institution as well as in general being compliant with the policy ā or a discipline-based repository (PubMed Canada for example). Speak to the people, often the library staff, that run your institutionās repository for more information or visit the <a href="http://carl-abrc.ca/en/scholarly-communications/carl-institutional-repository-program.html">Canadian Association of Research Libraries Institutional Repository Project: Online Resource Portal</a> for further guidance. If your institution doesnāt have a repository, donāt worry as you can use one of <a href="http://www.carl-abrc.ca/en/scholarly-communications/canadian-ir-repositories/adoptive-repositories.html">eight adoptive repositories</a> run by academic institutions that will accept submissions from academics beyond their host institution.
</p>
<p>
You may also be able to use your favourite online repository, but note that the repository canāt require anyone to sign up for an account just to access papers (so Research Gate is out), and it is important to understand the longevity of the repository and what its disaster plan is for submissions should it cease to operate. As there are so many of these online repositories, the Tri-Agencies cannot possibly provide an exhaustive list of approved ones, so they ask that people get in touch with the relevant Tri-Agency representative if they have questions about a specific repository.
</p>
<p>
Some common sense should help here; if you need to log in to read papers then the repository is not compatible with the policy. Putting your papers on your own website doesnāt constitute compliance either; Google may be pretty amazing at ferreting out resources on the web, but the Tri-Agencies take <em>discoverability</em> seriously, so if you do put papers on your website, as I do here, you also need to deposit them in an approved repository. If the repository is a commercial entity like Academia.edu, Research Gate, FigShare, etc., then you do need to check closely what the publishers allow you to do in this regard <em>if you have signed over copyright to them</em>. Elsevier, for example, requires a non-commercial licence and hosting on non-commercial terms, which could rule out many of these repositories as potential venues for your paper even if the Tri-Agencies considered them compliant.
</p>
<p>
Another area that the Tri-Agency policy is not clear on, is the licence terms that Gold Open Access (Route 2) papers should be made available under. The Agencies talk only about free availability but not freedom in terms of usage. Many publishers push researchers towards more restrictive licences (particularly with non-commercial clauses) rather than the accepted standard of the Creative Commons By Attribution (CC-BY) licence (or equivalent), which requires only that the original source and author be acknowledged. Academics should be wise to this and use the most permissive licence allowed (usually CC-BY) because other clauses, especially non-commercial ones, are a source of confusion (what does ācommercialā actually mean?!) and could exclude the wider benefits to Canadians, its industry and economy that the Tri-Agency Policy was developed to promote.
</p>
<p>
Some smaller points that cropped up during the webinar were:
</p>
<ul>
<li>
<strong>Compliance checking</strong>; at the moment you are required to comply with the policy upon your acceptance of funding. The Tri-Agencies are currently not doing any compliance checking but they are starting to think about what form such checks might take. I suspect it will be some years before compliance checking is commonplace, not least because the Tri-Agencies indicated theyād be consulting with the community on what form this should take also.
</li>
<li>
<strong>Data</strong>; apart from CIHR, the policy does not apply to data. Given that the policy only applies to peer-reviewed journal publications, that alone should have been enough to cover this point, but CIHR has long had a policy on Open Data so I guess NSERC and SSHRC needed to be explicit in this instance that they do not require Open Data. That said, Open Data is coming and I suspect it wonāt be that many years before NSERC and SSHRC start consulting on that too.
</li>
<li>
<strong>Grant applications and OA fees</strong>; you can include funds for APCs on your research grant proposals. The Tri-Agencies will be working with the review panels to make it clear that these are allowed. Your budget should be reasonable of course and reviewers will be asked to comment on that, but requesting funds for APCs will not count against your grant in the review stage. āReasonableā was the word that kept being used here.
</li>
<li>
<strong>NSERC Discovery grants</strong> announced but awarded prior to May 1st (in other words if you just received a Discovery Grant for the first time, or renewed one in the round just announced) are <strong>not</strong> covered by the new policy as they were awarded prior to May 1st 2015. I suppose this is because recipients applied for grants before details of the policy were announced so it is only fair to the recipients to not require compliance yet. Note also that the annual instalment of your Discovery Grant <strong>does not</strong> count as a new grant. So if you already have a Discovery Grant, that grant is not subject to the terms of the new policy. Only when you renew the Discovery Grant will you need to meet the requirements of the policy from that point on.
</li>
<li>
<strong>The policy is <em>not</em> retroactive</strong>; only new grants awarded May 1st 2015 or later will be subject to the policy.
</li>
<li>
<strong>The policy applies to <em>all</em> funded works</strong>, even if you are working with colleagues not covered by the policy. If your contribution is from a grant covered by it, you must comply with the policy.
</li>
</ul>
<p>
Finally, the <a href="http://www.carl-abrc.ca/en/scholarly-communications/resources-for-authors.html#addendum">SPARC Canadian Author Addendum</a> was mentioned several times. I donāt know why Canadian academics have their own specific version of the standard SPARC Author Addendum, but regardless, this is something you should use to retain your rights to your work. What normally happens when you agree with a publisher to publish your paper in a journal is that you sign over your copyright in the work to the publisher. This is starting to change with some publishers, whereby you donāt transfer copyright; you retain it and instead provide the publisher with an exclusive licence to publish the work. Both of these place restrictions on what you can and canāt do with your own research papers, but signing over copyright limits you the most.
</p>
<p>
What the SPARC Author Addendum does is provide you with a standard form that you return alongside the copyright transfer or other agreement with the publisher, which indicates that the publisher, in agreeing to publish your paper, also agrees to your retaining some key rights, including but not restricted to
</p>
<ol type="1">
<li>
the right to reproduce the article for non-commercial purposes,
</li>
<li>
the right to prepare derivative works (i.e.Ā use figures in other works),
</li>
<li>
the right to allow others to reproduce the work under non-commercial terms.
</li>
</ol>
<p>
In returning the SPARC addendum, you also require that the publisher provide you with a PDF of the publisher version of record which is not encumbered by security or other DRM. Thereās even a clause in the addendum that says that if the publisher doesnāt respond to the addendum and publishes your paper, they have agreed to the terms presented in the Addendum.
</p>
<p>
Regardless of whether you now come under the Tri-Agency Open Access Policy or not, retaining rights using the SPARC Canadian Author Addendum should be something we do anyway.
</p>
<p>
Whatever you do now, if you are unsure about something regarding the Tri-Agency policy on open access, <a href="http://www.science.gc.ca/default.asp?lang=En&n=F6765465-1">go and read it</a> (it is short) and read through the <a href="http://www.science.gc.ca/default.asp?lang=En&n=A30EBB24-1">FAQs</a>. Speak to people at your institutionās library, and if you still have questions get in touch with the relevant member of the Tri-Agencies.
</p>
<p>
Even if you arenāt yet required to make your papers available under the terms of the policy, it would be a good use of time to start getting yourself and the other members of your lab into the habit of depositing research publications and familiarising yourselves with open access licences and what restrictions are in place for journals you traditionally publish in etc. <em>before</em> you are required to deposit your works.
</p>
<p>
If Iāve messed something up here or if anything remains unclear, Iāll do my best to answer any questions in the comments or correct the information stated above.
</p>
My aversion to pipes
Gavin L. Simpson
2015-06-03T00:00:00-06:00
2015-06-03T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/06/03/my-aversion-to-pipes/
<p>
At the risk of coming across as even more of a curmudgeonly old fart than people already think I am, I really do dislike the current vogue in R that is the pipe family of binary operators; e.g.Ā <code>%>%</code>. Introduced by Hadley Wickham and popularised and advanced via the <a href="http://cran.r-project.org/web/packages/magrittr/index.html"><strong>magrittr</strong> package</a> by Stefan Milton Bache, the basic idea brings the forward pipe of the F# language to R. At first, I was intrigued by the prospect and initial examples suggested this might be something I would find useful. But as time has progressed and Iāve seen the use of these pipes spread, Iāve grown to dislike the idea altogether. Here I outline why.
</p>
<p>
The forward pipe operator is designed, in R at least (Iām not familiar with F#), to avoid the sort of nested/inline R code of the type shown below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">the_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">transform</span><span class="p">(</span><span class="n">subset</span><span class="p">(</span><span class="n">read.csv</span><span class="p">(</span><span class="s1">'/path/to/data/file.csv'</span><span class="p">),</span><span class="w">
</span><span class="n">variable_a</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="n">x</span><span class="p">),</span><span class="w">
</span><span class="n">variable_c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable_a</span><span class="o">/</span><span class="n">variable_b</span><span class="p">),</span><span class="w">
</span><span class="m">100</span><span class="p">)</span></code></pre>
</figure>
<p>
replacing that awful mess with
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">the_data</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">read.csv</span><span class="p">(</span><span class="s1">'/path/to/data/file.csv'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">subset</span><span class="p">(</span><span class="n">variable_a</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">transform</span><span class="p">(</span><span class="n">variable_c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable_a</span><span class="o">/</span><span class="n">variable_b</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="m">100</span><span class="p">)</span></code></pre>
</figure>
<p>
And when compared against one another like that, who wouldnāt rejoice at the prospect of a pipe to banish such awful R code to distant memory? The problem with this comparison though is, <em>who writes code like that in the first code block</em>? I donāt think Iāve <em>ever</em> written code like that, even when I was a very green useR around the turn of the century.
</p>
<p>
When you compare the pipe version with how Iād lay out the R code
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">the_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="s1">'/path/to/data/file.csv'</span><span class="p">)</span><span class="w">
</span><span class="n">the_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">subset</span><span class="p">(</span><span class="n">the_data</span><span class="p">,</span><span class="w"> </span><span class="n">variable_a</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="n">the_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">the_data</span><span class="p">,</span><span class="w"> </span><span class="n">variable_c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable_a</span><span class="o">/</span><span class="n">variable_b</span><span class="p">)</span><span class="w">
</span><span class="n">the_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">the_data</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w"> </span><span class="c1"># I'm perplexed as to why this would be a good thing to do?</span></code></pre>
</figure>
<p>
the benefits of the pipe remain but they arenāt, at least in my opinion, as compelling. My version is verbose; I repeatedly overwrite the <code>the_data</code> object with subsequent operations. Rather than writing <code>the_data</code> once, as in the pipe version, Iād write it 7 times! But that said, I could pass my version to a relative novice useR and theyād have a reasonable grasp of what the code did. I donāt think the same could be said for the pipe version.
</p>
<p>
But all that really doesnāt matter, does it? Itās personal preference as to how you choose to write your data analysis and manipulation R script code. If you find it easier to write code and then read it back using the pipe operator, all power to you.
</p>
<p>
Where I think it does make a difference is where you are
</p>
<ol type="1">
<li>
writing code to go into an R package for general consumption on say CRAN, or
</li>
<li>
writing example material for your package in a vignette or similar document.
</li>
</ol>
<p>
I donāt claim that these are the only problem areas nor that these are universally accepted. I wager Iām in the majority position at the moment, but that is probably down to the relatively recent arrival of the pipe on the R scene.
</p>
<p>
Why is the pipe a problem if you are writing code to go into a general purpose R package that you expect users to abuse with their own data in their own code? Two reasons. The pipe operator involves the <a href="http://adv-r.had.co.nz/Computing-on-the-language.html">standard non-standard evaluation</a> (NSE) paradigm. The pipe captures expressions on each side of the <code>%>%</code> operator and then arranges for the thing on the left of <code>%>%</code> to be injected into the expression on the right of <code>%>%</code>, usually as the first argument but not always. This all involves capturing the expressions and evaluating them within the <code>%>%()</code> function.
</p>
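<p>
As a concrete illustration (mine, not from the <strong>magrittr</strong> documentation), the left-hand side is injected as the first argument of the call on the right-hand side unless a <code>.</code> placeholder redirects it:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("magrittr") # provides %>%

mtcars %>% head(2) # equivalent to head(mtcars, 2)
10 %>% seq_len()   # equivalent to seq_len(10)

## a `.` placeholder redirects the injection away from the first argument
letters[1:3] %>% paste0("prefix_", .) # equivalent to paste0("prefix_", letters[1:3])</code></pre>
</figure>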
<p>
OK, isnāt that what all functions using a formula do, or what <code>transform()</code>, <code>subset()</code>, <em>et al.</em> do? Well yes, and this is where my spider sense starts tingling. Who among us hasnāt had those things fail on us when we dropped them into an <code>lapply()</code> inside an anonymous function? Or wrapped those functions as part of a package function, only for some user to execute your function in a way you didnāt envisage? Now Hadley assures us that there is a correct way to do NSE and he even has a package for that, <a href="http://cran.r-project.org/web/packages/lazyeval/"><strong>lazyeval</strong></a>. But still I have my reservations, despite Stefanās attempts to allay my fears
</p>
<blockquote class="twitter-tweet" lang="en">
<p lang="en" dir="ltr">
<a href="https://twitter.com/ucfagls">@ucfagls</a> <a href="https://twitter.com/kevin_ushey">@kevin_ushey</a> <a href="https://twitter.com/JennyBryan">@JennyBryan</a> <a href="https://twitter.com/noamross">@noamross</a> <a href="https://twitter.com/Voovarb">@Voovarb</a> so far none have. Youāre welcome to reopen the github issue if you have examples.
</p>
ā Stefan Milton Bache (<a href="https://twitter.com/stefanbache">@stefanbache</a>) <a href="https://twitter.com/stefanbache/status/603924900135510016">May 28, 2015</a>
</blockquote>
<p>
OK, letās assume Stefan and Hadley know what they are doing (and I invariably do) and the NSE used here really is safe. That still leaves the major problem I have with writing R code like this in package functions; how do you read it, parse it, and understand what it does? How do you track down a bug in the code and where it occurs if several steps are conflated into a single pipe chain? Iām not a pipe smoker so Iāll have to guess; you undo the chain and see where things break (see the sketch after the tweet below). Wouldnāt it have been easier to just write out the steps in the first place? That way the debugger can just step through the statements line by line as youāve written them. Iām not alone in having concerns in this general area
</p>
<blockquote class="twitter-tweet" lang="en">
<p lang="en" dir="ltr">
<a href="https://twitter.com/daattali">@daattali</a> <a href="https://twitter.com/emhrt_">@emhrt_</a> <a href="https://twitter.com/ucfagls">@ucfagls</a> <a href="https://twitter.com/noamross">@noamross</a> <a href="https://twitter.com/recology_">@recology_</a> <a href="https://twitter.com/JennyBryan">@JennyBryan</a> <a href="https://twitter.com/Voovarb">@Voovarb</a> my main worry is that it makes errors harder to understand
</p>
ā Hadley Wickham (<a href="https://twitter.com/hadleywickham">@hadleywickham</a>) <a href="https://twitter.com/hadleywickham/status/603883121197514752">May 28, 2015</a>
</blockquote>
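<p>
To make that concrete, here is a minimal sketch of my own (with made-up data and column names) of what undoing a chain looks like in practice:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("magrittr")
df <- data.frame(a = c(-1, 2, 3), b = c(1, 4, 2))

## the chained version: if transform() misbehaves, which step is at fault?
res <- df %>% subset(a > 0) %>% transform(ratio = a / b) %>% head(10)

## the un-chained version: each step is a statement the debugger can stop on
step1 <- subset(df, a > 0)
step2 <- transform(step1, ratio = a / b)
res   <- head(step2, 10)</code></pre>
</figure>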
<p>
I suppose a lot of this will come down to how well you grok pipes and how well you understand your actual code.
</p>
<p>
OK, enough of that; on to problem area number 2. I was recently helping a <a href="http://stackoverflow.com/q/30489799/429846">StackOverflow user massage some output</a> from a <strong>vegan</strong> function into a format suitable for plotting with <strong>ggplot2</strong>. There, the aim was to go from this:
</p>
<pre><code>Group.1 S.obs se.obs S.chao1 se.chao1
Cliona celata complex 499.7143 59.32867 850.6860 65.16366
Cliona viridis 285.5000 51.68736 462.5465 45.57289
Dysidea fragilis 358.6667 61.03096 701.7499 73.82693
Phorbas fictitius 525.9167 24.66763 853.3261 57.73494</code></pre>
<p>
to this:
</p>
<pre><code> Group.1 var S se
1 Cliona celata complex chao1 850.6860 65.16366
2 Cliona celata complex obs 499.7143 59.32867
3 Cliona viridis chao1 462.5465 45.57289
4 Cliona viridis obs 285.5000 51.68736
5 Dysidea fragilis chao1 701.7499 73.82693
6 Dysidea fragilis obs 358.6667 61.03096
7 Phorbas fictitius chao1 853.3261 57.73494
8 Phorbas fictitius obs 525.9167 24.66763</code></pre>
<p>
(or at least something pretty close to it) so that the required <em>dynamite plot</em> (yes, yes, I know!) could be produced.
</p>
<p>
A little fiddling with <strong>reshape2</strong> suggested this wasnāt something that it would handle gracefully (I may well be wrong here; Iām not familiar with that particular package), and having recalled some details of Hadleyās <strong>tidyr</strong> package I felt that it would be more suited to the problem at hand. Not having used <strong>tidyr</strong>, I proceeded to CRAN to grab the manual and look at any vignettes that might help me understand how to solve this particular problem. Thankfully, Hadley is a conscientious R package maintainer and there was a rather nice <a href="http://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html">HTML-rendered version of the vignette</a> right there on CRAN for me to peruse. The only downside to this was that all the example code used pipes.
</p>
<p>
The very first usage example is (or was, depending on when you are reading this)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">tidyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">preg2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">preg</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">gather</span><span class="p">(</span><span class="n">treatment</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">treatmenta</span><span class="o">:</span><span class="n">treatmentb</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">treatment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s2">"treatment"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">treatment</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">treatment</span><span class="p">)</span><span class="w">
</span><span class="n">preg2</span></code></pre>
</figure>
<p>
Innocuous enough I guess, until you realise that Iām also reading the manual, which has usage that doesnāt involve pipes, and that Hadley isnāt naming the arguments in the calls here. Now I am having to grok what is being passed, and where, by the pipes, whilst trying to match the usage shown in the example snippet with the arguments in the manual. I might be old-school but yes, I do read the manual.
</p>
<p>
The point Iām trying to make here with my little anecdote is this; what point did the use of the pipe serve here? How am I as a user new to the package helped by Hadley also using the pipe? In my case I wasnāt; in fact it made it somewhat trickier to understand what went where, what the actual <strong>tidyr</strong> calls were etc. Now I fully understand that Hadley finds the pipe operator to be very expressive for data analysis, and who am I to argue with that? Where I would raise an issue is that if you are writing introductory example code, donāt force your users to have to grapple with two new concepts at once, at least not in the first few examples.
</p>
<p>
I donāt want to beat on Hadley over this; itās just that this was a prime example of where the use of the pipe was obfuscatory not revelatory, for me at least.
</p>
<p>
So yes, I am a curmudgeonly old fart, but this old dog can learn new tricks. Convince me Iām wrong here, because I really do want to like the pipe; my Granddad smoked one and I have fond memories of the smell and, well, all the cool kids are using the pipe so it must be good, right?
</p>
Something is rotten in the state of Denmark
Gavin L. Simpson
2015-06-02T00:00:00-06:00
2015-06-02T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/06/02/something-rotten/
<p>
On Twitter and elsewhere there has been much wailing and gnashing of teeth for some time over one particular aspect of the R ecosphere: <a href="http://cran.r-project.org/">CRAN</a>. Iām not here to argue that everything is peachy ā far from it in fact ā but I am going to argue that the problems we face <em>do not</em> begin and end with CRAN or one or more of its maintainers.
</p>
<p>
Before I let rip, in writing this I am not attempting to gloss over or otherwise dismiss the real complaints from those that feel that they have been harassed by responses from a CRAN maintainer. Itās not my place to address those issues, but rather something that the R Foundation should be handling. If true, and I have no reason to doubt the claims, there is no place for such treatment of individuals, no matter their transgression. Did you hear me? Ok, with that said, here goes.
</p>
<p>
For all the good that there is in the R community, one part of the rot that exists is with package authors. Not all package authors, mind. Just a few package authors. Some of those same people seem very vocal on Twitter and elsewhere about the perceived problems and question why CRAN has the temerity to uphold them to some quality standards. The rot, or at least a not-insignificant part of it, is those package authors that donāt give a crap about the quality of their submissions or those that donāt think the rules apply to them.
</p>
<p>
There is nothing mystical or random about getting a package on to CRAN. You create your package following the guidelines & advice in <a href="http://cran.r-project.org/doc/manuals/r-patched/R-exts.html">Writing R Extensions</a> (WRE), which, whilst verbose in places it doesnāt need to be, at least includes most of the relevant information if people would just bother to read it. I hear complaints about it being some hundred and odd pages and that people donāt have time to read it. Wait, you donāt have time to read the documentation that is provided, but then get all bent out of shape when a <em>volunteer</em> CRAN maintainer calls you on your lack of effort?
</p>
<p>
Big chunks of WRE donāt apply to most packages; not including C, C++, or FORTRAN code in your package? Great, ignore the 60% of the manual that doesnāt apply to you. By my reckoning there are on the order of 70 sparse pages that cover all you need to know about writing an R package, conveniently listed in the first 2 chapters of WRE. Add 2 more pages if you want to write new generic functions and methods. How many of those complaining read the 100-odd pages of Hadley Wickhamās <a href="http://shop.oreilly.com/product/0636920034421.do">R Packages book</a> (or the equivalent <a href="http://r-pkgs.had.co.nz/">web/HTML version</a>)?
</p>
<p>
That information, those 70 pages, is what most package authors need. Yes, OK, some people will be proficient programmers writing interfaces to compiled code whoāll need to read the other 60%, but I sure as hell hope they do read it because Iād really appreciate it if their compiled code didnāt segfault my R session just because I had the nerve to use their package.
</p>
<p>
If youāve gotten your code this far, you should have a reasonably functioning package. Next step is to do what WRE tells you and run <code>R CMD check --as-cran</code> and <code>R CMD INSTALL</code> on the <strong>tarball</strong> (i.e.Ā on the thing produced by <code>R CMD build</code>, <strong>not</strong> your source tree). If there are <strong>any</strong> issues here, fix them or make a note to tell CRAN about the issue and why it is either a false positive or nothing to worry about. This is important! A lot of what CRAN does is manual; help them by telling them why they shouldnāt worry that your package generated 3 NOTEs. You probably want to check this on at least two OSes (Linux and Windows would be ideal) and under the current R release and a recent build of R-devel. The latter may be a bit of a pain but you only really need to do this when you are doing pre-flight checks for CRAN, not at all stages of development. Using the <a href="http://win-builder.r-project.org/">Win-builder service</a> run by Uwe Ligges will cover Windows and R-devel on that OS to boot. Using a continuous integration service like Travis CI or Appveyor can help with testing on Linux/OS X and Windows respectively. Using these fancy new tools isnāt <em>that</em> technical, difficult, or insurmountable; if you are building a package in the first place you already have access to one test system, and Win-builder gives you another, for free, and you get the R-devel ribbon on top!
</p>
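<p>
In concrete terms, for a hypothetical source directory <code>mypkg</code>, the pre-flight sequence looks something like this (the exact tarball name depends on the <code>Version</code> field of your <code>DESCRIPTION</code> file):
</p>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">R CMD build mypkg                        # produces the tarball, e.g. mypkg_0.1-0.tar.gz
R CMD check --as-cran mypkg_0.1-0.tar.gz # check the tarball, not the source tree
R CMD INSTALL mypkg_0.1-0.tar.gz         # confirm the tarball installs cleanly</code></pre>
</figure>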
<p>
Having done all of that, you need to read the CRAN policy for submissions. And re-read it. And read it <em>each</em> and <em>every</em> time you submit a package to CRAN, not just on the first occasion. It changes from time to time to reflect tightening of the policy or to accommodate changes in R and the checking system.
</p>
<p>
That done, you should be good to go. But, yes sometimes youāll have overlooked something, or your test systems werenāt configured in the same way as CRANās were. Or something else. If youāve read the policy and followed WRE, then this is the one place where some error might creep in. But you know what, if CRAN tells you to single-quote some words, or title case your <code>Title</code> tag, or put a period on the end of the <code>Description</code> tag, or something else, just fix the damn problem and get on with life. You might think this is petty, and from some points of view it probably is petty, but CRAN donāt and if you want your package on CRAN then you have to follow their rules! It really is that simple. You donāt like it? Youāre welcome to a refund and can always set up a competing repository yourself.
</p>
<p>
Some package authors complain that CRAN is sweating the insignificant details at the expense of letting through compliant-but-pointless or crap-or-broken packages. Oliver Keyes recently commented on Twitter
</p>
<blockquote class="twitter-tweet" lang="en" align="center">
<p lang="en" dir="ltr">
Let me say a big thank you to BDR for shouting at me for not using single quotes but not noticing fundamentally broken vignettes
</p>
ā Oliver Keyes (<a href="https://twitter.com/quominus">@quominus</a>) <a href="https://twitter.com/quominus/status/605025216222265344">May 31, 2015</a>
</blockquote>
<p>
complaining about being asked to single-quote something or other whilst broken vignettes seemingly languish on CRAN (itās not immediately clear whether Oliver was referring to his own vignettes or something from another package). The implication here seems to be that BDR is somehow being remiss in pointing out one transgression of the rules whilst simultaneously allowing other, more serious, transgressions. This is invariably a false argument of course. If BDR #sweat[s]theshitthatdoesntmatter (as <a href="https://twitter.com/recology_">@recology_</a> succinctly put it), he sure as hell isnāt letting an obviously ā visibly ā broken vignette through the pearly gates now is he? Of course not!
</p>
<p>
If there are broken vignettes/packages on CRAN there are two reasons
</p>
<ol type="1">
<li>
the author of the package doesnāt care about fixing the vignette/package but it is broken in a non-obvious-to-CRAN way, or
</li>
<li>
the package authorās vignette/package has broken due to recent changes in dependencies or R (OK, I guess thereās a 3rd optionā¦
</li>
<li>
the package author doesnāt know and, you know what, perhaps a friendly note to tell them would suffice)
</li>
</ol>
<p>
If the reality is 2. and the package is still on CRAN, then be thankful that CRAN is probably allowing a period of grace for the package author to fix the problem. Can you imagine the cacophony of wailing from the twitteRati if CRAN pulled their packages as soon as <code>R CMD check</code> threw an error? You might mistake such an event for the raptureā¦
</p>
<p>
If the problem is 1. then what do you want BDR or CRAN to do about it? They get lambasted for too much reliance on manual checks and then when their automated checks fail to catch a problem theyāre damned again! If the problem is 1. then the ire should be directed at the respective package author, not CRAN. The problem of broken vignettes etc is not something CRAN can do much about; that contribution to the rottenness lies squarely at the feet of R package authors.
</p>
<p>
I donāt know for sure, but I can see reasons for CRAN wanting to improve the way packages are presented and described on CRANās website. I suspect they sweat these seemingly trivial details because if package authors get those things right, theyāre probably conscientious enough to make sure there arenāt other, undetectable-to-CRAN problems with their package. We should be lauding this attention to detail if the effort involved in quoting a few words and changing the case of some title or other is what stops idiots from throwing up whatever it was they ate for lunch into a tarball destined for CRAN.
</p>
<p>
If you follow the <a href="http://dirk.eddelbuettel.com/cranberries/">CRANberries package feed</a> youāll be amazed at the number of packages that get yanked from CRAN; invariably because some problem <em>was</em> found later in their package or changes to R broke the package and the author failed to sort the problem. This all has to be handled responsibly by CRAN because they invariably have a legal obligation to continue to make the sources for those removed packages available for download. Dumping this garbage in a responsible manner, with a human-negotiated time interval within which a problem can be fixed (note the cacophony point above), is not a trivial exercise. Raising the barrier to entry for R packages shipped via CRAN is, in my not so humble opinion, a good thing if it weeds out those that canāt be arsed with the effort involved in jumping through the ever-shifting hoops of WRE and CRANās policy.
</p>
<p>
So, yes there is a problem in the R community. Itās just not the entity that you all thought was the problem, at least not entirely. If the rot has set in, if the sickness has infected the community, package authors are very much partly to blame. There is no secret sauce to getting a package on to CRAN, despite what some people might think or claim. The only cure to the sickness is to sweat the detail, read the documentation, do what CRAN says. If you donāt like that, then go play somewhere else. There are hundreds, if not thousands, of package authors that have successfully navigated the treacherous waters that lie before CRANās safe harbour. You know what each of these package authors has in common? They (eventually) read the documentation and played by the rules. What makes you so special that you should get a free pass on that?
</p>
Drawing rarefaction curves with custom colours
Gavin L. Simpson
2015-04-16T00:00:00-06:00
2015-04-16T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/04/16/drawing-rarefaction-curves-with-custom-colours/
<p>
I was sent an email this week by a <strong>vegan</strong> user who wanted to draw rarefaction curves using <code>rarecurve()</code> but with different colours for each curve. The solution to this one is quite easy as <code>rarecurve()</code> has argument <code>col</code> so the user could supply the appropriate vector of colours to use when plotting. However, they wanted to distinguish all 26 of their samples, which is certainly stretching the limits of perception if we only used colour. Instead we can vary other parameters of the plotted curves to help with identifying individual samples.
</p>
<p>
To illustrate, Iāll use the Barro Colorado Island data set <code>BCI</code> that comes with <strong>vegan</strong>. I just take the first 26 samples as this was the data set size my correspondent indicated they had available.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"vegan"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: permute
Loading required package: lattice
This is vegan 2.2-1</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">data</span><span class="p">(</span><span class="n">BCI</span><span class="p">,</span><span class="w"> </span><span class="n">package</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"vegan"</span><span class="p">)</span><span class="w">
</span><span class="n">BCI2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">BCI</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">26</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="n">raremax</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">rowSums</span><span class="p">(</span><span class="n">BCI2</span><span class="p">))</span><span class="w">
</span><span class="n">raremax</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 340</code></pre>
</figure>
<p>
<code>raremax</code> is the minimum sample count achieved over the 26 samples. We will rarefy the sample counts to this value.
</p>
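<p>
As a quick aside, if you want the rarefied richness values themselves rather than the drawn curves, <strong>vegan</strong>’s <code>rarefy()</code> computes the expected number of species in random subsamples of a given size; a minimal example:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## expected species richness in random subsamples of size raremax
Srare <- rarefy(BCI2, raremax)
head(Srare)</code></pre>
</figure>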
<p>
To set up the parameters we might use for plotting, <code>expand.grid()</code> is a useful helper function
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">col</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="s2">"darkred"</span><span class="p">,</span><span class="w"> </span><span class="s2">"forestgreen"</span><span class="p">,</span><span class="w"> </span><span class="s2">"orange"</span><span class="p">,</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="s2">"yellow"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hotpink"</span><span class="p">)</span><span class="w">
</span><span class="n">lty</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"solid"</span><span class="p">,</span><span class="w"> </span><span class="s2">"dashed"</span><span class="p">,</span><span class="w"> </span><span class="s2">"longdash"</span><span class="p">,</span><span class="w"> </span><span class="s2">"dotdash"</span><span class="p">)</span><span class="w">
</span><span class="n">pars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">col</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lty</span><span class="p">,</span><span class="w"> </span><span class="n">stringsAsFactors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">pars</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> col lty
1 black solid
2 darkred solid
3 forestgreen solid
4 orange solid
5 blue solid
6 yellow solid</code></pre>
</figure>
<p>
Then we can call <code>rarecurve()</code> as follows with the new graphical parameters
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">out</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pars</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">26</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w">
</span><span class="n">rarecurve</span><span class="p">(</span><span class="n">BCI2</span><span class="p">,</span><span class="w"> </span><span class="n">step</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">sample</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">raremax</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">col</span><span class="p">,</span><span class="w">
</span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lty</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/drawing-rarefaction-curves-with-custom-coloursrarecurve-1-1.png" alt="First attempt at rarefaction curves with custom colours." />
<figcaption>
First attempt at rarefaction curves with custom colours.
</figcaption>
</figure>
<p>
Note that I saved the output from <code>rarecurve()</code> in object <code>out</code>. This object contains everything we need to draw our own version of the plot if we wish. For example, we could use fewer colours and alter the line thickness<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> instead to make up the required number of combinations.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">col</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="s2">"darkred"</span><span class="p">,</span><span class="w"> </span><span class="s2">"forestgreen"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hotpink"</span><span class="p">,</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">)</span><span class="w">
</span><span class="n">lty</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"solid"</span><span class="p">,</span><span class="w"> </span><span class="s2">"dashed"</span><span class="p">,</span><span class="w"> </span><span class="s2">"dotdash"</span><span class="p">)</span><span class="w">
</span><span class="n">lwd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">pars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">col</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lty</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwd</span><span class="p">,</span><span class="w">
</span><span class="n">stringsAsFactors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">pars</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> col lty lwd
1 black solid 1
2 darkred solid 1
3 forestgreen solid 1
4 hotpink solid 1
5 blue solid 1
6 black dashed 1</code></pre>
</figure>
<p>
Using the information in <code>out</code> returned by <code>rarecurve()</code> we can get almost the same plot using the following code to draw the elements by hand
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">Nmax</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">out</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="nf">attr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="s2">"Subsample"</span><span class="p">)))</span><span class="w">
</span><span class="n">Smax</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">out</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Nmax</span><span class="p">)),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Smax</span><span class="p">)),</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Sample Size"</span><span class="p">,</span><span class="w">
</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Species"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="n">abline</span><span class="p">(</span><span class="n">v</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">raremax</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_along</span><span class="p">(</span><span class="n">out</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">attr</span><span class="p">(</span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]],</span><span class="w"> </span><span class="s2">"Subsample"</span><span class="p">)</span><span class="w">
</span><span class="n">with</span><span class="p">(</span><span class="n">pars</span><span class="p">,</span><span class="w"> </span><span class="n">lines</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]],</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">col</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lty</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwd</span><span class="p">[</span><span class="n">i</span><span class="p">]))</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/drawing-rarefaction-curves-with-custom-coloursplot-custom-rarecurves-1.png" alt="Second attempt at rarefaction curves with custom colours and plotting." />
<figcaption>
Second attempt at rarefaction curves with custom colours and plotting.
</figcaption>
</figure>
<p>
Having done this, I don’t believe this is a useful graphic because we’re trying to distinguish between too many samples using graphical parameters. Where I do think this sort of approach might work is if the samples in the data set come from a few different groups and we want to colour the curves by group.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">col</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"darkred"</span><span class="p">,</span><span class="w"> </span><span class="s2">"forestgreen"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hotpink"</span><span class="p">,</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">grp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">sample</span><span class="p">(</span><span class="nf">seq_along</span><span class="p">(</span><span class="n">col</span><span class="p">),</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">BCI2</span><span class="p">),</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="n">cols</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">col</span><span class="p">[</span><span class="n">grp</span><span class="p">]</span></code></pre>
</figure>
<p>
The code above creates a grouping factor <code>grp</code> for illustration purposes; in real analyses you’d have this already as a factor variable somewhere in your data. We also have to expand the <code>col</code> vector, because we are plotting each line separately in a loop. The plot code, reusing elements from the previous plot, is shown below:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Nmax</span><span class="p">)),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Smax</span><span class="p">)),</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Sample Size"</span><span class="p">,</span><span class="w">
</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Species"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="n">abline</span><span class="p">(</span><span class="n">v</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">raremax</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_along</span><span class="p">(</span><span class="n">out</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">attr</span><span class="p">(</span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]],</span><span class="w"> </span><span class="s2">"Subsample"</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]],</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/drawing-rarefaction-curves-with-custom-coloursplot-custom-rarecurves-2-1.png" alt="An attempt at rarefaction curves output with custom colours per groups of curves." />
<figcaption>
An attempt at rarefaction curves output with custom colours per groups of curves.
</figcaption>
</figure>
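<p>
If I were using this grouped version for real, I’d also add a key for the groups. A minimal sketch, run after the plotting loop above; the labels here are just the levels of the illustrative <code>grp</code> factor:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## add a legend mapping each group level to its line colour
legend("bottomright", legend = paste("Group", levels(grp)),
       col = col, lty = "solid", bty = "n")</code></pre>
</figure>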
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
We can’t use the approach outlined in this example to vary <code>lwd</code> because of the way <code>rarecurve()</code> draws the individual curves, in a loop. We have no way to tell <code>rarecurve()</code> to use the <em>i</em>th element of a vector of <code>lwd</code> values.<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
At the frontiers of palaeoecology
Gavin L. Simpson
2015-03-31T00:00:00-06:00
2015-03-31T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/03/31/at-the-frontiers-of-palaeoecology/
<p>
A couple of weeks ago, I had the pleasure of attending and participating in a symposium held to honour John Birks as he retires from the University of Bergen and becomes Professor Emeritus. The symposium, titled “At the Frontiers of Palaeoecology”, took place on 19–20th March in Bergen, Norway, and was a wonderful mix of colleagues old and new discussing John’s contributions to the field of palaeoecology and their collaborations with him. Alongside this reminiscing were several presentations describing new areas of research by colleagues and collaborators of John.
</p>
<p>
I gave a talk on the first day, which was of the latter type. I made the case for the wider use among palaeoecologists of modern statistical methods that allow us to handle palaeoecological data as time series. For the most part, limited consideration has been given to the temporal aspects of stratigraphic data, principally because classical time series methods assume equally spaced observations and our dating methods come with considerable errors attached.
</p>
<p>
The slides from my presentation are available via <a href="http://doi.org/10.6084/m9.figshare.1354040">Figshare</a>.
</p>
Harvesting Canadian climate data
Gavin L. Simpson
2015-01-14T00:00:00-06:00
2015-01-14T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/01/14/harvesting-canadian-climate-data/
<p>
In December I found myself helping one of our graduate students with a data problem; for one of their thesis chapters they needed a lot of hourly climate data for a handful of stations around Saskatchewan. All of this data was, and is, available for download from the Government of Canada’s website, but with one catch: you had to download the hourly data one month at a time, manually! There is no interface allowing a user of the website to specify the date range they want and download all the data from a single station. I figured there had to be a better way, using R to automate the downloading. Thinking the solution I came up with might save other researchers needing to grab data from the Government of Canada’s website some time in the future, I wrote this post to document how we ended up doing it.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/screenshot-gc-climate-website.jpeg" title="Screenshot of Government of Canada's climate website" alt="Screenshot of Government of Canadaās climate website" />
<figcaption>
Screenshot of Government of Canadaās climate website
</figcaption>
</figure>
<p>
The website itself is reasonably pretty, but the way the web form triggered the download of a CSV containing the data was a little tricky. You can see an example of the sort of data we were interested in <a href="http://climate.weather.gc.ca/climateData/hourlydata_e.html?timeframe=1&Prov=SK&StationID=28011&hlyRange=1996-01-30%7C2015-01-12&Year=2015&Month=1&Day=12">here</a>; interestingly, you are only shown a single day of data, but when you click the big Download button you get the entire month containing the day shown in the HTML table. The web form was setting some hidden parameters that were added to the current page’s URL once the Download button was clicked. Frustratingly, the same page that showed the HTML table also handled generating and returning the CSV download. Even more frustrating, the script needed GET variables with almost the same names as some of the existing ones, differing only in case, such as <code>StationID</code> and <code>stationID</code>, the latter of which is required only by the CSV-creating script. A further annoyance was that even though the generated CSV contained an entire month’s worth of data, the URL still needed to contain the <code>Day</code> GET variable.
</p>
<p>
I’m sure I haven’t whittled the URL down to the bare minimum required to trigger CSV generation and download, but I ended up using:
</p>
<pre><code>http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&Prov=SK&StationID=28011&hlyRange=1996-01-30%7C2014-11-30&cmdB1=Go&Year=2003&Month=5&Day=27&format=csv&stationID=28011</code></pre>
<p>
which will get you the data for May 2003 from station 28011 (Regina RCS).
</p>
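<p>
If you only need that one month, you can of course download it directly by hand; a minimal sketch using the URL above (the file name simply mirrors the naming convention I use later on):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## download a single month (May 2003, station 28011) by hand
url <- paste0("http://climate.weather.gc.ca/climateData/bulkdata_e.html?",
              "timeframe=1&Prov=SK&StationID=28011&hlyRange=1996-01-30%7C2014-11-30",
              "&cmdB1=Go&Year=2003&Month=5&Day=27&format=csv&stationID=28011")
download.file(url, destfile = "28011-2003-5-data.csv", quiet = TRUE)</code></pre>
</figure>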
<p>
Having figured that out, I needed a little function that would generate the URLs we’d need to visit to get data covering the periods we wanted. Because the student needed multiple stations, and the time periods of interest differed between stations (stations got moved and picked up new IDs, so we needed to track those movements), I wrote a little function that creates a whole load of URLs given a set of station IDs and start and end years.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">genURLS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">years</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">nyears</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">years</span><span class="p">)</span><span class="w">
</span><span class="n">years</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">years</span><span class="p">,</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w">
</span><span class="n">months</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nyears</span><span class="p">)</span><span class="w">
</span><span class="n">URLS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="s2">"http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&Prov=SK&StationID="</span><span class="p">,</span><span class="w">
</span><span class="n">id</span><span class="p">,</span><span class="w">
</span><span class="s2">"&hlyRange=1953-01-30%7C2014-12-31&cmdB1=Go&Year="</span><span class="p">,</span><span class="w">
</span><span class="n">years</span><span class="p">,</span><span class="w">
</span><span class="s2">"&Month="</span><span class="p">,</span><span class="w">
</span><span class="n">months</span><span class="p">,</span><span class="w">
</span><span class="s2">"&Day=27"</span><span class="p">,</span><span class="w">
</span><span class="s2">"&format=csv"</span><span class="p">,</span><span class="w">
</span><span class="s2">"&stationID="</span><span class="p">,</span><span class="w">
</span><span class="n">id</span><span class="p">)</span><span class="w">
</span><span class="nf">list</span><span class="p">(</span><span class="n">urls</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">URLS</span><span class="p">,</span><span class="w"> </span><span class="n">ids</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">nyears</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">12</span><span class="p">),</span><span class="w"> </span><span class="n">years</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">years</span><span class="p">,</span><span class="w"> </span><span class="n">months</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">months</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
The <code>genURLS()</code> function is pretty simple: it repeats each year in the sequence <code>start:end</code> 12 times, once per month, and repeats the months <code>1:12</code> for as many years as were requested. Then it builds up a character vector of URLs from these <code>years</code> and <code>months</code> vectors and the station ID <code>id</code>.
</p>
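<p>
To make that repetition concrete, here is the expansion in miniature for a two-year request:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## two years expand to 24 (year, month) pairs, one per monthly CSV
years  <- rep(2013:2014, each = 12) # each year repeated 12 times
months <- rep(1:12, times = 2)      # the months recycled once per year
head(cbind(years, months))</code></pre>
</figure>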
<p>
If we wanted all the data for 2014 for the Regina RCS station then we could generate the URLs we’d need to visit as follows
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">regina</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">genURLS</span><span class="p">(</span><span class="m">28011</span><span class="p">,</span><span class="w"> </span><span class="m">2014</span><span class="p">,</span><span class="w"> </span><span class="m">2014</span><span class="p">)</span><span class="w">
</span><span class="nf">length</span><span class="p">(</span><span class="n">regina</span><span class="o">$</span><span class="n">urls</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">regina</span><span class="o">$</span><span class="n">urls</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 12
[1] "http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&Prov=SK&StationID=28011&hlyRange=1953-01-30%7C2014-12-31&cmdB1=Go&Year=2014&Month=1&Day=27&format=csv&stationID=28011"
[2] "http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&Prov=SK&StationID=28011&hlyRange=1953-01-30%7C2014-12-31&cmdB1=Go&Year=2014&Month=2&Day=27&format=csv&stationID=28011"
[3] "http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&Prov=SK&StationID=28011&hlyRange=1953-01-30%7C2014-12-31&cmdB1=Go&Year=2014&Month=3&Day=27&format=csv&stationID=28011"
[4] "http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&Prov=SK&StationID=28011&hlyRange=1953-01-30%7C2014-12-31&cmdB1=Go&Year=2014&Month=4&Day=27&format=csv&stationID=28011"
[5] "http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&Prov=SK&StationID=28011&hlyRange=1953-01-30%7C2014-12-31&cmdB1=Go&Year=2014&Month=5&Day=27&format=csv&stationID=28011"
[6] "http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&Prov=SK&StationID=28011&hlyRange=1953-01-30%7C2014-12-31&cmdB1=Go&Year=2014&Month=6&Day=27&format=csv&stationID=28011"</code></pre>
</figure>
<p>
The function I used to grab all the data is a little more involved, partly because in a long-running job you don’t want a single error due to a bad download to end the entire job. Another reason for some of the complexity is that if the job did fail for some reason, as long as the files downloaded up to that point were OK and readable, I didn’t want to download them again. Therefore the function saves each CSV file to disk first and only then tries to read the data from the local file. The function is reasonably well-commented so I won’t dwell on those details
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">getData</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">stations</span><span class="p">,</span><span class="w"> </span><span class="n">folder</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">delete</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1">## form URLS</span><span class="w">
</span><span class="n">urls</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="nf">seq_len</span><span class="p">(</span><span class="n">NROW</span><span class="p">(</span><span class="n">stations</span><span class="p">)),</span><span class="w">
</span><span class="k">function</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">stations</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">genURLS</span><span class="p">(</span><span class="n">stations</span><span class="o">$</span><span class="n">StationID</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w">
</span><span class="n">stations</span><span class="o">$</span><span class="n">start</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w">
</span><span class="n">stations</span><span class="o">$</span><span class="n">end</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w">
</span><span class="p">},</span><span class="w"> </span><span class="n">stations</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stations</span><span class="p">)</span><span class="w">
</span><span class="c1">## check the folder exists and try to create it if not</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="n">folder</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">warning</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"Directory:"</span><span class="p">,</span><span class="w"> </span><span class="n">folder</span><span class="p">,</span><span class="w">
</span><span class="s2">"doesn't exist. Will create it"</span><span class="p">))</span><span class="w">
</span><span class="n">fc</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">dir.create</span><span class="p">(</span><span class="n">folder</span><span class="p">))</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">inherits</span><span class="p">(</span><span class="n">fc</span><span class="p">,</span><span class="w"> </span><span class="s2">"try-error"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">stop</span><span class="p">(</span><span class="s2">"Failed to create directory '"</span><span class="p">,</span><span class="w"> </span><span class="n">folder</span><span class="p">,</span><span class="w">
</span><span class="s2">"'. Check path and permissions."</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## Extract the data from the URLs generation</span><span class="w">
</span><span class="n">URLS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="s1">'[['</span><span class="p">,</span><span class="w"> </span><span class="s2">"urls"</span><span class="p">))</span><span class="w">
</span><span class="n">sites</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="s1">'[['</span><span class="p">,</span><span class="w"> </span><span class="s2">"ids"</span><span class="p">))</span><span class="w">
</span><span class="n">years</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="s1">'[['</span><span class="p">,</span><span class="w"> </span><span class="s2">"years"</span><span class="p">))</span><span class="w">
</span><span class="n">months</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="s1">'[['</span><span class="p">,</span><span class="w"> </span><span class="s2">"months"</span><span class="p">))</span><span class="w">
</span><span class="c1">## filenames to use to save the data</span><span class="w">
</span><span class="n">fnames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste</span><span class="p">(</span><span class="n">sites</span><span class="p">,</span><span class="w"> </span><span class="n">years</span><span class="p">,</span><span class="w"> </span><span class="n">months</span><span class="p">,</span><span class="w"> </span><span class="s2">"data.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"-"</span><span class="p">)</span><span class="w">
</span><span class="n">fnames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">file.path</span><span class="p">(</span><span class="n">folder</span><span class="p">,</span><span class="w"> </span><span class="n">fnames</span><span class="p">)</span><span class="w">
</span><span class="n">nfiles</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">fnames</span><span class="p">)</span><span class="w">
</span><span class="c1">## set up a progress bar if being verbose</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">verbose</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">pb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">txtProgressBar</span><span class="p">(</span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nfiles</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="nf">on.exit</span><span class="p">(</span><span class="n">close</span><span class="p">(</span><span class="n">pb</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">out</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vector</span><span class="p">(</span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"list"</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nfiles</span><span class="p">)</span><span class="w">
</span><span class="n">cnames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Date/Time"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Year"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Month"</span><span class="p">,</span><span class="s2">"Day"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Time"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Data Quality"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Temp Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Dew Point Temp (degC)"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Dew Point Temp Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Rel Hum (%)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Rel Hum Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Wind Dir (10s deg)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Wind Dir Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Wind Spd (km/h)"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Wind Spd Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Visibility (km)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Visibility Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Stn Press (kPa)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Stn Press Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Hmdx"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Hmdx Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Wind Chill"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Wind Chill Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Weather"</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">nfiles</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">curfile</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fnames</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="c1">## Have we downloaded the file before?</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="n">curfile</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># No: download it</span><span class="w">
</span><span class="n">dload</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">download.file</span><span class="p">(</span><span class="n">URLS</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">destfile</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">curfile</span><span class="p">,</span><span class="w"> </span><span class="n">quiet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">inherits</span><span class="p">(</span><span class="n">dload</span><span class="p">,</span><span class="w"> </span><span class="s2">"try-error"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># If problem, store failed URL...</span><span class="w">
</span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">URLS</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">verbose</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">setTxtProgressBar</span><span class="p">(</span><span class="n">pb</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="c1"># update progress bar...</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">next</span><span class="w"> </span><span class="c1"># bail out of current iteration</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## Must have downloaded, try to read file</span><span class="w">
</span><span class="c1">## skip first 16 rows of header stuff</span><span class="w">
</span><span class="c1">## encoding must be latin1 or will fail - may still be problems with character set</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">read.csv</span><span class="p">(</span><span class="n">curfile</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="n">encoding</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"latin1"</span><span class="p">),</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1">## Did we have a problem reading the data?</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">inherits</span><span class="p">(</span><span class="n">cdata</span><span class="p">,</span><span class="w"> </span><span class="s2">"try-error"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># yes handle read problem</span><span class="w">
</span><span class="c1">## try to fix the problem with dodgy characters</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readLines</span><span class="p">(</span><span class="n">curfile</span><span class="p">)</span><span class="w"> </span><span class="c1"># read all lines in file</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s2">"\x87"</span><span class="p">,</span><span class="w"> </span><span class="s2">"x"</span><span class="p">,</span><span class="w"> </span><span class="n">cdata</span><span class="p">)</span><span class="w"> </span><span class="c1"># remove the dodgy symbol for partner data in Data Quality</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s2">"\xb0"</span><span class="p">,</span><span class="w"> </span><span class="s2">"deg"</span><span class="p">,</span><span class="w"> </span><span class="n">cdata</span><span class="p">)</span><span class="w"> </span><span class="c1"># remove the dodgy degree symbol in column names</span><span class="w">
</span><span class="n">writeLines</span><span class="p">(</span><span class="n">cdata</span><span class="p">,</span><span class="w"> </span><span class="n">curfile</span><span class="p">)</span><span class="w"> </span><span class="c1"># write the data back to the file</span><span class="w">
</span><span class="c1">## try to read the file again, if still an error, bail out</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">read.csv</span><span class="p">(</span><span class="n">curfile</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="n">encoding</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"latin1"</span><span class="p">),</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">inherits</span><span class="p">(</span><span class="n">cdata</span><span class="p">,</span><span class="w"> </span><span class="s2">"try-error"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># yes, still!, handle read problem</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">delete</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">file.remove</span><span class="p">(</span><span class="n">curfile</span><span class="p">)</span><span class="w"> </span><span class="c1"># remove file if a problem & deleting</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">URLS</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="c1"># record failed URL...</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">verbose</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">setTxtProgressBar</span><span class="p">(</span><span class="n">pb</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="c1"># update progress bar...</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">next</span><span class="w"> </span><span class="c1"># bail out of current iteration</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## Must have (eventually) read file OK, add station data</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind.data.frame</span><span class="p">(</span><span class="n">StationID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">sites</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">NROW</span><span class="p">(</span><span class="n">cdata</span><span class="p">)),</span><span class="w">
</span><span class="n">cdata</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">cdata</span><span class="p">)[</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cnames</span><span class="w">
</span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cdata</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">verbose</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># Update the progress bar</span><span class="w">
</span><span class="n">setTxtProgressBar</span><span class="p">(</span><span class="n">pb</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">out</span><span class="w"> </span><span class="c1"># return</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
The main infelicity is that you have to supply <code>getData()</code> with a data frame containing the station IDs and the start and end years for the data you want to collect. This suited my needs, as we wanted to grab data from 10 stations with different start and end years as required to track station movements. It’s not as convenient if you only want to grab the data for a single station, however.
</p>
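<p>
If you do find yourself wanting a single station often, a tiny wrapper can build the one-row data frame for you. This is just a sketch; <code>getStation()</code> is a name I’m inventing here, not part of the code above:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## hypothetical convenience wrapper around getData() for one station
getStation <- function(id, start, end, folder, ...) {
    getData(data.frame(StationID = id, start = start, end = end),
            folder = folder, ...)
}</code></pre>
</figure>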
<p>
One thing you’ll note quickly if you start downloading data using this function is that the web script the Government of Canada is using on their climate website will quite happily generate a fully-formed file containing no actual data (but with all the headers, hourly time stamps, etc.) if you ask it for data outside the window of observations for a given station. There are no errors, just lots of mostly empty files, bar the header and labels.
</p>
<p>
One other thing to note is that <code>getData()</code> returns the downloaded data as a list, and no attempt is made to flatten the individual components into a single large data frame. That’s because the function allows for failed data downloads (or reads) and records the failed URL instead of the data. This gives you a chance to check those URLs manually to see what the problem might be before re-running the job, which, because we saved all the CSVs, will run very quickly from that local cache.
</p>
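<p>
Relatedly, the empty-but-well-formed files mentioned above can be flagged after the fact. A sketch, assuming <code>met</code> is the list returned by <code>getData()</code> as in the example below, and using the temperature column name set via <code>cnames</code> above:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## flag downloads that parsed fine but contain no observations at all
noData <- sapply(met, function(x) {
    is.data.frame(x) && all(is.na(x[["Temp (degC)"]]))
})</code></pre>
</figure>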
<p>
To see <code>getData()</code> in action, we’ll run a quick job, downloading the 2014 data for two stations
</p>
<ul>
<li>
Regina INTL A (51441)
</li>
<li>
Indian Head CDA (2925)
</li>
</ul>
<p>
First we create a data frame of station information
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">stations</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">StationID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">51441</span><span class="p">,</span><span class="w"> </span><span class="m">2925</span><span class="p">),</span><span class="w">
</span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">2014</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
</span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">2014</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span></code></pre>
</figure>
<p>
Then we pass this to <code>getData()</code> with the path to the folder we wish to cache downloaded CSVs in
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">met</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">getData</span><span class="p">(</span><span class="n">stations</span><span class="p">,</span><span class="w"> </span><span class="n">folder</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"./csv"</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<p>
This will take a few minutes to run, even for just 24 files, as the site is not the quickest to respond to requests (or perhaps they are now throttling my workstation's IP?). Note I turned off the printing of the progress bar here, only because this doesn't play nicely with <strong>knitr</strong>'s capturing of the output. In real use, you'll want to leave the progress bar on (which it is by default) so you see how long you have to wait till the job is done.
</p>
<p>
Once this has finished, we can quickly determine if there were any failures
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="nf">any</span><span class="p">(</span><span class="n">failed</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">met</span><span class="p">,</span><span class="w"> </span><span class="n">is.character</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] FALSE</code></pre>
</figure>
<p>
If any had failed, the <code>failed</code> logical vector could be used to index into <code>met</code> to extract the URLs that encountered problems, e.g.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">unlist</span><span class="p">(</span><span class="n">met</span><span class="p">[</span><span class="n">failed</span><span class="p">])</span></code></pre>
</figure>
<p>
If there were no problems, then the components of <code>met</code> can be bound into a data frame using <code>rbind()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">met</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="s2">"rbind"</span><span class="p">,</span><span class="w"> </span><span class="n">met</span><span class="p">)</span></code></pre>
</figure>
<p>
The data now looks like this
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">head</span><span class="p">(</span><span class="n">met</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> StationID Date.Time Year Month Day Time Data.Quality Temp...C.
1 51441 2014-01-01 00:00 2014 1 1 00:00 ** -23.3
2 51441 2014-01-01 01:00 2014 1 1 01:00 ** -23.1
3 51441 2014-01-01 02:00 2014 1 1 02:00 ** -22.8
4 51441 2014-01-01 03:00 2014 1 1 03:00 ** -23.3
5 51441 2014-01-01 04:00 2014 1 1 04:00 ** -24.3
6 51441 2014-01-01 05:00 2014 1 1 05:00 ** -24.3
Temp.Flag Dew.Point.Temp...C. Dew.Point.Temp.Flag Rel.Hum....
1 -26.3 77
2 -26.1 77
3 -25.8 77
4 -26.3 77
5 -27.1 78
6 -27.0 79
Rel.Hum.Flag Wind.Dir..10s.deg. Wind.Dir.Flag Wind.Spd..km.h.
1 13 <NA> 22
2 12 <NA> 26
3 12 <NA> 22
4 13 <NA> 18
5 13 <NA> 14
6 9 <NA> 6
Wind.Spd.Flag Visibility..km. Visibility.Flag Stn.Press..kPa.
1 19.3 <NA> 95.38
2 24.1 <NA> 95.38
3 24.1 <NA> 95.39
4 24.1 <NA> 95.47
5 24.1 <NA> 95.56
6 24.1 <NA> 95.60
Stn.Press.Flag Hmdx Hmdx.Flag Wind.Chill Wind.Chill.Flag
1 NA NA -35 NA
2 NA NA -36 NA
3 NA NA -35 NA
4 NA NA -34 NA
5 NA NA -34 NA
6 NA NA -30 NA
Weather
1 Snow,Blowing Snow
2 Snow,Blowing Snow
3 Snow,Blowing Snow
4 Snow,Blowing Snow
5 Snow
6 <NA></code></pre>
</figure>
<p>
Yep, a bit of a mess; some post processing is required if you want tidy <code>names</code> etc. The student was only interested in temperature and relative humidity so I dropped all the other met data and data quality columns and then only had to update a few variable names. <del>I purposely didn't have <code>getData()</code> fix this in case the data format on the Government of Canada's climate website changes.</del> <strong>Update</strong> I had to change this behaviour to allow <code>getData()</code> to process some degenerate CSV files with odd characters in the column name data and the data quality field (see the comments for details). The column names are hardcoded but retain the messy names as given to them by the Government of Canada's webmaster. Cleaning up afterwards is still advised.
</p>
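<p>
For what it's worth, a minimal sketch of the sort of clean-up I mean is below; it keeps only the temperature and relative humidity columns, using the messy names shown in the <code>head(met)</code> output above, and assigns tidier replacements (the new names are my choice, nothing more).
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Keep only the columns of interest; the messy names are those produced
## from the climate site's CSV headers, as shown by head(met) above
met <- met[, c("StationID", "Date.Time", "Year", "Month", "Day", "Time",
               "Temp...C.", "Rel.Hum....")]
## Tidier, hand-picked replacement names
names(met) <- c("StationID", "DateTime", "Year", "Month", "Day", "Time",
                "Temperature", "RelHumidity")</code></pre>
</figure>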
<p>
A final note: I could have run this over all the cores in my workstation or even on all the computers in my small computer cluster, but I didn't, instead choosing to run on a single core overnight to get the data we needed. Please be a good netizen if you do use the functions I've discussed here as other people will no doubt want to access the Government of Canada's website. Don't flood the site with requests!
</p>
<p>
If you have any suggestions for improvements or changes, let me know in the comments. The latest versions of the <code>genURLS()</code> and <code>getData()</code> functions can be found in this Github <a href="https://gist.github.com/gavinsimpson/8c13e3c5f905fd67cf85">gist</a>.
</p>
Analysing a randomised complete block design with vegan
Gavin L. Simpson
2014-11-03T00:00:00-06:00
2014-11-03T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2014/11/03/randomized-complete-block-designs-and-vegan/
<p>
It has been a long time coming. <a href="http://cran.r-project.org/package=vegan"><strong>Vegan</strong></a> now has in-built, native ability to use restricted permutation designs when testing effects in constrained ordinations and in a range of other methods. This new-found functionality comes courtesy of Jari (mainly) and my efforts to have vegan permutation routines use the <a href="http://cran.r-project.org/package=permute"><strong>permute</strong></a> package. Jari also cooked up a standard interface that we can use to drop this and some extra features neatly into any function we want; this allows us to have permutation tests run on many CPU cores in parallel, splitting the computational burden and reducing the run time of tests, and also a mechanism that allows users to pass a matrix of user-defined permutations to be used in tests. These new features are now fully working in the development version of <strong>vegan</strong>, which you can find on <a href="https://github.com/vegandevs/vegan">github</a>, and which should be released to CRAN shortly. Ahead of the release, I'm preparing some examples to show off the new capabilities; first off I look at data from a randomized, complete block design experiment analysed using RDA & restricted permutations.
</p>
<p>
To follow this example locally you'll need to have version 2.1-43 or later of <strong>vegan</strong> installed. You can grab the <a href="https://github.com/vegandevs/vegan">sources from github</a> and build it yourself, or grab a Windows binary from the <a href="https://ci.appveyor.com/project/gavinsimpson/vegan/branch/master/artifacts">Appveyor Continuous integration service</a> that we're using to test on that platform – you want the <code>.zip</code> file from the Artefacts. Once you've sorted out the installation, we can begin.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"vegan"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: permute
Loading required package: lattice
This is vegan 2.1-43</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"gdata"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
Attaching package: 'gdata'
The following object is masked from 'package:stats':
nobs
The following object is masked from 'package:utils':
object.size</code></pre>
</figure>
<p>
We'll need <strong>gdata</strong>, and its <code>read.xls()</code> function, to read the XLS format files in which the data for the example are supplied.
</p>
<p>
The data set itself is quite simple and small, consisting of counts on 23 species from 16 plots, and arises from a randomised complete block design experiment described by Špačková and colleagues <span class="citation" data-cites="Spackova1998-ad">(1998)</span> and analysed by <span class="citation" data-cites="Smilauer2014-ac">Šmilauer and Lepš (2014)</span> in their recent book using Canoco v5.
</p>
<p>
The experiment tested the effects of a range of treatments on seedling recruitment
</p>
<ul>
<li>
control
</li>
<li>
removal of litter
</li>
<li>
removal of the dominant species <em>Nardus stricta</em>
</li>
<li>
removal of litter and moss (moss couldn't be removed without also removing litter)
</li>
</ul>
<p>
The treatments were replicated in four randomised complete blocks.
</p>
<p>
The data are available from the accompanying website to the book <em>Multivariate Analysis of Ecological Data using CANOCO 5</em> <span class="citation" data-cites="Smilauer2014-ac">(Šmilauer and Lepš, 2014)</span>. They are supplied as XLS format files in a ZIP archive. We can read these into R directly from the website with a little bit of effort
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## Download the data zip</span><span class="w">
</span><span class="n">furl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"http://regent.prf.jcu.cz/maed2/chap15.zip"</span><span class="w">
</span><span class="n">td</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempdir</span><span class="p">()</span><span class="w">
</span><span class="n">tf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">(</span><span class="n">tmpdir</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">td</span><span class="p">,</span><span class="w"> </span><span class="n">fileext</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">".zip"</span><span class="p">)</span><span class="w">
</span><span class="n">download.file</span><span class="p">(</span><span class="n">furl</span><span class="p">,</span><span class="w"> </span><span class="n">tf</span><span class="p">)</span><span class="w">
</span><span class="c1">## list the files in the zip, we want the xls version (file 3)</span><span class="w">
</span><span class="n">fname</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unzip</span><span class="p">(</span><span class="n">tf</span><span class="p">,</span><span class="w"> </span><span class="n">list</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="o">$</span><span class="n">Name</span><span class="p">[</span><span class="m">3</span><span class="p">]</span><span class="w">
</span><span class="n">unzip</span><span class="p">(</span><span class="n">tf</span><span class="p">,</span><span class="w"> </span><span class="n">files</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fname</span><span class="p">,</span><span class="w"> </span><span class="n">exdir</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">td</span><span class="p">,</span><span class="w"> </span><span class="n">overwrite</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="c1"># unzip</span><span class="w">
</span><span class="n">datpath</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">file.path</span><span class="p">(</span><span class="n">td</span><span class="p">,</span><span class="w"> </span><span class="n">fname</span><span class="p">)</span><span class="w"> </span><span class="c1"># path to xls</span><span class="w">
</span><span class="c1">## read the xls file, sheet 2 contains species data, sheet 3 the env</span><span class="w">
</span><span class="n">spp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.xls</span><span class="p">(</span><span class="n">datpath</span><span class="p">,</span><span class="w"> </span><span class="n">sheet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">env</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.xls</span><span class="p">(</span><span class="n">datpath</span><span class="p">,</span><span class="w"> </span><span class="n">sheet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<p>
The <code>block</code> variable is currently coded as an integer and needs converting to a factor if we are to use it correctly in the analysis
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">env</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">env</span><span class="p">,</span><span class="w"> </span><span class="n">block</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">block</span><span class="p">))</span></code></pre>
</figure>
<p>
The gradient lengths are short,
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">decorana</span><span class="p">(</span><span class="n">spp</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Call:
decorana(veg = spp)
Detrended correspondence analysis with 26 segments.
Rescaling of axes with 4 iterations.
DCA1 DCA2 DCA3 DCA4
Eigenvalues 0.1759 0.1898 0.11004 0.05761
Decorana values 0.2710 0.1822 0.07219 0.02822
Axis lengths 1.9821 1.4140 1.15480 0.87680</code></pre>
</figure>
<p>
motivating the use of redundancy analysis (RDA). Additionally, we may be interested in how the raw abundance of seedlings changes following experimental manipulation, or we may wish to focus on the proportional differences between treatments. The first case is handled naturally by RDA. The second case will require some form of standardisation by samples, say by sample totals.
</p>
<p>
First, let's test the first null hypothesis: that there is no effect of the treatment on seedling recruitment. This is a simple RDA. We should take into account the <code>block</code> factor when we assess this model for significance. How we do this illustrates two potential approaches to performing permutation tests
</p>
<ol type="1">
<li>
<p>
<strong>design</strong>-based permutations, where how the samples are permuted follows the experimental design, or
</p>
</li>
<li>
<p>
<strong>model</strong>-based permutations, where the experimental design is included in the analysis directly and residuals are permuted by simple randomisation.
</p>
</li>
</ol>
<p>
There is an important difference between the two approaches, one which I'll touch on shortly.
</p>
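<p>
To see what a design-based permutation does in practice, consider this small sketch using a toy factor of four blocks of four samples (not the experimental data themselves); <code>shuffleSet()</code> from the <strong>permute</strong> package only ever shuffles samples within their block.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Sketch: a toy blocking factor, 4 blocks of 4 contiguous samples
library("permute")
blk <- gl(4, 4)
h0 <- how(blocks = blk, nperm = 999)
## Two example permutations; indices only move within their block of 4
shuffleSet(16, nset = 2, control = h0)</code></pre>
</figure>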
<p>
We'll proceed by fitting the model, conditioning on <code>block</code> to remove between-block differences
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">mod1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rda</span><span class="p">(</span><span class="n">spp</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">treatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Condition</span><span class="p">(</span><span class="n">block</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">env</span><span class="p">)</span><span class="w">
</span><span class="n">mod1</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Call: rda(formula = spp ~ treatment + Condition(block), data =
env)
Inertia Proportion Rank
Total 990.8000 1.0000
Conditional 166.1000 0.1676 3
Constrained 329.8000 0.3329 3
Unconstrained 494.9000 0.4995 9
Inertia is variance
Eigenvalues for constrained axes:
RDA1 RDA2 RDA3
284.81 30.83 14.20
Eigenvalues for unconstrained axes:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
226.83 139.51 72.77 30.11 9.81 9.14 2.80 2.19 1.73 </code></pre>
</figure>
<p>
There is a single, strong, linear gradient in the data, as evidenced by the relative magnitudes of the eigenvalues (here expressed as proportions of the total variance)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">eigenvals</span><span class="p">(</span><span class="n">mod1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">mod1</span><span class="o">$</span><span class="n">tot.chi</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> RDA1 RDA2 RDA3 PC1 PC2 PC3
0.28746238 0.03111202 0.01432998 0.22893569 0.14080915 0.07344450
PC4 PC5 PC6 PC7 PC8 PC9
0.03038815 0.00989932 0.00922185 0.00282396 0.00221132 0.00174669 </code></pre>
</figure>
<h2 id="design-based-permutations">
Design-based permutations
</h2>
<p>
A <em>design</em>-based permutation test of these data would be conditioned on the <code>block</code> variable, by restricting permutation of samples to only <em>within</em> the levels of <code>block</code>. In this situation, samples are never permuted between blocks, only within. We can set up this type of permutation design as follows
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">h</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">how</span><span class="p">(</span><span class="n">blocks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">env</span><span class="o">$</span><span class="n">block</span><span class="p">,</span><span class="w"> </span><span class="n">nperm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">999</span><span class="p">)</span></code></pre>
</figure>
<p>
Note that we could use the <code>plots</code> argument instead of <code>blocks</code> to restrict the permutations in the same way, but using <code>blocks</code> is simpler. I also set the required number of permutations for the test here.
</p>
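<p>
For completeness, a sketch of what the <code>plots</code>-based version might look like is below; if I have the <strong>permute</strong> interface right, leaving the plots themselves unpermuted while freely shuffling samples within plots achieves the same restriction.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Sketch: an (assumed) equivalent design using plots instead of blocks;
## plots are never permuted, samples are shuffled freely within them
h2 <- how(plots = Plots(strata = env$block, type = "none"),
          within = Within(type = "free"), nperm = 999)</code></pre>
</figure>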
<p>
Constrained ordinations in <strong>vegan</strong> are tested using the <code>anova()</code> function. New in the development version of the package is the <code>permutations</code> argument, which is the key to supplying instructions on how you want to permute to <code>anova()</code>. <code>permutations</code> can take a number of different types of instruction
</p>
<ol type="1">
<li>
<p>
an object of class <code>"how"</code>, which contains details of a restricted permutation design that <code>shuffleSet()</code> from the <strong>permute</strong> package will use to generate permutations, or
</p>
</li>
<li>
<p>
a number indicating the number of permutations required, in which case these are simple randomisations with no restriction, unless the <code>strata</code> argument is used, or
</p>
</li>
<li>
<p>
a matrix of user-specified permutations, 1 row per permutation; a small sketch of this option follows below.
</p>
</li>
</ol>
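<p>
As a quick sketch of the third option, we could generate the whole set of restricted permutations up front with <code>shuffleSet()</code> and pass the resulting matrix directly; this just reuses the <code>h</code> and <code>mod1</code> objects created above.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Sketch: a matrix of 999 restricted permutations, 1 row per permutation,
## passed directly to anova()
perms <- shuffleSet(nrow(spp), nset = 999, control = h)
pmat <- anova(mod1, permutations = perms, parallel = 3)</code></pre>
</figure>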
<p>
To perform the design-based permutation we'll pass <code>h</code>, created earlier, to <code>anova()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="n">p1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">anova</span><span class="p">(</span><span class="n">mod1</span><span class="p">,</span><span class="w"> </span><span class="n">permutations</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">,</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">p1</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Permutation test for rda under reduced model
Blocks: env$block
Permutation: free
Number of permutations: 999
Model: rda(formula = spp ~ treatment + Condition(block), data = env)
Df Variance F Pr(>F)
Model 3 329.84 1.9995 0.086 .
Residual 9 494.88
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
</figure>
<p>
Note that I've run this on three cores in parallel; this is another new feature of the development version of <strong>vegan</strong> and can considerably reduce the time needed to run permutation tests. I have four cores on my laptop but left one free for the other software I have running.
</p>
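<p>
If you are not sure how many cores your own machine has spare, the <strong>parallel</strong> package that ships with R can tell you.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## How many cores does this machine have?
library("parallel")
detectCores()</code></pre>
</figure>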
<p>
The overall permutation test indicates no significant effect of treatment on the abundance of seedlings. We can test individual axes by adding <code>by = "axis"</code> to the <code>anova()</code> call
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">24</span><span class="p">)</span><span class="w">
</span><span class="n">p1axis</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">anova</span><span class="p">(</span><span class="n">mod1</span><span class="p">,</span><span class="w"> </span><span class="n">permutations</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">,</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"axis"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: parallel</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p1axis</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Permutation test for rda under reduced model
Marginal tests for axes
Blocks: env$block
Permutation: free
Number of permutations: 999
Model: rda(formula = spp ~ treatment + Condition(block), data = env)
Df Variance F Pr(>F)
RDA1 1 284.81 5.1797 0.018 *
RDA2 1 30.83 0.5606 0.691
RDA3 1 14.20 0.2582 0.923
Residual 9 494.88
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
</figure>
<p>
This confirms the earlier impression that there is a single, linear gradient in the data set. A biplot shows that this axis of variation is associated with the Moss (& Litter) removal treatment. The variation between the other treatments lies primarily along axis two and is substantially less than that associated with the Moss & Litter removal.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">mod1</span><span class="p">,</span><span class="w"> </span><span class="n">display</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"species"</span><span class="p">,</span><span class="w"> </span><span class="s2">"cn"</span><span class="p">),</span><span class="w"> </span><span class="n">scaling</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
</span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-10.5</span><span class="p">,</span><span class="w"> </span><span class="m">1.5</span><span class="p">))</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="n">mod1</span><span class="p">,</span><span class="w"> </span><span class="n">display</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"species"</span><span class="p">,</span><span class="w"> </span><span class="n">scaling</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="n">mod1</span><span class="p">,</span><span class="w"> </span><span class="n">display</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cn"</span><span class="p">,</span><span class="w"> </span><span class="n">scaling</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.2</span><span class="p">,</span><span class="w">
</span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Control"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Litter+Moss"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Litter"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Removal"</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/randomized-complete-block-design-and-vegan-biplot-1.png" alt="Figure 1: RDA biplot showing species scores and treatment centroids." />
<figcaption>
Figure 1: RDA biplot showing species scores and treatment centroids.
</figcaption>
</figure>
<p>
In the above figure, I used <code>scaling = 1</code>, so-called <em>inter-sample distance scaling</em>, as this best represents the centroid scores, which are computed as the treatment-wise average of the sample scores.
</p>
<h2 id="model-based-permutation">
Model-based permutation
</h2>
<p>
The alternative approach, known as <em>model</em>-based permutation, employs free permutation of residuals after the effects of the covariables have been accounted for. This is justified because under the null hypothesis, the residuals are freely exchangeable once the effects of the covariables are removed. There is a clear advantage of model-based permutations over design-based permutations; where the sample size is small, as it is here, there tend to be few blocks and the resulting design-based permutation test is relatively weak compared to the model-based version.
</p>
<p>
It is simple to switch to model-based permutations, by setting the blocks indicator in the permutation design to <code>NULL</code>, removing the blocking structure from the design
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">setBlocks</span><span class="p">(</span><span class="n">h</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="kc">NULL</span><span class="w"> </span><span class="c1"># remove blocking</span><span class="w">
</span><span class="n">getBlocks</span><span class="p">(</span><span class="n">h</span><span class="p">)</span><span class="w"> </span><span class="c1"># confirm</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">NULL</code></pre>
</figure>
<p>
Next we repeat the permutation test using the modified <code>h</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">51</span><span class="p">)</span><span class="w">
</span><span class="n">p2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">anova</span><span class="p">(</span><span class="n">mod1</span><span class="p">,</span><span class="w"> </span><span class="n">permutations</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">,</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">p2</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Permutation test for rda under reduced model
Permutation: free
Number of permutations: 999
Model: rda(formula = spp ~ treatment + Condition(block), data = env)
Df Variance F Pr(>F)
Model 3 329.84 1.9995 0.068 .
Residual 9 494.88
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
</figure>
<p>
The estimated <em>p</em> value is slightly smaller now. The difference between treatments is predominantly in the Moss & Litter removal, with differences between the control and the other treatments lying along the insignificant axes
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">83</span><span class="p">)</span><span class="w">
</span><span class="n">p2axis</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">anova</span><span class="p">(</span><span class="n">mod1</span><span class="p">,</span><span class="w"> </span><span class="n">permutations</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">,</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"axis"</span><span class="p">)</span><span class="w">
</span><span class="n">p2axis</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Permutation test for rda under reduced model
Marginal tests for axes
Permutation: free
Number of permutations: 999
Model: rda(formula = spp ~ treatment + Condition(block), data = env)
Df Variance F Pr(>F)
RDA1 1 284.81 5.1797 0.010 **
RDA2 1 30.83 0.5606 0.735
RDA3 1 14.20 0.2582 0.960
Residual 9 494.88
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
</figure>
<h2 id="chages-in-relative-seedling-composition">
Chages in relative seedling composition
</h2>
<p>
As mentioned earlier, interest is also, perhaps predominantly, in whether any of the treatments have different species composition. To test this hypothesis we standardise by the sample (row) norm using <code>decostand()</code>. Alternatively we could have used <code>method = "total"</code> to work with proportional abundances. We then repeat the earlier steps, this time using only model-based permutations owing to their greater power.
</p>
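<p>
For reference, the proportional-abundance alternative just mentioned would be the following one-liner; I don't run it here.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Alternative standardisation: divide each sample (row) by its total to
## work with proportional abundances
spp.prop <- decostand(spp, method = "total", MARGIN = 1)</code></pre>
</figure>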
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">spp.norm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">decostand</span><span class="p">(</span><span class="n">spp</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"normalize"</span><span class="p">,</span><span class="w"> </span><span class="n">MARGIN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">mod2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rda</span><span class="p">(</span><span class="n">spp.norm</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">treatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Condition</span><span class="p">(</span><span class="n">block</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">env</span><span class="p">)</span><span class="w">
</span><span class="n">mod2</span><span class="w">
</span><span class="n">eigenvals</span><span class="p">(</span><span class="n">mod2</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">mod2</span><span class="o">$</span><span class="n">tot.chi</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">76</span><span class="p">)</span><span class="w">
</span><span class="n">anova</span><span class="p">(</span><span class="n">mod2</span><span class="p">,</span><span class="w"> </span><span class="n">permutations</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">,</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Call: rda(formula = spp.norm ~ treatment + Condition(block), data
= env)
Inertia Proportion Rank
Total 0.3726 1.0000
Conditional 0.0814 0.2184 3
Constrained 0.0725 0.1945 3
Unconstrained 0.2188 0.5871 9
Inertia is variance
Eigenvalues for constrained axes:
RDA1 RDA2 RDA3
0.04517 0.01718 0.01012
Eigenvalues for unconstrained axes:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
0.08026 0.07074 0.02860 0.01916 0.00989 0.00585 0.00223 0.00167 0.00038
RDA1 RDA2 RDA3 PC1 PC2 PC3
0.12123276 0.04610541 0.02716385 0.21539133 0.18983329 0.07675497
PC4 PC5 PC6 PC7 PC8 PC9
0.05140906 0.02655227 0.01570519 0.00597888 0.00447093 0.00101031
Permutation test for rda under reduced model
Permutation: free
Number of permutations: 999
Model: rda(formula = spp.norm ~ treatment + Condition(block), data = env)
Df Variance F Pr(>F)
Model 3 0.072475 0.9939 0.449
Residual 9 0.218768 </code></pre>
</figure>
<p>
The results suggest no difference in species composition under the experimental manipulation.
</p>
<p>
That's it for this post. In the next post I'll take a look at a more complex example, one where model-based permutations can't be used to test all the hypotheses we might want to in an experimental design.
</p>
<h2 id="references" class="unnumbered">
References
</h2>
<div id="refs" class="references">
<div id="ref-Smilauer2014-ac">
<p>
Šmilauer, P., and Lepš, J. (2014). <em>Multivariate analysis of ecological data using CANOCO 5</em>. 2nd edition. Cambridge University Press.
</p>
</div>
<div id="ref-Spackova1998-ad">
<p>
Špačková, I., Kotorová, I., and Lepš, J. (1998). Sensitivity of seedling recruitment to moss, litter and dominant removal in an oligotrophic wet meadow. <em>Folia geobotanica</em> 33, 17–30. doi:<a href="https://doi.org/10.1007/BF02914928">10.1007/BF02914928</a>.
</p>
</div>
</div>
analogue 0.14-0 released
Gavin L. Simpson
2014-10-14T00:00:00-06:00
2014-10-14T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2014/10/14/analogue-0.14-0-now-available-on-CRAN/
<p>
A couple of weeks ago I packaged up a new release of <strong>analogue</strong>, which is available from <a href="http://cran.r-project.org/web/packages/analogue/index.html">CRAN</a>. Version 0.14-0 is a smaller update than the changes released in 0.12-0 and sees a continuation of the changes to dependencies to have packages in Imports rather than Depends. The main development of <strong>analogue</strong> now takes place on <a href="https://github.com/gavinsimpson/analogue/">github</a> and bugs and feature requests should be posted there. The Travis continuous integration system is used to automatically check the package as new code is checked in. There are several new functions and methods and a few bug fixes, the details of which are given below.
</p>
<p>
The main user-visible change over 0.12-0 is the <strong>deprecation</strong> of the <code>plot3d.prcurve()</code> method. The functionality is now in the new function <code>Plot3d()</code>; <code>plot3d.prcurve()</code> is deprecated and, if called, needs to be called by its full name. This change makes analogue easier to install on Mac OS X as <strong>rgl</strong> is no longer needed to install <strong>analogue</strong>. If you want to plot the principal curve in an interactive 3d view, you'll need to get <strong>rgl</strong> installed first.
</p>
<h2 id="new-features">
New features
</h2>
<ul>
<li>
<p>
<code>n2()</code> is a new utility function to calculate Hill's N2 for sites (samples) & species (variables).
</p>
</li>
<li>
<p>
<code>optima()</code> can now compute bootstrap WA optima and uncertainty.
</p>
</li>
<li>
<p>
<code>performance()</code> has a new method for objects of class <code>"crossval"</code>.
</p>
</li>
<li>
<p>
<code>timetrack()</code> had several improvements including a new <code>predict()</code> method, which allows further points to be added to an existing timetrack, a <code>points()</code> method to allow the addition of data to an existing timetrack plot, and the <code>plot()</code> method can create a blank plotting region allowing greater customisation.
</p>
</li>
<li>
<p>
<code>prcurve()</code> gets <code>predict()</code> and <code>fitted()</code> methods to predict locations of new samples on the principal curve and extract the locations of the training samples respectively.
</p>
</li>
<li>
<p>
<code>evenSample</code> is a utility function to look at the evenness of the distribution of samples along a gradient.
</p>
</li>
<li>
<p>
Data sets <code>Pollen</code>, <code>Biome</code>, <code>Climate</code>, and <code>Location</code> from the North American Modern Pollen Database have been updated to version 1.7.3.
</p>
</li>
</ul>
<h2 id="bug-fixes">
Bug fixes
</h2>
<ul>
<li>
<p>
The calculation of AUC in <code>roc()</code> wasn't working correctly in some circumstances with just a couple of groups.
</p>
</li>
<li>
<p>
<code>crossval.pcr()</code> had a number of bugs in the k-fold CV routine which were leading to errors and the function not working.
</p>
<p>
The progress bar was not being updated correctly either.
</p>
</li>
<li>
<p>
<code>predict.pcr()</code> was setting argument <code>ncomp</code> incorrectly if not supplied by the user.
</p>
</li>
<li>
<p>
<code>ChiSquare()</code> wasn't returning the transformation parameters required to transform leave-out data during crossvalidation or new samples for which predictions were required.
</p>
</li>
<li>
<p>
<code>plot3d.prcurve()</code> was not using the data and ordination components of the returned object. Note this function is now deprecated.
</p>
</li>
<li>
<p>
<code>predict.pcr()</code> was incorrectly calling the internal function <code>fitPCR</code> with the <code>:::</code> operator.
</p>
</li>
</ul>
<h2 id="deprecated">
Deprecated
</h2>
<ul>
<li>
<code>plot3d.prcurve()</code> is deprecated. Functionality is in new function <code>Plot3d()</code>. <strong>Note</strong>: in the next version of <strong>analogue</strong>, this functionality will be removed entirely and located in a new package <a href="https://github.com/gavinsimpson/analogueExtra"><strong>analogueExtra</strong></a>.
</li>
</ul>
Simulating species abundance data with coenocliner
Gavin L. Simpson
2014-07-31T00:00:00-06:00
2014-07-31T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2014/07/31/simulating-species-abundance-data-with-the-coenocliner-package/
<p>
Coenoclines are, according to the Oxford Dictionary of Ecology <span class="citation" data-cites="Allaby1998">(Allaby, 1998)</span>, <em>"gradients of communities (e.g. in a transect from the summit to the base of a hill), reflecting the changing importance, frequency, or other appropriate measure of different species populations"</em>. In much ecological research, and that of related fields, data on these coenoclines are collected and analyzed in a variety of ways. When developing new statistical methods or when trying to understand the behaviour of existing methods, we often resort to simulating data with known pattern or structure and then torture whatever method is of interest with the simulated data to tease out how well methods work or where they break down. There's a long history of using computers to simulate species abundance data along coenoclines but until recently no <strong>R</strong> packages were available that performed coenocline simulation. <strong>coenocliner</strong> was designed to fill this gap, and today, the package was <a href="http://cran.r-project.org/web/packages/coenocliner/index.html">released to CRAN</a>.
</p>
<div id="refs" class="references">
<div id="ref-Allaby1998">
<p>
Allaby, M. (1998). <em>A dictionary of ecology</em>. 2nd edition. Oxford University Press.
</p>
</div>
</div>
<p>
<strong>coenocliner</strong> can simulate species abundance or occurrence data along one or two gradients from either a Gaussian or generalised beta response model. Parameters for the response model are supplied for each species and parameterised species response curves along the gradients are returned. Simulated abundance or occurrence data can be produced by sampling from one of several error distributions which use the parameterised species response curves as the expected count or probability of occurrence for the chosen error distribution. The available error distributions are
</p>
<ul>
<li>
Poisson
</li>
<li>
Negative binomial
</li>
<li>
Bernoulli (occurrence; binomial with denominator <em>m</em> = 1)
</li>
<li>
Binomial (counts with specified denominator <em>m</em>)
</li>
<li>
Beta-binomial
</li>
<li>
Zero-inflated Poisson (ZIP)
</li>
<li>
Zero-inflated negative binomial (ZINB)
</li>
</ul>
<p>
You can find the <a href="https://github.com/gavinsimpson/coenocliner/">source code on github</a> and <a href="https://github.com/gavinsimpson/coenocliner/issues">report</a> any bugs or issues there. In the remainder of this posting I give an overview of <strong>coenocliner</strong> and show three examples illustrating features of the package.
</p>
<h2 id="introduction-to-coenocliner">
Introduction to coenocliner
</h2>
<p>
To begin, load <strong>coenocliner</strong> and check the start-up message to see if you are using the current (0.1-0) release of the package
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"coenocliner"</span><span class="p">)</span></code></pre>
</figure>
<p>
The main function in <strong>coenocliner</strong> is <code>coenocline()</code>, which provides a relatively simple interface to coenocline simulation allowing flexible specification of gradient locations and response model parameters for species. Gradient locations are specified via argument <code>x</code>, which can be a single vector, or, in the case of two gradients, a matrix or a list containing vectors of gradient values. The matrix version assumes the first gradient's values are in the first column and those for the second gradient in the second column
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">xy</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">7</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">),</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">))</span></code></pre>
</figure>
<p>
Similarly, for the list version, the first component contains the values for the first gradient and the second component the values for the second gradient
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">xy</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">),</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">))</span></code></pre>
</figure>
<p>
The species response model used is indicated via the <code>responseModel</code> argument; available options are <code>"gaussian"</code> and <code>"beta"</code> for the classic Gaussian response model and the generalised beta response model respectively. Parameters are supplied to <code>coenocline()</code> via the <code>params</code> argument. <code>showParams()</code> can be used to list the parameters for the desired response model. The parameters for the Gaussian response model are
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">showParams</span><span class="p">(</span><span class="s2">"gaussian"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Species response model: Gaussian
Parameters:
[1] opt tol h*
Parameters marked with '*' are only supplied once</code></pre>
</figure>
<p>
As indicated, some parameters are only supplied once per species, regardless of whether there are one or two gradients. Hence for the Gaussian model, the parameter <code>h</code> is only supplied for the first gradient even if two gradients are required.
</p>
<p>
Parameters are supplied as a matrix with named columns, or as a list with named components. For example, for a Gaussian response for each of 3 species we could use either of the two forms
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">opt</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">6</span><span class="p">)</span><span class="w">
</span><span class="n">tol</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0.25</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">h</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="m">20</span><span class="p">,</span><span class="m">30</span><span class="p">)</span><span class="w">
</span><span class="n">parm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">opt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">opt</span><span class="p">,</span><span class="w"> </span><span class="n">tol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tol</span><span class="p">,</span><span class="w"> </span><span class="n">h</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">)</span><span class="w"> </span><span class="c1"># matrix form</span><span class="w">
</span><span class="n">parl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">opt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">opt</span><span class="p">,</span><span class="w"> </span><span class="n">tol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tol</span><span class="p">,</span><span class="w"> </span><span class="n">h</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">)</span><span class="w"> </span><span class="c1"># list form</span></code></pre>
</figure>
<p>
In the case of two gradients, a list with two components, one per gradient, is required. The first component contains parameters for the first gradient, the second element contains those for the second gradient. These components can be either a matrix or a list, as described previously. For example a list with parameters supplied as matrices
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">opty</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">25</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="m">75</span><span class="p">)</span><span class="w">
</span><span class="n">tol</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">20</span><span class="p">)</span><span class="w">
</span><span class="n">pars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">px</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">parm</span><span class="p">,</span><span class="w">
</span><span class="n">py</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">opt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">opty</span><span class="p">,</span><span class="w"> </span><span class="n">tol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tol</span><span class="p">))</span></code></pre>
</figure>
<p>
Note that parameter <span class="math inline">(h)</span> is not specified in the second set because this parameter, the height of the response curve at the gradient optimum, applies globally; in the case of two gradients, <span class="math inline">(h)</span> refers to the height of the bell-shaped curve at the bivariate optimum.
</p>
<p>
Notice also that parameters are specified at the species level. To evaluate the response curve at the supplied gradient locations, each set of parameters needs to be repeated for each gradient location. Thankfully, <code>coenocline()</code> takes care of this detail for us; the toy sketch below illustrates the idea.
</p>
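<p>
To see what this replication amounts to, here is a toy sketch (this is not <strong>coenocliner</strong>'s actual internal code; the object names are made up for illustration):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## toy sketch: 3 species evaluated at 4 gradient locations means each
## species' parameter values get recycled once per location
toy_locs <- c(4.0, 4.5, 5.0, 5.5)       # hypothetical gradient values
toy_opt  <- c(4, 5, 6)                  # hypothetical species optima
rep(toy_opt, each = length(toy_locs))   # one value per species-location pair</code></pre>
</figure>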
<p>
Additional parameters that the response model may need, but which are not specified at the species level, are supplied as a list with named components to the argument <code>extraParams</code>. An example is the correlation between Gaussian response curves in the case of two gradients. Unfortunately, this means that a single correlation between response curves applies to all species<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>, a consequence of a poor implementation choice. Thankfully this is relatively easy to fix, and will be fixed in version 0.2-0, along with a similar issue concerning the specification of additional parameters for the error distribution used (see below).
</p>
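<p>
For example (a sketch anticipating the two-gradient example later in the post; <code>locs2d</code> and <code>pars2d</code> are placeholder names for two-gradient locations and parameters):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## a single correlation between the two gradients, applied to every species
mu2 <- coenocline(locs2d, responseModel = "gaussian", params = pars2d,
                  extraParams = list(corr = 0.5), expectation = TRUE)</code></pre>
</figure>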
<p>
To simulate realistic count data we need to sample <em>with error</em> from the parameterised species response curves. Which of the distributions listed earlier is used is specified via the argument <code>countModel</code>; the available options are
</p>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] "poisson" "negbin" "bernoulli" "binary"
[5] "binomial" "betabinomial" "ZIP" "ZINB" </code></pre>
</figure>
<p>
Some of these distributions (all bar <code>"poisson"</code> and <code>"bernoulli"</code>) require additional arguments, such as the <span class="math inline">(\alpha)</span> parameter of (one parameterisation of) the negative binomial distribution. These arguments are supplied as a list with named components via the argument <code>countParams</code>. Again, owing to the same implementation snafu as for <code>extraParams</code>, such parameters act globally for all species<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>.
</p>
<p>
The final argument is <code>expectation</code>, which defaults to <code>FALSE</code>. When set to <code>TRUE</code>, the simulation of species counts or occurrences with error is skipped, and the values of the parameterised response curves evaluated at the gradient locations are returned instead. This option is handy if you want to look at or plot the species response curves used in a simulation.
</p>
<h2 id="example-usage">
Example usage
</h2>
<p>
In the next few sections the basic usage of <code>coenocline()</code> is illustrated.
</p>
<h3 id="gaussian-responses-along-a-single-gradient">
Gaussian responses along a single gradient
</h3>
<p>
This example, of multiple species responses along a single environmental gradient, illustrates the simplest usage of <code>coenocline()</code>. The example uses a hypothetical pH gradient with species optima drawn uniformly at random along the gradient. Species tolerances are the same for all species. The maximum abundance of each species, <span class="math inline">(h)</span>, is drawn from a lognormal distribution with a median of ~20 (<span class="math inline">(e^3)</span>). This simulation will be for a community of 20 species, evaluated at 100 equally spaced locations. First, we set up the parameters
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">M</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">20</span><span class="w"> </span><span class="c1"># number of species</span><span class="w">
</span><span class="n">ming</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">3.5</span><span class="w"> </span><span class="c1"># gradient minimum...</span><span class="w">
</span><span class="n">maxg</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">7</span><span class="w"> </span><span class="c1"># ...and maximum</span><span class="w">
</span><span class="n">locs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">ming</span><span class="p">,</span><span class="w"> </span><span class="n">maxg</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w"> </span><span class="c1"># gradient locations</span><span class="w">
</span><span class="n">opt</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ming</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">maxg</span><span class="p">)</span><span class="w"> </span><span class="c1"># species optima</span><span class="w">
</span><span class="n">tol</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0.25</span><span class="p">,</span><span class="w"> </span><span class="n">M</span><span class="p">)</span><span class="w"> </span><span class="c1"># species tolerances</span><span class="w">
</span><span class="n">h</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">ceiling</span><span class="p">(</span><span class="n">rlnorm</span><span class="p">(</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">meanlog</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w"> </span><span class="c1"># max abundances</span><span class="w">
</span><span class="n">pars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">opt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">opt</span><span class="p">,</span><span class="w"> </span><span class="n">tol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tol</span><span class="p">,</span><span class="w"> </span><span class="n">h</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">)</span><span class="w"> </span><span class="c1"># put in a matrix</span></code></pre>
</figure>
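<p>
As a quick sanity check on the maximum abundances (a sketch; the exact values depend on the seed), recall that <span class="math inline">(e^3)</span> is the median of a lognormal with <code>meanlog = 3</code>:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">exp(3)      # ~20.1, the median of the lognormal used for h
summary(h)  # the 20 sampled maximum abundances (after ceiling())</code></pre>
</figure>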
<p>
As a check, before simulating any count data, we can look at the coenocline implied by these parameters by returning the expectations only from <code>coenocline()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">mu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gaussian"</span><span class="p">,</span><span class="w"> </span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w">
</span><span class="n">expectation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<p>
This returns a matrix of values obtained by evaluating each species response curve at the supplied gradient locations. There is one column per species and one row per gradient location
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="nf">class</span><span class="p">(</span><span class="n">mu</span><span class="p">)</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">mu</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">mu</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">6</span><span class="p">])</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] "matrix"</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 100 20</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> [,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1.088 5.443e-20 1.433e-13 0.5025 1.461e-36 2.604e-38
[2,] 1.553 2.165e-19 4.414e-13 0.6938 9.370e-36 1.669e-37
[3,] 2.173 8.440e-19 1.333e-12 0.9391 5.892e-35 1.049e-36
[4,] 2.981 3.225e-18 3.945e-12 1.2460 3.631e-34 6.460e-36
[5,] 4.008 1.208e-17 1.144e-11 1.6203 2.194e-33 3.900e-35
[6,] 5.282 4.435e-17 3.254e-11 2.0655 1.299e-32 2.308e-34</code></pre>
</figure>
<p>
A quick way to visualise the parameterised species response is to use <code>matplot()</code><a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">matplot</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"solid"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pH"</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Abundance"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simulating-species-abundance-data-with-the-coenocliner-package-example1-plot-expectations.png" alt="Figure 1: Gaussian species response curves along a hypothetical pH gradient" />
<figcaption>
Figure 1: Gaussian species response curves along a hypothetical pH gradient
</figcaption>
</figure>
<p>
The resultant plot is shown in Figure 1.
</p>
<p>
As this looks OK, we can simulate some count data. The simplest model for doing so is to make random draws from a Poisson distribution with the mean, <span class="math inline">(\lambda)</span>, for each species set to the value of the response curve evaluated at each gradient location. Hence the values in <code>mu</code> that we just created can be thought of as the expected count per species at each of the gradient locations we are interested in. To simulate Poisson count data, use <code>expectation = FALSE</code> or remove this argument from the call. To be more explicit, we should also state <code>countModel = "poisson"</code><a href="#fn4" class="footnote-ref" id="fnref4"><sup>4</sup></a>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">simp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gaussian"</span><span class="p">,</span><span class="w"> </span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w">
</span><span class="n">countModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"poisson"</span><span class="p">)</span></code></pre>
</figure>
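<p>
For intuition, here is a by-hand equivalent of the Poisson step (a minimal sketch, assuming each count is an independent Poisson draw with mean taken from <code>mu</code>; <code>coenocline()</code> handles this internally):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## draw one Poisson count per species-location expectation in mu;
## rpois() recycles the vector of means element-wise, and refilling a
## matrix with nrow(mu) rows preserves the locations-by-species layout
simp_by_hand <- matrix(rpois(length(mu), lambda = mu), nrow = nrow(mu))</code></pre>
</figure>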
<p>
Again, <code>matplot()</code> is useful for visualising the simulated data
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">matplot</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">simp</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"solid"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"p"</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">,</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pH"</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Abundance"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simulating-species-abundance-data-with-the-coenocliner-package-example1-plot-simulations.png" alt="Figure 2: Simulated species abundances with Poisson errors from Gaussian response curves along a hypothetical pH gradient" />
<figcaption>
Figure 2: Simulated species abundances with Poisson errors from Gaussian response curves along a hypothetical pH gradient
</figcaption>
</figure>
<p>
The resultant plot is shown in Figure 2 above.
</p>
<p>
Whilst the simulated counts look reasonable and follow the response curves in Figure 2, there is a problem: the variation around the expected curves is too small. This is because the error variance implied by the Poisson distribution encapsulates only the variance that would arise from repeated sampling at the gradient locations. Most species abundance data exhibit much larger degrees of variation than that shown in Figure 2.
</p>
<p>
A solution to this is to sample from a distribution that incorporates additional variance, or <em>overdispersion</em>. A natural partner to the Poisson that includes overdispersion is the negative binomial. To simulate count data using the negative binomial distribution, we must alter <code>countModel</code> and supply the overdispersion parameter <span class="math inline">(\alpha)</span><a href="#fn5" class="footnote-ref" id="fnref5"><sup>5</sup></a> via <code>countParams</code>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">simnb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gaussian"</span><span class="p">,</span><span class="w"> </span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w">
</span><span class="n">countModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"negbin"</span><span class="p">,</span><span class="w"> </span><span class="n">countParams</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">))</span></code></pre>
</figure>
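<p>
Again for intuition, a by-hand sketch of the negative binomial draw. This assumes the common parameterisation in which the dispersion <span class="math inline">(\alpha)</span> enters <code>rnbinom()</code> as <code>size = 1/alpha</code>, giving variance <span class="math inline">(\mu + \alpha\mu^2)</span>; see <strong>coenocliner</strong>'s documentation for the parameterisation it actually uses:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## negative binomial draws with mean mu and dispersion alpha = 0.5,
## assuming size = 1/alpha in base R's rnbinom() parameterisation
simnb_by_hand <- matrix(rnbinom(length(mu), mu = mu, size = 1 / 0.5),
                        nrow = nrow(mu))</code></pre>
</figure>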
<p>
Using <code>matplot()</code>, it is apparent that the simulated species data are now far more realistic (Figure 3)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">matplot</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">simnb</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"solid"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"p"</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">,</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pH"</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Abundance"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simulating-species-abundance-data-with-the-coenocliner-package-example1-plot-nb-simulations.png" alt="Figure 3: Simulated species abundance with negative binomial errors from Gaussian response curves along a hypothetical pH gradient" />
<figcaption>
Figure 3: Simulated species abundance with negative binomial errors from Gaussian response curves along a hypothetical pH gradient
</figcaption>
</figure>
<h3 id="generalised-beta-responses-along-a-single-gradient">
Generalised beta responses along a single gradient
</h3>
<p>
In this example, I recreate figure 2 in <span class="citation" data-cites="Minchin1987">Minchin (1987)</span> and then simulate species abundances from the species response curves. The species parameters for the generalised beta response for the six species in <span class="citation" data-cites="Minchin1987">Minchin (1987)</span> are
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">A0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">5</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">7</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">9</span><span class="p">,</span><span class="m">8</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">10</span><span class="w"> </span><span class="c1"># max abundance</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">25</span><span class="p">,</span><span class="m">85</span><span class="p">,</span><span class="m">10</span><span class="p">,</span><span class="m">60</span><span class="p">,</span><span class="m">45</span><span class="p">,</span><span class="m">60</span><span class="p">)</span><span class="w"> </span><span class="c1"># location on gradient of modal abundance</span><span class="w">
</span><span class="n">r</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="m">3</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">6</span><span class="p">,</span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">10</span><span class="w"> </span><span class="c1"># species range of occurence on gradient</span><span class="w">
</span><span class="n">alpha</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.1</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">1.5</span><span class="p">,</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="c1"># shape parameter</span><span class="w">
</span><span class="n">gamma</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.1</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">0.5</span><span class="p">,</span><span class="m">4</span><span class="p">)</span><span class="w"> </span><span class="c1"># shape parameter</span><span class="w">
</span><span class="n">locs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">100</span><span class="w"> </span><span class="c1"># gradient locations</span><span class="w">
</span><span class="n">pars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">alpha</span><span class="p">,</span><span class="w">
</span><span class="n">gamma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gamma</span><span class="p">,</span><span class="w"> </span><span class="n">A0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">A0</span><span class="p">)</span><span class="w"> </span><span class="c1"># species parameters, in list form</span></code></pre>
</figure>
<p>
To recreate figure 2 in <span class="citation" data-cites="Minchin1987">Minchin (1987)</span>, we need to evaluate the parameterised generalised beta at the chosen gradient locations, <code>locs</code>. These values can be generated by passing <code>coenocline()</code> the gradient locations and the chosen species parameters as before, selecting the generalised beta response model and using <code>expectation = TRUE</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">mu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"beta"</span><span class="p">,</span><span class="w"> </span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w"> </span><span class="n">expectation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<p>
As before, <code>mu</code> is a matrix with one column per species
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">head</span><span class="p">(</span><span class="n">mu</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> [,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 0 44.52 0 0.5913 0
[2,] 0 0 49.39 0 1.6582 0
[3,] 0 0 53.90 0 3.0199 0
[4,] 0 0 57.97 0 4.6085 0
[5,] 0 0 61.52 0 6.3828 0
[6,] 0 0 64.51 0 8.3138 0</code></pre>
</figure>
<p>
and as such we can use <code>matplot()</code> to draw the species responses
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">matplot</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"solid"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Gradient"</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Abundance"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simulating-species-abundance-data-with-the-coenocliner-package-example2-beta-plot-expectations.png" alt="Figure 4: Generalised beta function species response curves along a hypothetical environmental gradient recreating Figure 2 in Minchin (1987)." />
<figcaption>
Figure 4: Generalised beta function species response curves along a hypothetical environmental gradient recreating Figure 2 in Minchin (1987).
</figcaption>
</figure>
<p>
Figure 4 is a good facsimile of figure 2 in <span class="citation" data-cites="Minchin1987">Minchin (1987)</span>.
</p>
<h3 id="gaussian-response-along-two-gradients">
Gaussian response along two gradients
</h3>
<p>
In this example, I illustrate how to simulate species abundances in an environment comprising two gradients. Parameters for the simulation are defined first, including the number of species and samples required, followed by definitions of the gradient units and lengths, the species optima and tolerances for each gradient, and the maximal abundance, <span class="math inline">(h)</span>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">30</span><span class="w"> </span><span class="c1"># number of samples</span><span class="w">
</span><span class="n">M</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">20</span><span class="w"> </span><span class="c1"># number of species</span><span class="w">
</span><span class="c1">## First gradient</span><span class="w">
</span><span class="n">ming1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">3.5</span><span class="w"> </span><span class="c1"># 1st gradient minimum...</span><span class="w">
</span><span class="n">maxg1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">7</span><span class="w"> </span><span class="c1"># ...and maximum</span><span class="w">
</span><span class="n">loc1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">ming1</span><span class="p">,</span><span class="w"> </span><span class="n">maxg1</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">N</span><span class="p">)</span><span class="w"> </span><span class="c1"># 1st gradient locations</span><span class="w">
</span><span class="n">opt1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ming1</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">maxg1</span><span class="p">)</span><span class="w"> </span><span class="c1"># species optima</span><span class="w">
</span><span class="n">tol1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">M</span><span class="p">)</span><span class="w"> </span><span class="c1"># species tolerances</span><span class="w">
</span><span class="n">h</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">ceiling</span><span class="p">(</span><span class="n">rlnorm</span><span class="p">(</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">meanlog</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w"> </span><span class="c1"># max abundances</span><span class="w">
</span><span class="n">par1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">opt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">opt1</span><span class="p">,</span><span class="w"> </span><span class="n">tol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tol1</span><span class="p">,</span><span class="w"> </span><span class="n">h</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">)</span><span class="w"> </span><span class="c1"># put in a matrix</span><span class="w">
</span><span class="c1">## Second gradient</span><span class="w">
</span><span class="n">ming2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="c1"># 2nd gradient minimum...</span><span class="w">
</span><span class="n">maxg2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w"> </span><span class="c1"># ...and maximum</span><span class="w">
</span><span class="n">loc2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">ming2</span><span class="p">,</span><span class="w"> </span><span class="n">maxg2</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">N</span><span class="p">)</span><span class="w"> </span><span class="c1"># 2nd gradient locations</span><span class="w">
</span><span class="n">opt2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ming2</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">maxg2</span><span class="p">)</span><span class="w"> </span><span class="c1"># species optima</span><span class="w">
</span><span class="n">tol2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">ceiling</span><span class="p">(</span><span class="n">runif</span><span class="p">(</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">))</span><span class="w"> </span><span class="c1"># species tolerances</span><span class="w">
</span><span class="n">par2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">opt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">opt2</span><span class="p">,</span><span class="w"> </span><span class="n">tol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tol2</span><span class="p">)</span><span class="w"> </span><span class="c1"># put in a matrix</span><span class="w">
</span><span class="c1">## Last steps...</span><span class="w">
</span><span class="n">pars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">px</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">par1</span><span class="p">,</span><span class="w"> </span><span class="n">py</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">par2</span><span class="p">)</span><span class="w"> </span><span class="c1"># put parameters into a list</span><span class="w">
</span><span class="n">locs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">loc1</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">loc2</span><span class="p">)</span><span class="w"> </span><span class="c1"># put gradient locations together</span></code></pre>
</figure>
<p>
Notice how the parameter sets for each gradient are individual matrices, combined into a list, <code>pars</code>, ready for use. Also different this time is the <code>expand.grid()</code> call, which generates all pairwise combinations of the locations on the two gradients. Each combination is a coordinate pair at which we'll evaluate the response curves, so together they form a grid of points over the gradient space; the toy example below shows the idea.
</p>
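<p>
A toy illustration of the <code>expand.grid()</code> behaviour (a sketch with three locations per gradient; the values are arbitrary):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">expand.grid(x = 1:3, y = c(10, 20, 30))
## x varies fastest: rows are (1,10), (2,10), (3,10), (1,20), ...</code></pre>
</figure>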
<p>
Having set up the parameters, the call to <code>coenocline()</code> is the same as before, except now we specify a degree of correlation between the two gradients via <code>extraParams = list(corr = 0.5)</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">mu2d</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gaussian"</span><span class="p">,</span><span class="w">
</span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w"> </span><span class="n">extraParams</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">corr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">),</span><span class="w">
</span><span class="n">expectation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<p>
<code>mu2d</code> now contains a matrix of expected species abundances, one column per species as before. Because of the way <code>expand.grid()</code> works, the ordering of species abundances in each column has the first gradient's locations varying fastest; the locations on the first gradient are repeated in order for each location on the second gradient
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">head</span><span class="p">(</span><span class="n">locs</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> x y
1 3.500 1
2 3.621 1
3 3.741 1
4 3.862 1
5 3.983 1
6 4.103 1</code></pre>
</figure>
<p>
As a result, we can reshape the abundances for a single species into a matrix reflecting the grid of locations over the gradient space via a simple <code>matrix()</code> call, setting the number of columns in the resultant matrix equal to the number of locations on the second gradient. By way of illustration, this approach is used below to prepare the expected abundances for four of the species in <code>mu2d</code> for plotting with the <code>persp()</code> function
</p>
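<p>
For a single species, the reshape is just the following (a sketch; species 2 is chosen arbitrarily):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## rows follow loc1 (varies fastest), columns follow loc2
z <- matrix(mu2d[, 2], ncol = length(loc2))
dim(z)  # 30 x 30 grid over the gradient space</code></pre>
</figure>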
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">layout</span><span class="p">(</span><span class="n">matrix</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">op</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">par</span><span class="p">(</span><span class="n">mar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">8</span><span class="p">,</span><span class="m">13</span><span class="p">,</span><span class="m">19</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">persp</span><span class="p">(</span><span class="n">loc1</span><span class="p">,</span><span class="w"> </span><span class="n">loc2</span><span class="p">,</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">mu2d</span><span class="p">[,</span><span class="w"> </span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">loc2</span><span class="p">)),</span><span class="w">
</span><span class="n">ticktype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"detailed"</span><span class="p">,</span><span class="w"> </span><span class="n">zlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Abundance"</span><span class="p">,</span><span class="w">
</span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">45</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">op</span><span class="p">)</span><span class="w">
</span><span class="n">layout</span><span class="p">(</span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simulating-species-abundance-data-with-the-coenocliner-package-example3-persp-plots.png" alt="Figure 5: Bivariate Gaussian species responses for four selected species." />
<figcaption>
Figure 5: Bivariate Gaussian species responses for four selected species.
</figcaption>
</figure>
<p>
The selected species response curves are shown in Figure 5.
</p>
<p>
Simulated counts for each species can be produced by removing <code>expectation = TRUE</code> from the call and choosing an error distribution to make random draws from. For example, for negative binomial errors with dispersion <span class="math inline">(\alpha = 1)</span>, we can use
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">sim2d</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gaussian"</span><span class="p">,</span><span class="w">
</span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w"> </span><span class="n">extraParams</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">corr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">),</span><span class="w">
</span><span class="n">countModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"negbin"</span><span class="p">,</span><span class="w"> </span><span class="n">countParams</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span></code></pre>
</figure>
<p>
The resulting simulated counts for the same four selected species are shown in Figure 6, which was generated using the code below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">layout</span><span class="p">(</span><span class="n">matrix</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">op</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">par</span><span class="p">(</span><span class="n">mar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">8</span><span class="p">,</span><span class="m">13</span><span class="p">,</span><span class="m">19</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">persp</span><span class="p">(</span><span class="n">loc1</span><span class="p">,</span><span class="w"> </span><span class="n">loc2</span><span class="p">,</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">sim2d</span><span class="p">[,</span><span class="w"> </span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">loc2</span><span class="p">)),</span><span class="w">
</span><span class="n">ticktype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"detailed"</span><span class="p">,</span><span class="w"> </span><span class="n">zlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Abundance"</span><span class="p">,</span><span class="w">
</span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">45</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">op</span><span class="p">)</span><span class="w">
</span><span class="n">layout</span><span class="p">(</span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simulating-species-abundance-data-with-the-coenocliner-package-example3-persp-plots2.png" alt="Figure 6: Simulated counts using negative binomial errors from bivariate Gaussian species responses for four selected species." />
<figcaption>
Figure 6: Simulated counts using negative binomial errors from bivariate Gaussian species responses for four selected species.
</figcaption>
</figure>
<div id="refs" class="references">
<div id="ref-Allaby1998">
<p>
Allaby, M. (1998). <em>A dictionary of ecology</em>. 2nd edition. Oxford University Press.
</p>
</div>
<div id="ref-Minchin1987">
<p>
Minchin, P. R. (1987). Simulation of multidimensional community patterns: Towards a comprehensive model. <em>Vegetatio</em> 71, 145–156.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
This is not strictly true: you can work out how the species parameters are replicated relative to gradient values and hence pass a vector of the correct length with the species-specific values included. Study the output of <code>expand()</code> when supplied with gradient locations and parameters to work out how to specify <code>extraParams</code> appropriately.<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
<li id="fn2">
<p>
Again, this is not strictly true: you can work out how the species parameters are replicated relative to gradient values and hence pass a vector of the correct length with the species-specific values included. Study the output of <code>expand()</code> when supplied with gradient locations and parameters to work out how to specify <code>countParams</code> appropriately.<a href="#fnref2" class="footnote-back">↩</a>
</p>
</li>
<li id="fn3">
<p>
Until such a time as the <strong>coenocliner</strong> package has a <code>plot</code> method…<a href="#fnref3" class="footnote-back">↩</a>
</p>
</li>
<li id="fn4">
<p>
<code>countModel = "poisson"</code> is the default, so this can be excluded from the call.<a href="#fnref4" class="footnote-back">↩</a>
</p>
</li>
<li id="fn5">
<p>
Recall that in version 0.1-0 of <strong>coenocliner</strong> this parameter can only easily be specified globally.<a href="#fnref5" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
Confidence intervals for derivatives of splines in GAMs
Gavin L. Simpson
2014-06-16T00:00:00-06:00
2014-06-16T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2014/06/16/simultaneous-confidence-intervals-for-derivatives/
<p>
<a href="/2011/06/12/additive-modelling-and-the-hadcrut3v-global-mean-temperature-series/">Last time out</a> I looked at one of the complications of time series modelling with smoothers; you have a non-linear trend which may be statistically significant but it may not be increasing or decreasing everywhere. How do we identify where in the series the data are changing? In that post I explained how we can use the first derivatives of the model splines for this purpose, and used the method of finite differences to estimate them. To assess statistical significance of the derivative (the rate of change) I relied upon asymptotic normality and the usual pointwise confidence interval. That interval is fine if looking at just one point on the spline (not of much practical use), but when considering more points at once we have a multiple comparisons issue. Instead, a simultaneous interval is required, and for that we need to revisit a technique I <a href="/2011/06/12/additive-modelling-and-the-hadcrut3v-global-mean-temperature-series/">blogged about a few years ago</a>; posterior simulation from the fitted GAM.
</p>
<p>
To get a head start on this, I'll reuse the model we fitted to the <abbr title="Central England Temperature">CET</abbr> time series from the previous post. Just copy and paste the code below into your R session
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## Load the CET data and process as per other blog post</span><span class="w">
</span><span class="n">tmpf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="n">download.file</span><span class="p">(</span><span class="s2">"https://gist.github.com/gavinsimpson/b52f6d375f57d539818b/raw/2978362d97ee5cc9e7696d2f36f94762554eefdf/load-process-cet-monthly.R"</span><span class="p">,</span><span class="w">
</span><span class="n">tmpf</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"wget"</span><span class="p">)</span><span class="w">
</span><span class="n">source</span><span class="p">(</span><span class="n">tmpf</span><span class="p">)</span><span class="w">
</span><span class="c1">## Load mgcv and fit the model</span><span class="w">
</span><span class="n">require</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span><span class="w">
</span><span class="n">ctrl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">niterEM</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">msVerbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">optimMethod</span><span class="o">=</span><span class="s2">"L-BFGS-B"</span><span class="p">)</span><span class="w">
</span><span class="n">m2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cc"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Time</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="o">|</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
</span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">)</span><span class="w">
</span><span class="c1">## prediction data</span><span class="w">
</span><span class="n">want</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">cet</span><span class="p">),</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)</span><span class="w">
</span><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">cet</span><span class="p">,</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">Time</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Time</span><span class="p">[</span><span class="n">want</span><span class="p">],</span><span class="w"> </span><span class="n">Date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Date</span><span class="p">[</span><span class="n">want</span><span class="p">],</span><span class="w">
</span><span class="n">nMonth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nMonth</span><span class="p">[</span><span class="n">want</span><span class="p">]))</span></code></pre>
</figure>
<p>
Here, I'll use a version of the <code>Deriv()</code> function from the last post, modified to do the posterior simulation; it's called <code>derivSimulCI()</code>. Let's load that too
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## download the derivatives gist</span><span class="w">
</span><span class="n">tmpf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="n">download.file</span><span class="p">(</span><span class="s2">"https://gist.githubusercontent.com/gavinsimpson/ca18c9c789ef5237dbc6/raw/295fc5cf7366c831ab166efaee42093a80622fa8/derivSimulCI.R"</span><span class="p">,</span><span class="w">
</span><span class="n">tmpf</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"wget"</span><span class="p">)</span><span class="w">
</span><span class="n">source</span><span class="p">(</span><span class="n">tmpf</span><span class="p">)</span></code></pre>
</figure>
<h2 id="posterior-simulation">
Posterior simulation
</h2>
<p>
The sorts of GAMs fitted by <code>mgcv::gam()</code> are, if we assume normally distributed errors, really just a linear regression. Instead of being a linear model in the original data, however, the linear model is fitted using the basis functions as the covariates<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>. As with any other linear model, we get back from it the point estimates, the <span class="math inline">\(\hat{\beta}_j\)</span>, and their standard errors. Consider the simple linear regression of <em>y</em> on <em>x</em>. Such a model has two terms
</p>
<ol type="1">
<li>
the constant term (the model intercept), and
</li>
<li>
the effect on <em>y</em> of a unit change in <em>x</em>.
</li>
</ol>
<p>
In fitting the model we get a point estimate for each term, plus their standard errors in the form of the variance-covariance (VCOV) matrix of the terms<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>. Taken together, the point estimates of the model terms and the VCOV describe a multivariate normal distribution. In the case of the simple linear regression, this is a bivariate normal. Note that the point estimates are known as the <em>mean vector</em> of the multivariate normal; each point estimate is the mean, or expectation, of a single random normal variable whose variance is given by the standard error of the point estimate.
</p>
<p>
Computers are good at simulating data and you'll most likely be familiar with <code>rnorm()</code> to generate random, normally distributed values from a distribution with mean 0 and unit standard deviation. Well, simulating from a multivariate normal is just as simple<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a>, as long as you have the mean vector and the variance-covariance matrix of the parameters.
</p>
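<p>
To make the step from univariate to multivariate draws concrete, here is a trivial sketch (not from the original analysis; illustrative values only) comparing the two, using an identity covariance matrix so the two components are independent:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## three draws from a univariate standard normal
rnorm(3)
## three draws from a bivariate normal: a zero mean vector and an
## identity covariance matrix, so each draw is a pair of values
MASS::mvrnorm(3, mu = c(0, 0), Sigma = diag(2))</code></pre>
</figure>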
<p>
Returning to the simple linear regression case, let's do a little simulation from a known model and look at the multivariate normal distribution of the model parameters.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">))</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1.45</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="c1">## sort dat on x to make things easier later</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dat</span><span class="p">[</span><span class="n">order</span><span class="p">(</span><span class="n">dat</span><span class="o">$</span><span class="n">x</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">mod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">)</span></code></pre>
</figure>
<p>
The mean vector for the multivariate normal is just the set of model coefficients for <code>mod</code>, which are extracted using the <code>coef()</code> function, and the <code>vcov()</code> function is used to extract the VCOV of the fitted model.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">(Intercept) x
4.412706 1.499317 </code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="p">(</span><span class="n">vc</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vcov</span><span class="p">(</span><span class="n">mod</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> (Intercept) x
(Intercept) 0.44563330 -0.033760188
x -0.03376019 0.003114669</code></pre>
</figure>
<p>
Remember, the standard error is the square root of the diagonal elements of the VCOV
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">summary</span><span class="p">(</span><span class="n">mod</span><span class="p">))[,</span><span class="w"> </span><span class="s2">"Std. Error"</span><span class="p">]</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">(Intercept) x
0.66755771 0.05580922 </code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">diag</span><span class="p">(</span><span class="n">vc</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">(Intercept) x
0.66755771 0.05580922 </code></pre>
</figure>
<p>
The multivariate normal distribution is not part of the base R distributions set. Several implementations are available in a range of packages, but here I'll use the one in the <strong>MASS</strong> package, which ships with all versions of R. To draw a nice plot, I'll simulate a large number of values, but we'll just show the first few below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">require</span><span class="p">(</span><span class="s2">"MASS"</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">nsim</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">5000</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">sim</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mvrnorm</span><span class="p">(</span><span class="n">nsim</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">mod</span><span class="p">),</span><span class="w"> </span><span class="n">Sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">vc</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">sim</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> (Intercept) x
[1,] 4.398392 1.476528
[2,] 4.536195 1.496449
[3,] 5.327903 1.426689
[4,] 4.810953 1.446152
[5,] 4.215752 1.509888
[6,] 4.153201 1.528336</code></pre>
</figure>
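<p>
Before going further, a quick sanity check (assuming <code>sim</code> and <code>mod</code> from above): the simulated draws should be centred on the fitted coefficients.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## the column means of the draws should lie close to the point estimates
colMeans(sim)
coef(mod)</code></pre>
</figure>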
<p>
Each row of <code>sim</code> contains a pair of values, one intercept and one <span class="math inline">\(\hat{\beta}_x\)</span>, from the implied multivariate normal. The models implied by each row are all consistent with the fitted model. To visualize the multivariate normal for <code>mod</code> I'll use a bivariate kernel density estimate to estimate the density of points over a grid of simulated intercept and slope values
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">kde</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">kde2d</span><span class="p">(</span><span class="n">sim</span><span class="p">[,</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">sim</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">75</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">plot</span><span class="p">(</span><span class="n">sim</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">19</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"darkgrey"</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">contour</span><span class="p">(</span><span class="n">kde</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">kde</span><span class="o">$</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">kde</span><span class="o">$</span><span class="n">z</span><span class="p">,</span><span class="w"> </span><span class="n">add</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">drawlabels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-confidence-intervals-for%20derivatives-contour-plot-1.png" alt="5000 random draws from the posterior distribution of the parameters of the fitted linear regression model. Contours are for a 2d kernel density estimate of the points." />
<figcaption>
5000 random draws from the posterior distribution of the parameters of the fitted linear regression model. Contours are for a 2d kernel density estimate of the points.
</figcaption>
</figure>
<p>
The large spread in the points (from top left to bottom right) is illustrative of greater uncertainty in the intercept term than in <span class="math inline">\(\hat{\beta}_x\)</span>.
</p>
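<p>
That top-left to bottom-right orientation also reflects the strong negative correlation between the two parameters, which we can confirm directly (a quick check using <code>sim</code> and <code>vc</code> from above):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## correlation between intercept and slope; both routes give roughly
## -0.91 for this fit
cor(sim)[1, 2]     ## from the simulated draws
cov2cor(vc)[1, 2]  ## from the estimated VCOV</code></pre>
</figure>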
<p>
As I said earlier, each point on the plot represents a valid model consistent with the estimates we achieved for the sample of data used to fit the model. If we were to multiply the second column of <code>sim</code> by the observed data and add on the first column of <code>sim</code>, we'd obtain fitted values for the observed <code>x</code> values for 5000 simulations from the fitted model, as shown in the plot below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">plot</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">take</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">sim</span><span class="p">),</span><span class="w"> </span><span class="m">50</span><span class="p">)</span><span class="w"> </span><span class="c1">## take 50 simulations at random</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">fits</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">sim</span><span class="p">[</span><span class="n">take</span><span class="p">,</span><span class="w"> </span><span class="p">])</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">matlines</span><span class="p">(</span><span class="n">dat</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">fits</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"#A9A9A97D"</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"solid"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">abline</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">matlines</span><span class="p">(</span><span class="n">dat</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">interval</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"confidence"</span><span class="p">)[,</span><span class="m">-1</span><span class="p">],</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"dashed"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-confidence-intervals-for%20derivatives-linear-regression-plus-simulations-plot-1.png" alt="Fitted linear model (red line), 50 posterior simulations (grey band), and the 95% point-wise confidence interval (red dashed lines)" />
<figcaption>
Fitted linear model (red line), 50 posterior simulations (grey band), and the 95% point-wise confidence interval (red dashed lines)
</figcaption>
</figure>
<p>
The grey lines show the model fits for a random sample of 50 pairs of coefficients from the set of simulated values.
</p>
<h2 id="posterior-simulation-for-additive-models">
Posterior simulation for additive models
</h2>
<p>
You'll be pleased to know that there is very little difference (none, really) between what I just went through above for a simple linear regression and what is required to simulate from the posterior distribution of a GAM. However, instead of dealing with just a few regression coefficients, we now have to concern ourselves with the potentially much larger number of coefficients corresponding to the basis functions that combine to form the fitted splines. The only practical difference is that instead of multiplying each simulation by the observed data<a href="#fn4" class="footnote-ref" id="fnref4"><sup>4</sup></a>, with <strong>mgcv</strong> we generate the linear predictor matrix for the observations and multiply that by each set of simulated coefficients. If you've read the <a href="/2014/05/15/identifying-periods-of-change-with-gams/">previous post</a> you should be somewhat familiar with the <code>lpmatrix</code> by now.
</p>
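<p>
In outline, the whole recipe is just three steps. Here is a minimal sketch (object names as defined earlier for the CET model; the number of draws is arbitrary) that simulates complete fitted values from the model, before we narrow things down to a single smooth:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## 1. the basis functions evaluated at the prediction points
Xp <- predict(m2$gam, newdata = pdat, type = "lpmatrix")
## 2. draws from the posterior of all the model coefficients
betas <- mvrnorm(1000, mu = coef(m2$gam), Sigma = vcov(m2$gam))
## 3. one column of simulated fitted values per draw (200 x 1000)
simFit <- Xp %*% t(betas)</code></pre>
</figure>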
<p>
Before we get to posterior simulations for the derivatives of the CET additive model fitted earlier, let's look at some simulations for the trend term in that model, <code>m2</code>. If you look back at an earlier code block, I created a grid of 200 points over the range of the data which we'll use to evaluate properties of the fitted model. This is in object <code>pdat</code>. First we generate the linear predictor matrix using <code>predict()</code> and grab the model coefficients and the variance-covariance matrix of the coefficients
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">lp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m2</span><span class="o">$</span><span class="n">gam</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lpmatrix"</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">coefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">m2</span><span class="o">$</span><span class="n">gam</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">vc</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vcov</span><span class="p">(</span><span class="n">m2</span><span class="o">$</span><span class="n">gam</span><span class="p">)</span></code></pre>
</figure>
<p>
Next, generate a small sample from the posterior of the model, just for the purposes of illustration; we'll generate far larger samples later when we estimate a confidence interval on the derivatives of the trend spline.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">35</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">sim</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mvrnorm</span><span class="p">(</span><span class="m">25</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">coefs</span><span class="p">,</span><span class="w"> </span><span class="n">Sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">vc</span><span class="p">)</span></code></pre>
</figure>
<p>
The linear predictor matrix, <code>lp</code>, has a column for every basis function in the model, plus the constant term, but because the model is additive we can ignore the columns relating to the <code>nMonth</code> spline and the constant term, and just work with the coefficients and columns of <code>lp</code> that pertain to the trend spline. Let's identify those
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">want</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">grep</span><span class="p">(</span><span class="s2">"Time"</span><span class="p">,</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">lp</span><span class="p">))</span></code></pre>
</figure>
<p>
Again, a simple bit of matrix multiplication gets us fitted values for the trend spline only
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">fits</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lp</span><span class="p">[,</span><span class="w"> </span><span class="n">want</span><span class="p">]</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">sim</span><span class="p">[,</span><span class="w"> </span><span class="n">want</span><span class="p">])</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="nf">dim</span><span class="p">(</span><span class="n">fits</span><span class="p">)</span><span class="w"> </span><span class="c1">## 200 rows, 1 per evaluation point; 25 columns, 1 per simulation</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 200 25</code></pre>
</figure>
<p>
We can now draw out each of these posterior simulations as follows
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">ylims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">range</span><span class="p">(</span><span class="n">fits</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">plot</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">19</span><span class="p">,</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ylims</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">matlines</span><span class="p">(</span><span class="n">pdat</span><span class="o">$</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">fits</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"solid"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-confidence-intervals-for%20derivatives-cet-model-trend-posterior-simulations-1.png" alt="Posterior simulations for the trend spline of the additive model fitted to the CET time series" />
<figcaption>
Posterior simulations for the trend spline of the additive model fitted to the CET time series
</figcaption>
</figure>
<h2 id="posterior-simulation-for-the-first-derivatives-of-a-spline">
Posterior simulation for the first derivatives of a spline
</h2>
<p>
As we saw in the previous post, the linear predictor matrix can be used to generate finite differences-based estimates of the derivatives of a spline in a GAM fitted by <strong>mgcv</strong>. And as we just went through, we can combine posterior simulations with the linear predictor matrix. The main steps in the process of computing the finite differences and doing the posterior simulation are
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">X0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">newDF</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lpmatrix"</span><span class="p">)</span><span class="w">
</span><span class="n">newDF</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">newDF</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">eps</span><span class="w">
</span><span class="n">X1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">newDF</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lpmatrix"</span><span class="p">)</span><span class="w">
</span><span class="n">Xp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="n">X1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">X0</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">eps</span></code></pre>
</figure>
<p>
where two linear predictor matrices are created, offset from one another by a small amount <code>eps</code>, and differenced to get the slope of the spline, and
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">nt</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">Xi</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Xp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">want</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">grep</span><span class="p">(</span><span class="n">t.labs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">X1</span><span class="p">))</span><span class="w">
</span><span class="n">Xi</span><span class="p">[,</span><span class="w"> </span><span class="n">want</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Xp</span><span class="p">[,</span><span class="w"> </span><span class="n">want</span><span class="p">]</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Xi</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">simu</span><span class="p">[,</span><span class="w"> </span><span class="n">want</span><span class="p">])</span><span class="w"> </span><span class="c1"># derivatives</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
which loops over the terms in the model, selects the relevant columns from the differenced predictor matrix, and computes the derivatives by a matrix multiplication with the set of posterior simulations. <code>simu</code> is the matrix of random draws from the posterior, multivariate normal distribution of the fitted model's parameters. Note that the code in <code>derivSimulCI()</code> is slightly different to this, but it does the same thing.
</p>
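<p>
For a single term you can skip the loop entirely. The following sketch (reusing <code>m2</code> and <code>pdat</code> from above; it is not the exact internals of <code>derivSimulCI()</code>) computes simulated first derivatives of the trend spline directly:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">eps <- 1e-07
X0  <- predict(m2$gam, pdat, type = "lpmatrix")
## shift the continuous covariates by eps and re-evaluate the basis
X1  <- predict(m2$gam, transform(pdat, Time = Time + eps, nMonth = nMonth + eps),
               type = "lpmatrix")
Xp  <- (X1 - X0) / eps
want <- grep("Time", colnames(Xp))
## posterior draws of the coefficients, then derivatives by matrix multiply
simu <- mvrnorm(10000, mu = coef(m2$gam), Sigma = vcov(m2$gam))
dsim <- Xp[, want] %*% t(simu[, want])  ## 200 rows x 10000 simulations</code></pre>
</figure>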
<p>
To cut to the chase then, here is the code required to generate posterior simulations for the first derivatives of the spline terms in an additive model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">fd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">derivSimulCI</span><span class="p">(</span><span class="n">m2</span><span class="p">,</span><span class="w"> </span><span class="n">samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">)</span></code></pre>
</figure>
<p>
<code>fd</code> is a list, the first <em>n</em> components of which relate to the <em>n</em> smooth terms in the model. Here <em>n</em> = 2. The names of the first two components are the names of the terms referenced in the model formula used to fit the model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">str</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">List of 5
$ nMonth :List of 2
$ Time :List of 2
$ gamModel:List of 31
..- attr(*, "class")= chr "gam"
$ eps : num 1e-07
$ eval : num [1:200, 1:2] 1 1.06 1.11 1.17 1.22 ...
..- attr(*, "dimnames")=List of 2
- attr(*, "class")= chr "derivSimulCI"</code></pre>
</figure>
<p>
As I haven't yet written a <code>confint()</code> method, we'll need to compute the confidence interval by hand, which is no bad thing of course! We do this by taking two extreme quantiles of the distribution of the 10,000 posterior simulations we generated for the first derivative <em>at each</em> of the 200 points at which we wanted to evaluate the derivative. One of the reasons I did 10,000 simulations is that for a 95% confidence interval we only need to sort the simulated derivatives in ascending order and extract the 250th and the 9750th of these ordered values. In practice we'll let the <code>quantile()</code> function do the hard work
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">CI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="n">fd</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">],</span><span class="w">
</span><span class="o">+</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">x</span><span class="o">$</span><span class="n">simulations</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">quantile</span><span class="p">,</span><span class="w">
</span><span class="o">+</span><span class="w"> </span><span class="n">probs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">)))</span></code></pre>
</figure>
<p>
<code>CI</code> is now a list with two components, each of which contains a matrix with two rows (the two probability quantiles we asked for) and 200 columns (the number of locations at which the first derivative was evaluated).
</p>
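<p>
The ordered-values shortcut mentioned above is easy to verify for any single evaluation point (a quick check, assuming <code>fd</code> from above):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## at the first evaluation point of the trend spline, sort the 10,000
## simulated derivatives and read off the 250th and 9750th values
srt <- sort(fd[[2]]$simulations[1, ])
c(lower = srt[250], upper = srt[9750])  ## close to CI[[2]][, 1]</code></pre>
</figure>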
<p>
There is a <code>plot()</code> method, which by default produces plots of all the terms in the model and includes the <del>simultaneous</del> point-wise confidence interval as well
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">plot</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span><span class="w"> </span><span class="n">sizer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-confidence-intervals-for%20derivatives-plot-deriv-method-1.png" alt="First derivative of the seasonal and trend splines from the CET time series additive model. The grey band is a 95% simultaneous point-wise confidence interval. Sections of the spline where the confidence interval does not include zero are indicated by coloured sections." />
<figcaption>
First derivative of the seasonal and trend splines from the CET time series additive model. The grey band is a 95% <del>simultaneous</del> point-wise confidence interval. Sections of the spline where the confidence interval does not include zero are indicated by coloured sections.
</figcaption>
</figure>
<h2 id="wrapping-up">
Wrapping up
</h2>
<p>
<code>derivSimulCI()</code> computes the actual derivative as well as the derivatives for each simulation. Rather than rely upon the <code>plot()</code> method we could draw our own plot with the confidence interval. To extract the derivative of the fitted spline use
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">fit.fd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fd</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="o">$</span><span class="n">deriv</span></code></pre>
</figure>
<p>
and then, to produce a plot showing the actual derivative, the 95% <del>simultaneous</del> point-wise confidence interval, and 20 of the derivatives from the posterior simulations, we can use
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">76</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">take</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">fd</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="o">$</span><span class="n">simulations</span><span class="p">),</span><span class="w"> </span><span class="m">20</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">plot</span><span class="p">(</span><span class="n">pdat</span><span class="o">$</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">fit.fd</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">range</span><span class="p">(</span><span class="n">CI</span><span class="p">[[</span><span class="m">2</span><span class="p">]]),</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">matlines</span><span class="p">(</span><span class="n">pdat</span><span class="o">$</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">CI</span><span class="p">[[</span><span class="m">2</span><span class="p">]]),</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"dashed"</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">matlines</span><span class="p">(</span><span class="n">pdat</span><span class="o">$</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">fd</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="o">$</span><span class="n">simulations</span><span class="p">[,</span><span class="w"> </span><span class="n">take</span><span class="p">],</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"solid"</span><span class="p">,</span><span class="w">
</span><span class="o">+</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"grey"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-confidence-intervals-for%20derivatives-plot-deriv-ci-by-hand-1.png" alt="First derivative of the trend spline from the CET time series additive model. The red dashed lines enclose the 95% simultaneous point-wise confidence interval. Superimposed are the first derivatives of the splines for 20 randomly selected posterior simulations from the fitted spline." />
<figcaption>
First derivative of the trend spline from the CET time series additive model. The red dashed lines enclose the 95% <del>simultaneous</del> point-wise confidence interval. Superimposed are the first derivatives of the splines for 20 randomly selected posterior simulations from the fitted spline.
</figcaption>
</figure>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
It is a little bit more complex than this, of course. If you allow <code>gam()</code> to select the degree of smoothness then you need to fit a penalized regression. Plus, the time series models fitted to the CET data aren't fitted via <code>gam()</code> but via <code>gamm()</code>, where we are using the observation that a penalized regression can be expressed as a linear mixed model, with random effects being used to represent some of the penalty terms. If you specify the degree of smoothing to use, these complications go away.<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
<li id="fn2">
<p>
The (squares of the) standard errors are on the diagonal of the VCOV, with the relationship between pairs of parameters being contained in the off-diagonal elements.<a href="#fnref2" class="footnote-back">↩</a>
</p>
</li>
<li id="fn3">
<p>
in practice. I suspect it is not quite so simple if one had to sit down and implement it…<a href="#fnref3" class="footnote-back">↩</a>
</p>
</li>
<li id="fn4">
<p>
or a set of new values at which you want to evaluate the confidence interval<a href="#fnref4" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
Identifying periods of change in time series with GAMs
Gavin L. Simpson
2014-05-15T00:00:00-06:00
2014-05-15T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2014/05/15/identifying-periods-of-change-with-gams/
<p>
In previous posts (<a href="/2011/06/12/additive-modelling-and-the-hadcrut3v-global-mean-temperature-series/">here</a> and <a href="/2011/07/21/smoothing-temporally-correlated-data/">here</a>) I looked at how generalized additive models (GAMs) can be used to model non-linear trends in time series data. In my previous <a href="/2014/05/09/modelling-seasonal-data-with-gam/">post</a> I extended the modelling approach to deal with seasonal data where we model both the within year (seasonal) and between year (trend) variation with separate smooth functions. One of the complications of time series modelling with smoothers is how to summarize the fitted model; you have a non-linear trend which may be statistically significant but it may not be increasing or decreasing everywhere. How do we identify where in the series the data are changing? That's the topic of this post, in which I'll use the method of finite differences to estimate the rate of change (slope) in the fitted smoother and, through some <strong>mgcv</strong> magic, use the information recorded in the fitted model to identify periods of statistically significant change in the time series.
</p>
<h2 id="catching-up">
Catching up
</h2>
<p>
First off, if you haven't already done so, go read my <a href="/2014/05/09/modelling-seasonal-data-with-gam/">post on modelling seasonal data with GAMs</a> as it provides the background info on the data and the model I'll be looking at here, and also explains how we fitted the model. Don't worry; you don't need to run all that code to follow this post as I have put the relevant data processing parts in a Github <a href="https://gist.github.com/gavinsimpson/b52f6d375f57d539818b">gist</a> that we'll download and <code>source()</code> shortly.
</p>
<p>
OK, now that you've read the previous post we can begin…
</p>
<p>
To bring you up to speed, I put the bits of code from the previous post that we need here in a <a href="https://gist.github.com/gavinsimpson/b52f6d375f57d539818b">gist</a> on Github. To run this code you can simply do the following
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## Load the CET data and process as per other blog post</span><span class="w">
</span><span class="n">tmpf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="n">download.file</span><span class="p">(</span><span class="s2">"https://gist.github.com/gavinsimpson/b52f6d375f57d539818b/raw/2978362d97ee5cc9e7696d2f36f94762554eefdf/load-process-cet-monthly.R"</span><span class="p">,</span><span class="w">
</span><span class="n">tmpf</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"wget"</span><span class="p">)</span><span class="w">
</span><span class="n">source</span><span class="p">(</span><span class="n">tmpf</span><span class="p">)</span><span class="w">
</span><span class="n">ls</span><span class="p">()</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] "annCET" "cet" "CET" "rn" "tmpf" "Years" </code></pre>
</figure>
<p>
The gist contains code to download and process the monthly Central England Temperature (CET) time series so that it is ready for analysis via <code>gamm()</code>. Next we fit an additive model with seasonal and trend smooths and an AR(2) process for the residuals; the code predicts from the model at 200 locations over the entire time series and generates a pointwise, approximate 95% confidence interval on the trend spline.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ctrl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">niterEM</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">msVerbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">optimMethod</span><span class="o">=</span><span class="s2">"L-BFGS-B"</span><span class="p">)</span><span class="w">
</span><span class="n">m2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cc"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Time</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="o">|</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
</span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">)</span><span class="w">
</span><span class="n">want</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">cet</span><span class="p">),</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)</span><span class="w">
</span><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">cet</span><span class="p">,</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">Time</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Time</span><span class="p">[</span><span class="n">want</span><span class="p">],</span><span class="w"> </span><span class="n">Date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Date</span><span class="p">[</span><span class="n">want</span><span class="p">],</span><span class="w">
</span><span class="n">nMonth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nMonth</span><span class="p">[</span><span class="n">want</span><span class="p">]))</span><span class="w">
</span><span class="n">p2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m2</span><span class="o">$</span><span class="n">gam</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"terms"</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">p2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p2</span><span class="o">$</span><span class="n">fit</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">se2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p2</span><span class="o">$</span><span class="n">se.fit</span><span class="p">[,</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="n">df.res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">df.residual</span><span class="p">(</span><span class="n">m2</span><span class="o">$</span><span class="n">gam</span><span class="p">)</span><span class="w">
</span><span class="n">crit.t</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">qt</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="n">df.res</span><span class="p">,</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w">
</span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">crit.t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se2</span><span class="p">),</span><span class="w">
</span><span class="n">lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">crit.t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se2</span><span class="p">))</span></code></pre>
</figure>
<p>
Note that I didn't compute the confidence interval last time out, but I'll use it here to augment the plots of the trend produced later.
</p>
<h2 id="first-derivatives-and-finite-differences">
First derivatives and finite differences
</h2>
<p>
One measure of change in a system is that the rate of change of the system is non-zero. If we had a simple linear regression model for a trend, then the estimated rate of change would be <span class="math inline">\(\hat{\beta}_1\)</span>, the slope of the regression line. Then, because we're being all statistical, we can ask questions such as whether the non-zero estimate we might obtain for the slope is distinguishable from zero given the uncertainty in the estimate. This slope, the estimate <span class="math inline">\(\hat{\beta}_1\)</span>, is the first derivative of the regression line. Technically, it is the instantaneous rate of change of the function that defines the line, an idea that can be extended to any function, even one as potentially complex as a fitted spline function.
</p>
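<p>
To make the idea concrete, here is a minimal sketch (using simulated data, not the CET series; all object names are mine) showing that for a straight-line model the estimated first derivative is just the slope coefficient, constant everywhere along <em>x</em>:
</p>
<figure class="highlight">
<pre><code class="language-r">## simulated example: the first derivative of a fitted straight line
## is the slope coefficient, the same at every value of x
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.5)
lmod <- lm(y ~ x)
coef(lmod)[2]       # estimated slope == first derivative
confint(lmod, "x")  # does the interval exclude zero?</code></pre>
</figure>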
<p>
The problem we have, though, is that in general we don't have an equation for the spline from which we can derive the derivatives. So how do we estimate the derivative of a spline function in our additive model? One solution is to use the method of finite differences, the essentials of which are displayed in the plots below
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/identifying-periods-of-change-with-gams-finite-differences-plot.png" alt="Illustration of the finite differences approach to estimating the derivatives of a function" />
<figcaption>
Illustration of the finite differences approach to estimating the derivatives of a function
</figcaption>
</figure>
<p>
If you have heard of the <a href="http://en.wikipedia.org/wiki/Trapezoidal_rule"><em>trapezoidal rule</em></a> for approximating the integral of a function, finite differences will be very familiar; instead of being interested in the area under the function, we're interested in estimating the slope of the function at any point. Consider the situation in the left-hand plot above. The thick curve is the fitted spline for the trend component of the additive model <code>m2</code>. Superimposed on this curve are five points, between pairs of which we can approximate the first derivative of the function. We know the distance along the <em>x</em> axis (time) between each pair of points, and we can evaluate the value of the function on the <em>y</em> axis by predicting from <code>m2</code> at the indicated time points. It is then trivial to compute the slope <span class="math inline">\(m\)</span> of the lines connecting the points as the change in the <em>y</em> direction divided by the change in the <em>x</em> direction, or
</p>
<p>
<span class="math display">\[ m = \frac{\Delta y}{\Delta x} \]</span>
</p>
<p>
As you can see from the left-hand plot, in some parts of the function the derivative is reasonably well approximated by this crude method using just five points, but in most places this approach either under- or over-estimates the derivative of the function. The solution is to evaluate the slope at more points, located closer together on the function. The right-hand plot above shows the finite difference method for 20 points, which for all intents and purposes produces an acceptable estimate of the first derivative of the function. We can increase the accuracy of our derivative estimates by using points that are closer and closer together. In the limit the points would be infinitely close and we'd know the first derivative exactly, but on a computer we can't reach that limit with the finite difference method.
</p>
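<p>
The following toy example mirrors the two panels above; <code>f()</code> is a known function standing in for the fitted spline (nothing here comes from the CET model), so we can compare the crude five-point approximation with a finer one:
</p>
<figure class="highlight">
<pre><code class="language-r">## finite-difference slopes of a known function
f <- function(x) sin(2 * x)
fd <- function(n) {
    x <- seq(0, pi, length.out = n)
    diff(f(x)) / diff(x)  # slope between successive pairs of points
}
fd(5)   # crude approximation, as in the left-hand panel
fd(20)  # much closer to the true derivative, 2 * cos(2 * x)</code></pre>
</figure>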
<p>
To sum up, we can approximate the first derivative of a fitted spline by choosing a set of points <span class="math inline">\(p\)</span> on the function and a second set of points <span class="math inline">\(p'\)</span> positioned a very small distance (say <span class="math inline">\(10^{-5}\)</span>) from the first set. Using the <code>predict()</code> method with <code>type = "lpmatrix"</code> plus some other <strong>mgcv</strong> magic, we can evaluate the fitted trend spline at the locations <span class="math inline">\(p\)</span> and <span class="math inline">\(p'\)</span> and compute the change in the function between the pairs of points, and <em>voilà</em>, we have our estimate of the first derivative of the fitted spline.
</p>
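<p>
Before getting to the <code>lpmatrix</code> machinery, note that the same finite difference can be computed directly from the fitted values; here is a sketch reusing the <code>pdat</code> data frame created earlier (this version gives only the point estimates, no standard errors):
</p>
<figure class="highlight">
<pre><code class="language-r">## crude version: finite differences on the fitted values themselves;
## nudging Time alone leaves the seasonal term untouched, so the
## difference reflects the trend spline (plus the constant intercept)
eps <- 1e-5
f0 <- predict(m2$gam, newdata = pdat)
f1 <- predict(m2$gam, newdata = transform(pdat, Time = Time + eps))
fd.deriv <- (f1 - f0) / eps  # approximate derivative w.r.t. Time</code></pre>
</figure>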
<h2 id="confidence-intervals-on-derivatives">
Confidence intervals on derivatives…?
</h2>
<p>
Earlier, when discussing the simple linear regression line, I touched on the issue of error or uncertainty in the estimate of the slope parameter <span class="math inline">\(\hat{\beta}_1\)</span> and how we allow for this in deciding whether the estimated rate of change is different from zero (technically: whether zero is a likely value for the slope given the model). The same issue concerns us with the first derivatives of the fitted spline; we want to know where on the function the function is changing sufficiently quickly that we can distinguish this change from no change, given our uncertainty about the estimated function.
</p>
<p>
Thankfully, with a little bit of magic from <strong>mgcv</strong> we can compute the uncertainty in the estimates of the first derivative of the spline. I've encapsulated this in some functions that are currently only available as a <a href="https://gist.github.com/gavinsimpson/e73f011fdaaab4bb5a30">gist</a> on Github. I intend to package these up at some point with other functions for working with GAM(M) time series models. I'm also grateful to Simon Wood, author of <strong>mgcv</strong>, who provided the initial code to compute the derivatives, which I have generalized somewhat to allow the user to specify particular terms and to identify the correct terms from the model without having to count columns in the smoother prediction matrix.
</p>
<p>
To load the functions into R from Github use
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## download the derivatives gist</span><span class="w">
</span><span class="n">tmpf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="n">download.file</span><span class="p">(</span><span class="s2">"https://gist.github.com/gavinsimpson/e73f011fdaaab4bb5a30/raw/82118ee30c9ef1254795d2ec6d356a664cc138ab/Deriv.R"</span><span class="p">,</span><span class="w">
</span><span class="n">tmpf</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"wget"</span><span class="p">)</span><span class="w">
</span><span class="n">source</span><span class="p">(</span><span class="n">tmpf</span><span class="p">)</span><span class="w">
</span><span class="n">ls</span><span class="p">()</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> [1] "annCET" "cet" "CET" "confint.Deriv"
[5] "crit.t" "ctrl" "Deriv" "df.res"
[9] "m2" "m2.d" "m2.dci" "m2.dsig"
[13] "op" "p2" "pdat" "plot.Deriv"
[17] "rn" "signifD" "take" "take2"
[21] "Term" "tmpf" "want" "Years"
[25] "ylab" "ylim" </code></pre>
</figure>
<p>
I don't intend to explain all the code behind those functions, but the salient parts of the derivative computation are
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">X0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">newDF</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lpmatrix"</span><span class="p">)</span><span class="w">
</span><span class="n">newDF</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">newDF</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">eps</span><span class="w">
</span><span class="n">X1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">newDF</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lpmatrix"</span><span class="p">)</span><span class="w">
</span><span class="n">Xp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="n">X1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">X0</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">eps</span></code></pre>
</figure>
<p>
Here, <code>newDF</code> is a data frame of points along <em>x</em> at which we wish to evaluate the derivative, and <code>eps</code> is the distance along <em>x</em> by which we nudge those points to give the locations <span class="math inline">\(p'\)</span>. <code>type = "lpmatrix"</code> forces <code>predict()</code> to return a matrix which, when multiplied by the vector of model coefficients, yields values for the linear predictor of the model. The useful thing about this representation is that we can derive uncertainty information for the fitted values or for quantities computed from the fitted model. Here we subtract one <code>lpmatrix</code> from another, divide by <code>eps</code>, and proceed to do inference on those "slopes" directly.
</p>
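<p>
A useful sanity check on this representation, assuming the <code>mod</code> and <code>newDF</code> objects exist as in the snippet above (a sketch, not part of the gist): multiplying the <code>lpmatrix</code> by the coefficient vector reproduces the linear predictor that <code>predict()</code> returns.
</p>
<figure class="highlight">
<pre><code class="language-r">## after the nudge, newDF holds the locations used to build X1, so the
## lpmatrix times the coefficient vector matches predict()'s output
eta <- X1 %*% coef(mod)
all.equal(as.numeric(eta), as.numeric(predict(mod, newDF)))  # TRUE</code></pre>
</figure>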
<p>
The other critical bit of code is
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">nt</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">Xi</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Xp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">want</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">grep</span><span class="p">(</span><span class="n">t.labs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">X1</span><span class="p">))</span><span class="w">
</span><span class="n">Xi</span><span class="p">[,</span><span class="w"> </span><span class="n">want</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Xp</span><span class="p">[,</span><span class="w"> </span><span class="n">want</span><span class="p">]</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Xi</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span><span class="w"> </span><span class="c1"># derivatives</span><span class="w">
</span><span class="n">df.sd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rowSums</span><span class="p">(</span><span class="n">Xi</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">mod</span><span class="o">$</span><span class="n">Vp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">Xi</span><span class="p">)</span><span class="o">^</span><span class="m">.5</span><span class="w">
</span><span class="n">lD</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">deriv</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">se.deriv</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df.sd</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
This is where we compute the actual derivatives and their standard errors. This is done in a loop as we have to work separately with the columns of the <code>lpmatrix</code> that relate to each spline. The first three lines of code within the loop are housekeeping that handle this part, with <code>Xi</code> being a matrix into which we insert the relevant values from the <code>lpmatrix</code> for the current term, and which contains zeroes elsewhere. <code>Xi %*% coef(mod)</code> gives us the values of the derivative via a matrix multiplication with the model coefficients. Because the irrelevant columns are zero, we don't need to subset the coefficient vector, though we could make this more efficient by avoiding the copies of <code>Xp</code> with some further subsetting if we wished.
</p>
<p>
The variances for the terms relating to the current spline are computed using <code>Xi %*% mod$Vp * Xi</code>, where <code>mod$Vp</code> is the covariance matrix of the (fixed effects) parameters of the GAM(M). The rows are summed and their square roots taken to give the standard error of the derivative of the entire spline, and not of each of the basis functions that comprise the spline. The last line of the loop just stores the computed derivatives and standard errors in a list that will form part of the object returned by the <code>Deriv()</code> function.
</p>
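<p>
The <code>rowSums()</code> idiom deserves a line of explanation: it computes the diagonal of <span class="math inline">\(X_i V_p X_i^{\mathsf{T}}\)</span>, the pointwise variance of <code>Xi %*% coef(mod)</code>, without ever forming the full <em>n</em> by <em>n</em> matrix. A toy demonstration with made-up matrices:
</p>
<figure class="highlight">
<pre><code class="language-r">## rowSums((Xi %*% Vp) * Xi) == diag(Xi %*% Vp %*% t(Xi))
Xi <- matrix(rnorm(6), nrow = 2)             # toy 2 x 3 "lpmatrix"
Vp <- crossprod(matrix(rnorm(9), nrow = 3))  # toy 3 x 3 covariance matrix
all.equal(rowSums((Xi %*% Vp) * Xi),
          diag(Xi %*% Vp %*% t(Xi)))         # TRUE</code></pre>
</figure>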
<p>
With that, most of the hard work is done. The <code>confint.Deriv()</code> method will compute confidence intervals for the derivatives from their standard errors and some information on the fitted GAM(M); the coverage of these intervals is given by <span class="math inline">\((1 - \alpha)\)</span>, with <span class="math inline">\(\alpha\)</span> commonly set to 0.05, though this can be controlled via argument <code>alpha</code>. It is worth noting that these confidence intervals are of the usual pointwise flavour; they would be fine if you only looked at the interval for a single point in isolation, but they can't be correct when we look at the entire spline. The reason is the familiar problem of multiple comparisons: by looking at the entire spline we are, in effect, making a great many comparisons at once. We could compute simultaneous intervals, but I haven't written the functions to do that just yet; it isn't difficult, as we can simulate from the fitted model and derive simultaneous intervals that way. That will have to wait for another post at a future date though!
</p>
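<p>
For concreteness, here is a sketch of the pointwise interval computation, using the <code>df</code> and <code>df.sd</code> quantities from the loop shown earlier (I use a normal critical value here; the actual <code>confint.Deriv()</code> code may differ in detail):
</p>
<figure class="highlight">
<pre><code class="language-r">## pointwise (1 - alpha) confidence interval for the derivative
alpha <- 0.05
crit <- qnorm(1 - alpha / 2)
deriv.ci <- data.frame(deriv = drop(df),
                       lower = drop(df) - crit * df.sd,
                       upper = drop(df) + crit * df.sd)</code></pre>
</figure>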
<p>
There are two further trivial functions contained in the gist:
</p>
<ul>
<li>
<p>
<code>signifD()</code> plows through the points at which we evaluated the derivative and looks for locations where the pointwise <span class="math inline">\((1 - \alpha)\)</span> confidence interval doesn't include zero. Where zero is contained within the confidence interval the function returns an <code>NA</code>, and for points where zero isn't included the value of the function (or the derivative, depending on what you supplied to <code>signifD()</code>) is returned.
</p>
The function gives you back a list with two components, <code>incr</code> and <code>decr</code>, which contain the locations where the estimated derivative is positive or negative, respectively, and zero is not contained in the confidence interval. As you'll see soon, we can use these two separate lists to colour the increasing and decreasing parts of the fitted spline; a sketch of the underlying logic is given after this list.
</li>
<li>
<p>
<code>plot.Deriv()</code> is an S3 method for <code>plot()</code> which will plot the estimated derivatives and associated confidence intervals. You can do this for selected terms via argument <code>term</code>. The confidence interval can be displayed as lines or a solid polygon via argument <code>polygon</code>. Other graphical parameters can be passed along via <code>…</code>.
</p>
<p>
The one interesting argument is <code>sizer</code>, which if set to <code>TRUE</code> will colour increasing parts of the spline in blue and decreasing parts in red (where zero is not contained in the confidence interval). These colours come from the <a href="http://www.unc.edu/~marron/DataAnalyses/SiZer_Intro.html">siZer</a> method.
</p>
</li>
</ul>
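<p>
As promised above, here is a hypothetical re-implementation of the logic behind <code>signifD()</code>; the gist's actual code may differ in its details, but the idea is simply masking with <code>NA</code> wherever the interval spans the test value:
</p>
<figure class="highlight">
<pre><code class="language-r">## sketch of the signifD() idea; `eval` is the value tested for
## inclusion in the interval (zero by default)
signifD2 <- function(x, d, upper, lower, eval = 0) {
    spansEval <- upper > eval & lower < eval   # interval includes `eval`
    list(incr = ifelse(!spansEval & d > eval, x, NA),
         decr = ifelse(!spansEval & d < eval, x, NA))
}</code></pre>
</figure>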
<h2 id="back-to-the-cet-example">
Back to the CET example
</h2>
<p>
With that taken care of, we can use the functions to compute derivatives, confidence intervals, and ancillary information for the trend spline in model <code>m2</code> with a few simple lines of code
</p>
<figure class="highlight">
<pre><code class="language-r">Term <- "Time"
m2.d <- Deriv(m2)
m2.dci <- confint(m2.d, term = Term)
m2.dsig <- signifD(pdat$p2, d = m2.d[[Term]]$deriv,
                   m2.dci[[Term]]$upper, m2.dci[[Term]]$lower)</code></pre>
</figure>
<p>
Most of that is self-explanatory, but the <code>signifD()</code> call, in lieu of any documentation yet, deserves some explanation. The first argument is the thing you want returned in the <code>incr</code> and <code>decr</code> components; in this case I use the contribution to the fitted values for the trend spline alone, <code>pdat$p2</code>. The argument <code>d</code> is the vector of derivatives for a single term in the model. The two remaining arguments<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> are <code>upper</code> and <code>lower</code>, which need to be supplied with the upper and lower bounds of the confidence interval respectively. Extracting all of that is a bit redundant when we could pass in the object returned by <code>confint.Deriv()</code>, but I was thinking more simply when I wrote <code>signifD()</code> in a few dark days of number crunching when people were waiting on me to deliver some results.
</p>
<p>
A quick plot of the first derivative of the spline is achieved using the <code>plot()</code> method described above; the result is shown below.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/identifying-periods-of-change-with-gams-first-derivative-plot.png" alt="First derivative of the fitted trend spline from the additive model with AR(2) errors" />
<figcaption>
First derivative of the fitted trend spline from the additive model with AR(2) errors
</figcaption>
</figure>
<p>
From the plot it is clear that there are two periods of (statistically) significant change, both increases, as shown by the blue indicators. However, with a plot like the one above you really have to dial in your brain to thinking about an increasing (decreasing) trend in the data when the derivative line is above (below) the zero line, even if the derivative itself is decreasing (increasing). As such, I have found an alternative display useful, one that combines a plot of the estimated trend spline with periods of significant change indicated by a thicker line, with or without the siZer colours. You can see examples of such plots in a <a href="/2013/10/23/time-series-plots-with-lattice-and-ggplot/">previous post</a>.
</p>
<p>
That <a href="/2013/10/23/time-series-plots-with-lattice-and-ggplot/">post</a> used the <strong>lattice</strong> and <strong>ggplot2</strong> packages to draw the plots. I still find it easier to just bash out a base graphics plot for such things unless I need the faceting features offered by those packages.
Below I take such an approach and build a base graphics plot up from a series of plotting calls that successively augment the plot, starting with the estimated trend spline and pointwise confidence interval, then two calls to superimpose the periods of significant increase and decrease (although the latter doesn't actually draw anything in this plot).
</p>
<figure class="highlight">
<pre><code class="language-r">ylim <- with(pdat, range(upper, lower, p2))
ylab <- expression(Temperature ~ (degree*C * ":" ~ centred))

plot(p2 ~ Date, data = pdat, type = "n", ylab = ylab, ylim = ylim)
lines(p2 ~ Date, data = pdat)
lines(upper ~ Date, data = pdat, lty = "dashed")
lines(lower ~ Date, data = pdat, lty = "dashed")
lines(unlist(m2.dsig$incr) ~ Date, data = pdat, col = "blue", lwd = 3)
lines(unlist(m2.dsig$decr) ~ Date, data = pdat, col = "red", lwd = 3)</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/identifying-periods-of-change-with-gams-plot-trend-with-sizer.png" alt="Fitted trend showing periods of significant increase in temperature in blue." />
<figcaption>
Fitted trend showing periods of significant increase in temperature in blue.
</figcaption>
</figure>
<p>
When looking at a plot like this, it is always important to have in the back of your mind a picture of the original data and how the variance is decomposed between the seasonal, trend, and residual terms, as well as the <em>effect size</em> associated with the trend and seasonal splines. You get a very different impression of the magnitude of the trend from the plot shown below, which contains the original data as well as the fitted trend spline.
</p>
<figure class="highlight">
<pre><code class="language-r">plot(Temperature - mean(Temperature) ~ Date, data = cet, type = "n",
     ylab = ylab)
points(Temperature - mean(Temperature) ~ Date, data = cet,
       col = "lightgrey", pch = 16, cex = 0.7)
lines(p2 ~ Date, data = pdat)
lines(upper ~ Date, data = pdat, lty = "dashed")
lines(lower ~ Date, data = pdat, lty = "dashed")
lines(unlist(m2.dsig$incr) ~ Date, data = pdat, col = "blue", lwd = 3)
lines(unlist(m2.dsig$decr) ~ Date, data = pdat, col = "red", lwd = 3)</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/identifying-periods-of-change-with-gams-plot-trend-with-sizer-and-data.png" alt="Fitted trend showing periods of significant increase in temperature in blue. The original data are shown as grey points. Note the low precision of the data early in the CET series." />
<figcaption>
Fitted trend showing periods of significant increase in temperature in blue. The original data are shown as grey points. Note the low precision of the data early in the CET series.
</figcaption>
</figure>
<p>
Yet, whilst small compared to the seasonal amplitude in temperature, the CET exhibits about a 1°C increase in mean temperature over the past 150 or so years. Not something to dismiss as trivial.
</p>
<h2 id="summing-up">
Summing up
</h2>
<p>
In this post I've attempted to explain the method of finite differences for estimating the derivatives of a function, such as the splines of an additive model. This was illustrated using custom functions that extract information from the fitted model and exploit features of the <strong>mgcv</strong> package to estimate standard errors for quantities derived from the fitted model, such as the first derivatives used here.
</p>
<p>
So far we've looked at fitting an additive model with two smoothers to seasonal data, the use of cyclic splines to represent cyclical features of the data, and the estimation of an appropriate ARMA correlation structure to account for serial dependence in the data. Here we've seen how to identify periods of significant change in the trend term using the first derivative of the fitted spline. Where do we go from here?
</p>
<p>
One important improvement is to turn the pointwise confidence interval into a simultaneous one. There is also one feature potentially present in the data that we have thus far failed to address; we might reasonably expect that the seasonal pattern in temperature has changed with the increase in the level of the series over time. To fit a model that includes this possibility we'll need to investigate smooth functions of two (or more) variables, as well as look in more detail at how to build more complex GAM(M) models with <strong>mgcv</strong>. I'll see about addressing these two issues in future posts.
</p>
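<p>
As a small preview of the simultaneous interval idea (an assumption-laden sketch rather than the promised post): we can draw coefficient vectors from their approximate posterior, compute the derivative implied by each draw using the <code>Xi</code>, <code>df</code>, and <code>df.sd</code> quantities from the loop shown earlier, and rescale the pointwise interval by the 95th percentile of the maximum absolute standardized deviation.
</p>
<figure class="highlight">
<pre><code class="language-r">## sketch only: simulation-based simultaneous interval for a derivative
library(MASS)  # for mvrnorm()
set.seed(42)
nsim <- 10000
betas <- mvrnorm(nsim, coef(mod), mod$Vp)  # draws from approx. posterior
dev <- Xi %*% (t(betas) - coef(mod))       # deviation of each simulated derivative
masd <- apply(abs(dev / df.sd), 2, max)    # max abs standardized deviation per draw
crit <- quantile(masd, probs = 0.95)       # simultaneous critical value
upr <- drop(df) + crit * df.sd
lwr <- drop(df) - crit * df.sd</code></pre>
</figure>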
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
There's a further argument, <code>eval</code>, which contains the value you wish to test for inclusion in the coverage of the confidence interval. By default this is set to <code>0</code>.<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>