Using random effects in GAMs with mgcv
Gavin L. Simpson
2021-02-02T10:50:00-06:00
https://www.fromthebottomoftheheap.net/2021/02/02/random-effects-in-gams/
<p>
There are lots of choices for fitting generalized linear mixed effects models within R, but if you want to include smooth functions of covariates, the choices are limited. One option is to fit the model using <code>gamm()</code> from the <strong>mgcv</strong> 📦 or <code>gamm4()</code> from the <strong>gamm4</strong> 📦, which use <code>lme()</code> (<strong>nlme</strong> 📦) or one of <code>lmer()</code> or <code>glmer()</code> (<strong>lme4</strong> 📦) under the hood respectively. The problem with doing things that way is that you get PQL fitting for non-Gaussian models (😱) and the range of families for handling non-Gaussian responses is quite limited, especially compared with the extended families now available with <code>gam()</code>. <strong>brms</strong> 📦 is a good option if you don't want to do everything by hand, but the MCMC can be slow. Instead, we could use the equivalence between smooths and random effects and use <code>gam()</code> or <code>bam()</code> from <strong>mgcv</strong>. In this post I'll show you how to do just that.
</p>
<h2 id="smooths-as-random-effects">
Smooths as random effects
</h2>
<p>
The sorts of smooths we fit in <strong>mgcv</strong> are (typically) penalized smooths; we choose to use some number of basis functions <span class="math inline">\(k\)</span>, which sets an upper limit on the complexity – wiggliness – of the smooth, and then we estimate parameters for the model by maximizing a penalized log-likelihood. The log-likelihood of the model is a measure of the fit (or lack thereof), while the penalty helps us avoid fitting overly complex smooths.
</p>
<p>
In the sorts of models that can be fitted in <strong>mgcv</strong>, the penalty is a function of the model coefficients, <span class="math inline">\(\boldsymbol{\beta}\)</span>, and a penalty matrix<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>, which we write as <span class="math inline">\(\mathbf{S}\)</span>. The penalty then is <span class="math inline">\(\boldsymbol{\beta}^{\mathsf{T}} \mathbf{S} \boldsymbol{\beta}\)</span>. The penalty matrix measures the wiggliness of each basis function (on the diagonal), and how the wiggliness of one basis function affects the wiggliness of another (the off-diagonals). Just as the <span class="math inline">\(\boldsymbol{\beta}\)</span> scale the individual basis functions, they also scale the penalty values in the penalty matrix; if you were to choose large weights for the most wiggly basis functions, the overall penalty <span class="math inline">\(\boldsymbol{\beta}^{\mathsf{T}} \mathbf{S} \boldsymbol{\beta}\)</span> would increase by a lot more than if we used smaller weights for those really wiggly functions.
</p>
<p>
The penalty then acts to shrink the estimates of <span class="math inline">\(\boldsymbol{\beta}\)</span> away from the values they would take if we weren't doing a penalized fit and were instead fixing the wiggliness of the smooth at the maximum value dictated by <span class="math inline">\(k\)</span>. Put another way, the penalty shrinks the estimates of <span class="math inline">\(\boldsymbol{\beta}\)</span> towards zero.
</p>
<p>
Random effects also involve shrinkage. With a random effect we're trying to model subject-specific effects (subject-specific intercepts, or subject-specific 'slopes' of covariates) without having to explicitly estimate a fixed effect parameter for each subject's intercept or covariate effect. Instead, we think of the subject-specific intercepts or 'slopes' as coming from a distribution, typically a Gaussian distribution, with mean 0 and some variance that is to be estimated. The larger this random effect variance, the greater the variation among subject-specific intercepts, 'slopes', etc. The smaller the random effect variance, the closer to zero the estimated effects are pulled. As a result, random effects shrink, to varying degrees, the estimated subject-specific effects, and how much they do so is related to the random effect variance.
</p>
<p>
If I abuse all standards of notation and represent the estimated random effects with <span class="math inline">\(\boldsymbol{\beta}\)</span>, you might get the feeling that perhaps there is some link between what's happening when we estimate random effects – shrinking the <span class="math inline">\(\boldsymbol{\beta}\)</span> towards zero – and the penalty applied to smooths, which shrinks the <span class="math inline">\(\boldsymbol{\beta}\)</span> towards zero. If you did, you'd be right; there is. And if so, there must be a penalty matrix that we can write down for a random effect – if we assume that each random intercept or 'slope' is a basis function, the penalty matrix <span class="math inline">\(\mathbf{S}\)</span> is a simple diagonal matrix, one row and column per subject, with a constant value on the diagonal (and zeroes everywhere else):
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/random-effects-in-gams-penalty-example-1.png" alt="Penalty matrix corresponding to a random effect for a factor with 10 subjects (levels)." />
<figcaption>
Penalty matrix corresponding to a random effect for a factor with 10 subjects (levels).
</figcaption>
</figure>
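<p>
If you want to convince yourself of this, you can build a random effect smooth by hand with <code>smoothCon()</code> from <strong>mgcv</strong> and inspect its penalty matrix. This is just a sketch, using a hypothetical factor <code>f</code> with 10 levels:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("mgcv")
## a hypothetical factor with 10 subjects (levels), two observations each
df <- data.frame(f = factor(rep(1:10, each = 2)))
## construct the random effect smooth without fitting a model
sm <- smoothCon(s(f, bs = "re"), data = df)[[1]]
sm$S[[1]] # a 10 x 10 identity matrix; the ridge penalty on the random effects</code></pre>
</figure>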
<p>
To complete the picture, when we fit a GAM we're maximising the penalised log-likelihood over both the model parameters <span class="math inline">\(\boldsymbol{\beta}\)</span> and a smoothness parameter, <span class="math inline">\(\lambda\)</span>. It's <span class="math inline">\(\lambda\)</span> that controls the price we pay for wiggliness, as we subtract <span class="math inline">\(\lambda \boldsymbol{\beta}^{\mathsf{T}} \mathbf{S} \boldsymbol{\beta}\)</span> from the log-likelihood. It turns out that the variance of the random effect is equal to the scale parameter (the residual variance <span class="math inline">\(\sigma^2_{\varepsilon}\)</span> in a Gaussian model, for example) divided by <span class="math inline">\(\lambda\)</span>.
</p>
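<p>
Putting those pieces together in the notation used here (implementations differ in where they put the factors of two), we are maximising the penalised log-likelihood, and the implied random effect variance follows from the estimated scale and smoothness parameters:
</p>
<p>
<span class="math display">\[\ell_p(\boldsymbol{\beta}) = \ell(\boldsymbol{\beta}) - \frac{1}{2} \lambda \boldsymbol{\beta}^{\mathsf{T}} \mathbf{S} \boldsymbol{\beta}, \qquad \hat{\sigma}^2_b = \frac{\hat{\sigma}^2_{\varepsilon}}{\hat{\lambda}}\]</span>
</p>
<p>
This conversion from smoothness parameters to variances is essentially what <code>gam.vcomp()</code> (and hence <strong>gratia</strong>'s <code>variance_comp()</code>, used below) performs when reporting variance components.
</p>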
<p>
This link between smooths and random effects is really cool; not only are we able to estimate smooths and GAMs using the machinery of mixed effects models, we can also estimate random effects using all the penalized spline machinery available for GAMs in <strong>mgcv</strong>.
</p>
<p>
OK, so that was all really hand-wavy and skipped over a lot of math and theory<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>, but I hope it gives you the intuition you need to understand how random effects are represented as smooths, through the identity penalty matrix.
</p>
<h2 id="fitting-random-effects-with-mgcv">
Fitting random effects with mgcv
</h2>
<p>
So much for the theory, let's see how this all works in practice.
</p>
<p>
By way of an example, I'm going to use a data set from a study on the effects of testosterone on the growth of rats from <span class="citation" data-cites="Molenberghs2000-vk">Molenberghs and Verbeke (2000)</span>, which was analysed in <span class="citation" data-cites="Fahrmeir2013-xu">Fahrmeir et al. (2013)</span>, from where I also obtained the data. In the experiment, 50 rats were randomly assigned to one of three groups: a control group or a group receiving low or high doses of Decapeptyl, which inhibits testosterone production. The experiment started when the rats were 45 days old, and from day 50 onwards the size of each rat's head was measured via an X-ray image. You can download the data <a href="https://www.uni-goettingen.de/de/551625.html">here</a>.
</p>
<p>
For the example, we'll use the following packages
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pkgs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">,</span><span class="w"> </span><span class="s2">"lme4"</span><span class="p">,</span><span class="w"> </span><span class="s2">"ggplot2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"vroom"</span><span class="p">,</span><span class="w"> </span><span class="s2">"dplyr"</span><span class="p">,</span><span class="w"> </span><span class="s2">"forcats"</span><span class="p">,</span><span class="w"> </span><span class="s2">"tidyr"</span><span class="p">)</span><span class="w">
</span><span class="c1">## install.packages(pkgs, Ncpus = 4)</span><span class="w">
</span><span class="n">vapply</span><span class="p">(</span><span class="n">pkgs</span><span class="p">,</span><span class="w"> </span><span class="n">library</span><span class="p">,</span><span class="w"> </span><span class="n">logical</span><span class="p">(</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">character.only</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">logical.return</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w">
</span><span class="n">quietly</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> mgcv lme4 ggplot2 vroom dplyr forcats tidyr
TRUE TRUE TRUE TRUE TRUE TRUE TRUE </code></pre>
</figure>
<p>
We'll also need the development version of the <strong>gratia</strong> 📦, which we can install with the <strong>remotes</strong> 📦 (if you don't have that installed, install it first)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## install.packages("remotes")</span><span class="w">
</span><span class="c1">## remotes::install_github('gavinsimpson/gratia')</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'gratia'</span><span class="p">)</span></code></pre>
</figure>
<p>
We load the data – ignore the warning about new names, as we skipped that column anyway
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">rats</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vroom</span><span class="p">(</span><span class="s1">'rats.txt'</span><span class="p">,</span><span class="w"> </span><span class="n">delim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">' '</span><span class="p">,</span><span class="w"> </span><span class="n">col_types</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'dddddddddddd-'</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">New names:
* `` -> ...13</code></pre>
</figure>
<p>
Next we need to prepare the data for modelling. The variable <code>transf_time</code> is the main covariate of interest. It relates to the age of the rats in days via the transformation
</p>
<p>
<span class="math display">[(1 + ( - 45) / 10)]</span>
</p>
<p>
where <span class="math inline">\(t\)</span> is the <code>time</code> variable in the data set. We also need to convert the <code>group</code> variable to a factor with useful levels to create a <code>treatment</code> variable, and we convert <code>subject</code> – an identifier for each individual rat – to a factor
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">rats</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rats</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">treatment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fct_recode</span><span class="p">(</span><span class="n">factor</span><span class="p">(</span><span class="n">group</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">3</span><span class="p">)),</span><span class="w">
</span><span class="n">Low</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'1'</span><span class="p">,</span><span class="w">
</span><span class="n">High</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'3'</span><span class="p">,</span><span class="w">
</span><span class="n">Control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'2'</span><span class="p">),</span><span class="w">
</span><span class="n">subject</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">subject</span><span class="p">))</span></code></pre>
</figure>
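<p>
As a quick check – a sketch, assuming the log transformation given above – <code>transf_time</code> can be recomputed from <code>time</code>:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">rats %>%
    mutate(check = log(1 + (time - 45) / 10)) %>%
    summarise(max_diff = max(abs(check - transf_time), na.rm = TRUE))
## max_diff should be ~0 if the transformation matches</code></pre>
</figure>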
<p>
The number of observations per rat is variable, with only 22 of the 50 rats having the complete seven measurements by day 110
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">rats</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">na.omit</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">subject</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n_rats"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 7 x 2
n n_rats
* <int> <int>
1 1 4
2 2 3
3 3 5
4 4 9
5 5 5
6 6 2
7 7 22</code></pre>
</figure>
<p>
so simply averaging the response within subjects and doing an ANOVA isn't an option.
</p>
<p>
Before we fit the models and explore how to work with random effects in <strong>mgcv</strong>, we'll plot the data
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plt_labs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Head height (distance in pixels)'</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Age in days'</span><span class="p">,</span><span class="w">
</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Treatment'</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">rats</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">response</span><span class="p">,</span><span class="w">
</span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">subject</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">treatment</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">treatment</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">plt_labs</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning: Removed 98 row(s) containing missing values (geom_path).</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/random-effects-in-gams-plot-rat-data-1.png" alt="Plot of the rat hormone therapy data" />
<figcaption>
Plot of the rat hormone therapy data
</figcaption>
</figure>
<p>
The model fitted in <span class="citation" data-cites="Fahrmeir2013-xu">Fahrmeir et al. (2013)</span> is
</p>
<p>
<span class="math display">[y_{ij} = <em>0 + </em>{0i} + <em>1 L_i t</em>{ij} + <em>2 H_i t</em>{ij} + <em>3 C_i t</em>{ij} + <em>{1i} t</em>{ij} + _{ij}]</span>
</p>
<p>
where
</p>
<ul>
<li>
<span class="math inline">(_0)</span> is the population mean of the response at the start of the treatment
</li>
<li>
<span class="math inline">(L_i)</span>, <span class="math inline">(H_i)</span>, <span class="math inline">(C_i)</span> are dummy variables encoding for each treatment group
</li>
<li>
<span class="math inline">(_{0i})</span> is the rat-specific mean (random intercept)
</li>
<li>
<span class="math inline">(<em>{qi} t</em>{ij})</span> is the rat-specific effect of <code>transf_time</code> (random slope)
</li>
</ul>
<p>
If this isn't very clear – it took me a little while to grok what this meant and translate it into R speak – note that each of <span class="math inline">\(\beta_1\)</span>, <span class="math inline">\(\beta_2\)</span>, and <span class="math inline">\(\beta_3\)</span> is associated with an interaction between the dummy variable coding for the treatment and the time variable. So we have a model with an intercept and three interaction terms, and no main effects.
</p>
<p>
In <code>lmer()</code> we can fit this model with (ignore the singular fit warning for now)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1_lmer</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lmer</span><span class="p">(</span><span class="n">response</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">treatment</span><span class="o">:</span><span class="n">transf_time</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">subject</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">transf_time</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">subject</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rats</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">boundary (singular) fit: see ?isSingular</code></pre>
</figure>
<p>
If you're not familiar with this model specification for the random effects, it specifies uncorrelated random effects for the subject-specific means (random intercept; <code>(1 | subject)</code>) and the subject-specific effects of <code>transf_time</code> (random slope; <code>(0 + transf_time | subject)</code>). The <code>0</code> in the formula for the latter suppresses the (random) intercept, as we already included that as a separate term.
</p>
<p>
The reason we're fitting uncorrelated random effects is that this is all <strong>mgcv</strong> can fit; there's no way to encode a covariance term between the two random effects.
</p>
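<p>
For comparison, the correlated version, which has no <code>gam()</code> equivalent, would be specified in <strong>lme4</strong> like this (shown for illustration only, not run):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## correlated random intercepts and slopes; gam() cannot represent
## the covariance between the two terms
## m_corr <- lmer(response ~ treatment:transf_time +
##                    (1 + transf_time | subject),
##                data = rats)</code></pre>
</figure>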
<p>
The equivalent model fitted using <code>gam()</code> is
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1_gam</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">response</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">treatment</span><span class="o">:</span><span class="n">transf_time</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">s</span><span class="p">(</span><span class="n">subject</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'re'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">s</span><span class="p">(</span><span class="n">subject</span><span class="p">,</span><span class="w"> </span><span class="n">transf_time</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'re'</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rats</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'REML'</span><span class="p">)</span></code></pre>
</figure>
<p>
Note:
</p>
<ol type="1">
<li>
we specify two separate <em>random effect</em> smooths, one per random term,
</li>
<li>
we indicate that the smooth should be a random effect with <code>bs = 're'</code>,
</li>
<li>
any grouping variables <strong>must</strong> be coded as factors – that's why we converted <code>subject</code> (which is an integer vector) to a factor right after importing the data.
</li>
</ol>
<p>
Let's compare the fixed effect terms; first for the <code>lmer()</code> version
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">fixef</span><span class="p">(</span><span class="n">m1_lmer</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> (Intercept) treatmentControl:transf_time
68.607386 6.871128
treatmentLow:transf_time treatmentHigh:transf_time
7.506897 7.313854 </code></pre>
</figure>
<p>
and for the <code>gam()</code> version
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">coef</span><span class="p">(</span><span class="n">m1_gam</span><span class="p">)[</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">]</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> (Intercept) treatmentControl:transf_time
68.607385 6.871130
treatmentLow:transf_time treatmentHigh:transf_time
7.506897 7.313859 </code></pre>
</figure>
<p>
which are close enough.
</p>
<p>
Next let's look at the estimated variances of the random effect terms. First for the <code>lmer()</code> model:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m1_lmer</span><span class="p">)</span><span class="o">$</span><span class="n">varcor</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> Groups Name Std.Dev.
subject (Intercept) 1.8881
subject.1 transf_time 0.0000
Residual 1.2020 </code></pre>
</figure>
<p>
and now for the <code>gam()</code> model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">variance_comp</span><span class="p">(</span><span class="n">m1_gam</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 3 x 5
component variance std_dev lower_ci upper_ci
<chr> <dbl> <dbl> <dbl> <dbl>
1 s(subject) 3.56 1.89 1.51e+ 0 2.36e+ 0
2 s(subject,transf_time) 0.0000257 0.00507 8.21e-42 3.14e+36
3 scale 1.44 1.20 1.09e+ 0 1.33e+ 0
</figure>
<p>
Apart from being close enough that the differences don't matter, we should also note that the variance for the rat-specific effect of <code>transf_time</code> is effectively 0. This is likely the cause of the singular fit warning from <code>lmer()</code>. The <code>lower_ci</code> and <code>upper_ci</code> variables contain the limits of a 95% confidence interval on the standard deviation of each variance component; the coverage can be controlled via the <code>coverage</code> argument to <code>variance_comp()</code>. The confidence interval for the rat-specific time effect variance is huge, again indicating that there really isn't much variation at all in this component.
</p>
<p>
Here we used the <code>variance_comp()</code> function from <strong>gratia</strong> to extract the variance components, which expresses the random effects as the equivalent variance components you'd see in a mixed model output. <code>variance_comp()</code> is a simple wrapper around <code>mgcv::gam.vcomp()</code>, which does all the hard work, but it suppresses the printed output produced by <code>gam.vcomp()</code> and returns the variance components as a tibble.
</p>
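<p>
If you prefer to stay within <strong>mgcv</strong> itself, the underlying call is
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## prints a summary and returns the standard deviations of the
## variance components with their confidence limits
vc <- gam.vcomp(m1_gam)
vc</code></pre>
</figure>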
<p>
You can see a nicer version of the variance components for <code>lmer()</code> by printing the whole <code>summary()</code> but it produces a lot of output; the bit we are interested in just now is in the section labelled <em>Random effects:</em>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m1_lmer</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Linear mixed model fit by REML ['lmerMod']
Formula: response ~ treatment:transf_time + (1 | subject) + (0 + transf_time |
subject)
Data: rats
REML criterion at convergence: 932.4
Scaled residuals:
Min 1Q Median 3Q Max
-2.25576 -0.65898 -0.01164 0.58358 2.88310
Random effects:
Groups Name Variance Std.Dev.
subject (Intercept) 3.565 1.888
subject.1 transf_time 0.000 0.000
Residual 1.445 1.202
Number of obs: 252, groups: subject, 50
Fixed effects:
Estimate Std. Error t value
(Intercept) 68.6074 0.3312 207.13
treatmentControl:transf_time 6.8711 0.2276 30.19
treatmentLow:transf_time 7.5069 0.2252 33.34
treatmentHigh:transf_time 7.3139 0.2808 26.05
Correlation of Fixed Effects:
(Intr) trtC:_ trtL:_
trtmntCnt:_ -0.340
trtmntLw:t_ -0.351 0.119
trtmntHgh:_ -0.327 0.111 0.115
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingular</code></pre>
</figure>
<p>
One of the nice things about the output from the <code>gam()</code> model is that the <code>summary()</code> contains a test for the random effects
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m1_gam</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
response ~ treatment:transf_time + s(subject, bs = "re") + s(subject,
transf_time, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.6074 0.3312 207.13 <2e-16 ***
treatmentControl:transf_time 6.8711 0.2276 30.19 <2e-16 ***
treatmentLow:transf_time 7.5069 0.2252 33.34 <2e-16 ***
treatmentHigh:transf_time 7.3139 0.2808 26.05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(subject) 43.723610 49 11.51 <2e-16 ***
s(subject,transf_time) 0.001387 47 0.00 0.744
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.926 Deviance explained = 94%
-REML = 466.2 Scale est. = 1.4448 n = 252</code></pre>
</figure>
<p>
This test is due to <span class="citation" data-cites="Wood2013-gz">Wood (2013)</span>. It is based on a likelihood ratio test and uses a reference distribution that is appropriate for testing a null hypothesis that lies on the boundary of the parameter space (the null, that the variance is 0, is on the lower boundary of possible values for the parameter – you can't have a negative variance!).
</p>
<p>
There is little evidence in support of the rat-specific time effects, reflecting what we saw when we looked at the variance components above.
</p>
<p>
If we look at the estimated degrees of freedom (EDF; the <code>edf</code> column) for each of the 'smooths', we see the shrinkage in action. The <code>Ref.df</code> column contains the maximum degrees of freedom for each term, used in the calculation of the <em>p</em> value. The rat-specific mean distances – the <code>s(subject)</code> term – have only been shrunk a little, to an EDF of ~43.7. In contrast, the EDF for the rat-specific effects of time has been shrunk to effectively zero.
</p>
<p>
The EDFs for smooths can be extracted from a fitted model with <code>edf()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">edf</span><span class="p">(</span><span class="n">m1_gam</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 2 x 2
smooth edf
<chr> <dbl>
1 s(subject) 43.7
2 s(subject,transf_time) 0.00139</code></pre>
</figure>
<p>
To plot the estimated time effects for each rat, we need to produce a new data frame containing the unique values of <code>transf_time</code> for each rat, along with the relevant treatment value for that rat. We do this with <code>expand()</code> and <code>nesting()</code> from the <strong>tidyr</strong> 📦.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">new_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tidyr</span><span class="o">::</span><span class="n">expand</span><span class="p">(</span><span class="n">rats</span><span class="p">,</span><span class="w"> </span><span class="n">nesting</span><span class="p">(</span><span class="n">subject</span><span class="p">,</span><span class="w"> </span><span class="n">treatment</span><span class="p">),</span><span class="w">
</span><span class="n">transf_time</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unique</span><span class="p">(</span><span class="n">transf_time</span><span class="p">))</span></code></pre>
</figure>
<p>
which we then use to predict from the model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1_pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bind_cols</span><span class="p">(</span><span class="n">new_data</span><span class="p">,</span><span class="w">
</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m1_gam</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w">
</span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)))</span></code></pre>
</figure>
<p>
which gives us something we can plot easily with <strong>ggplot</strong>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">m1_pred</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">transf_time</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">subject</span><span class="p">,</span><span class="w">
</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">treatment</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">treatment</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">plt_labs</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/random-effects-in-gams-plot-m1-gam-predicted-1.png" alt="Fitted growth curves from the mixed effect model fitted using gam()" />
<figcaption>
Fitted growth curves from the mixed effect model fitted using <code>gam()</code>
</figcaption>
</figure>
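<p>
If you also want approximate pointwise confidence intervals around these curves, a minimal sketch using the standard errors returned by <code>predict()</code> is
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## approximate 95% pointwise interval from the standard errors
m1_pred <- m1_pred %>%
    mutate(lower = fit - (1.96 * se.fit),
           upper = fit + (1.96 * se.fit))</code></pre>
</figure>
<p>
which could then be drawn with <code>geom_ribbon()</code>, mapping <code>lower</code> and <code>upper</code> to <code>ymin</code> and <code>ymax</code>.
</p>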
<p>
We can also compare the fitted curves with the observed data
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">m1_pred</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">transf_time</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">subject</span><span class="p">,</span><span class="w">
</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">treatment</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rats</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">response</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">subject</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">plt_labs</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning: Removed 98 rows containing missing values (geom_point).</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/random-effects-in-gams-plot-m1-gam-compare-1.png" alt="Observed values and fitted growth curves from the mixed effect model fitted using gam()" />
<figcaption>
Observed values and fitted growth curves from the mixed effect model fitted using <code>gam()</code>
</figcaption>
</figure>
<p>
A simpler model, which drops the rat-specific effects of <code>transf_time</code>, is
</p>
<p>
<span class="math display">[y_{ij} = <em>0 + </em>{0i} + <em>1 L_i t</em>{ij} + <em>2 H_i t</em>{ij} + <em>3 C_i t</em>{ij} + _{ij}]</span>
</p>
<p>
which excludes the <span class="math inline">\(\gamma_{1i} t_{ij}\)</span> term, removing the rat-specific time effects from the model.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m2_lmer</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lmer</span><span class="p">(</span><span class="n">response</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">treatment</span><span class="o">:</span><span class="n">transf_time</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">subject</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rats</span><span class="p">)</span><span class="w">
</span><span class="n">m2_gam</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">response</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">treatment</span><span class="o">:</span><span class="n">transf_time</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">s</span><span class="p">(</span><span class="n">subject</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'re'</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rats</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'REML'</span><span class="p">)</span></code></pre>
</figure>
<p>
As we should by now expect, the two models have estimated variance components that are essentially equivalent. First for the <code>lmer()</code> fit:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m2_lmer</span><span class="p">)</span><span class="o">$</span><span class="n">varcor</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> Groups Name Std.Dev.
subject (Intercept) 1.8881
Residual 1.2020 </code></pre>
</figure>
<p>
and now for the <code>gam()</code> version
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">variance_comp</span><span class="p">(</span><span class="n">m2_gam</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 2 x 5
component variance std_dev lower_ci upper_ci
<chr> <dbl> <dbl> <dbl> <dbl>
1 s(subject) 3.56 1.89 1.51 2.36
2 scale 1.44 1.20 1.09 1.33</code></pre>
</figure>
<p>
We could use the <code>anova()</code> method for <code>"gam"</code> fits, but for fully penalized terms like random effects the test isn't very good and <em>p</em> values can be badly biased. Wood (2017, p. 315) says of the test "As expected, the test is clearly useless for comparing models differing in [their] random effect structure." So, maybe give this one a miss.
</p>
<p>
Using <code>AIC()</code> to compare the models is also an option:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">AIC</span><span class="p">(</span><span class="n">m2_gam</span><span class="p">,</span><span class="w"> </span><span class="n">m1_gam</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> df AIC
m2_gam 48.98553 852.9313
m1_gam 48.98931 852.9371</code></pre>
</figure>
<p>
AIC favours the simpler model, if only just, as the fits of the two models are essentially the same. Note that because the EDF of the <code>s(subject, transf_time)</code> term was so close to zero, we don't pay much of a penalty for including it in the model, and hence the AICs of the two models are very similar (typically, where two models fit equally well, we'd expect the AIC of the more complex one to be larger).
</p>
<p>
Note that the AIC computed for the <code>gam()</code> model is a <em>conditional</em> AIC, where the likelihood is evaluated with all model coefficients set to their maximum penalized likelihood estimates. The AIC for an <code>lmer()</code> fit is a <em>marginal</em> AIC, where all the penalized coefficients are viewed as random effects and integrated out of the joint density of the response and random effects.
</p>
<p>
The conditional AIC for the <code>gam()</code> fit would be anti-conservative, especially so for models containing random effects, if no steps were taken to account for smoothness parameter selection in the EDF calculation. The upshot is that such a conditional AIC would typically choose a model with a random effects structure that isn't in the true model. The <code>AIC()</code> method for <code>gam()</code> fits applies a suitable correction to the model EDF to account for smoothness parameter selection, resulting in an information criterion with mostly good properties.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">draw</span><span class="p">(</span><span class="n">m2_gam</span><span class="p">,</span><span class="w"> </span><span class="n">parametric</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/random-effects-in-gams-draw-gam-model-1.png" alt="QQ-plot of the rat-specific mean distance effects" />
<figcaption>
QQ-plot of the rat-specific mean distance effects
</figcaption>
</figure>
<p>
We need <code>parametric = FALSE</code> here because at the time of writing there is a bug in the code that handles parametric fixed effects.
</p>
<h2 id="its-not-all-good-news">
It's not all good news
</h2>
<p>
It all seems a little too good to be true, doesn't it? We have a way to fit models with random effects that works well, allows for tests of random effect terms against a null of zero variance, and allows us to use all the extended families that <code>gam()</code> provides, including some complex distributional model families.
</p>
<p>
Well, as they say, there is no free lunch; the main issue with fitting random effects as smooths in <code>gam()</code> is efficiency. <code>lmer()</code> and <code>glmer()</code> use very efficient algorithms for fitting the model, including sparse matrices for the model terms. Because <code>gam()</code> needs the full penalty matrix for each random effect, and currently doesn't use sparse matrices for efficient computation, <code>gam()</code> fits will get very slow as the number of random effect levels increases: the larger the number of subjects (levels), the slower things will get. The same happens as you include more, and more complex, random effect terms in the model.
</p>
<p>
Basically, if you have random effects with many hundreds or thousands of levels (subjects), expect the time it takes to fit your <code>gam()</code> to increase dramatically, and expect the memory usage to increase markedly too.
</p>
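<p>
One partial remedy is <code>bam()</code>, mentioned at the start of the post, which is designed for large data sets. A sketch of the equivalent call, using the fast REML criterion and discretised covariates, is
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## same model as m1_gam, but using bam()'s more efficient fitting methods
m1_bam <- bam(response ~ treatment:transf_time +
                  s(subject, bs = 're') +
                  s(subject, transf_time, bs = 're'),
              data = rats, method = 'fREML', discrete = TRUE)</code></pre>
</figure>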
<p>
Also, running <code>summary()</code> on a model with many random effect levels, or with many random effect terms, is going to be slow: the test for the random effect terms is computationally expensive. If you are mostly interested in the other model terms, setting the <code>re.test</code> argument to <code>FALSE</code> will skip the tests for random effects (and other terms with a zero-dimension null space), allowing the summary for the other terms to be computed quickly.
</p>
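<p>
For example:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## skip the expensive tests of the random effect terms
summary(m1_gam, re.test = FALSE)</code></pre>
</figure>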
<h2 id="fin">
Fin
</h2>
<p>
In this post I showed how random effects can be represented as smooths and how to use them practically in <code>gam()</code> models. I hope you found it useful. If you have any comments or questions, let me know in the comments below.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Fahrmeir2013-xu">
<p>
Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. (2013). <em>Regression: Models, methods and applications</em>. Springer Berlin Heidelberg. doi:<a href="https://doi.org/10.1007/978-3-642-34333-9">10.1007/978-3-642-34333-9</a>.
</p>
</div>
<div id="ref-Molenberghs2000-vk">
<p>
Molenberghs, G., and Verbeke, G. (2000). <em>Linear mixed models for longitudinal data</em>. Springer, New York, NY. doi:<a href="https://doi.org/10.1007/978-1-4419-0300-6">10.1007/978-1-4419-0300-6</a>.
</p>
</div>
<div id="ref-Wood2013-gz">
<p>
Wood, S. N. (2013). A simple test for random effects in regression models. <em>Biometrika</em> 100, 1005ā1010. doi:<a href="https://doi.org/10.1093/biomet/ast038">10.1093/biomet/ast038</a>.
</p>
</div>
<div id="ref-Wood2017-qi">
<p>
Wood, S. N. (2017). <em>Generalized Additive Models: An introduction with R, second edition</em>. CRC Press.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
or matrices; smooths can have multiple penalty matrices, which are stacked block-diagonally in <span class="math inline">\(\mathbf{S}\)</span>. For simplicity's sake I'm just going to assume a single penalty matrix.<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
<li id="fn2">
<p>
You can read about this in brief in §5.8 of <span class="citation" data-cites="Wood2017-qi">Wood (2017)</span> and follow up via the references therein.<a href="#fnref2" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
Getting data from the Canada Covid-19 Tracker using R
Gavin L. Simpson
2021-01-31T10:49:00-06:00
https://www.fromthebottomoftheheap.net/2021/01/31/getting-data-from-canada-covid-19-tracker-using-r/
<p>
Last semester (Fall 2020) I taught a new course in healthcare data science for the <a href="https://www.schoolofpublicpolicy.sk.ca/">Johnson Shoyama Graduate School in Public Policy</a>. One of the final topics of the course was querying application programming interfaces (APIs) from within R. The example we used was querying data on the Covid-19 pandemic from the <a href="https://covid19tracker.ca">Covid-19 Tracker Canada</a>, which has a simple API that's easy to work with. In this post I'll show how we accessed the API from within R and converted the query responses into something we can work with easily.
</p>
<p>
There are many ways of querying APIs in R via a range of packages. Here, I'm going to use the <strong>httr</strong> 📦 to query the API and the <strong>jsonlite</strong> 📦 to convert the API's response to our query into something more useful. The packages we need are listed in the chunk below – if you don't have them, uncomment the <code>install.packages()</code> line and change <code>Ncpus</code> to something suitable for your computer.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pkgs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'httr'</span><span class="p">,</span><span class="w"> </span><span class="s1">'jsonlite'</span><span class="p">,</span><span class="w"> </span><span class="s1">'dplyr'</span><span class="p">,</span><span class="w"> </span><span class="s1">'ggplot2'</span><span class="p">,</span><span class="w"> </span><span class="s1">'purrr'</span><span class="p">)</span><span class="w">
</span><span class="c1">## install.packages(pkgs, Ncpus = 4)</span><span class="w">
</span><span class="n">vapply</span><span class="p">(</span><span class="n">pkgs</span><span class="p">,</span><span class="w"> </span><span class="n">library</span><span class="p">,</span><span class="w"> </span><span class="n">logical</span><span class="p">(</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">logical.return</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">character.only</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> httr jsonlite dplyr ggplot2 purrr
TRUE TRUE TRUE TRUE TRUE </code></pre>
</figure>
<p>
The kind of API we're going to query is a RESTful API – <strong>RE</strong>presentational <strong>S</strong>tate <strong>T</strong>ransfer. To make the query, we need to identify the resource we want and then send the query using HTTP, the <strong>H</strong>yper<strong>T</strong>ext <strong>T</strong>ransfer <strong>P</strong>rotocol. The resource identity is specified using a uniform resource identifier, or URI
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/ps4ds-figure-14-2.png" alt="Graphic showing the parts of a URI. Source: Programming Skills for Data Science (Ross & Freeman, 2018)" />
<figcaption>
Graphic showing the parts of a URI. Source: Programming Skills for Data Science (Ross & Freeman, 2018)
</figcaption>
</figure>
<p>
The URI comprises four parts
</p>
<ol type="1">
<li>
the protocol
</li>
<li>
the base URI
</li>
<li>
the endpoint
</li>
<li>
additional query parameters
</li>
</ol>
<p>
For the Covid-19 Tracker Canada, we'll use the HTTPS protocol for secure HTTP, and its base URI is <code>api.covid19tracker.ca</code>. The <em>endpoint</em> is the specific location of the data you want to access. For the API we're querying, endpoints include
</p>
<ul>
<li>
<code>/reports</code>
</li>
<li>
<code>/cases</code>
</li>
<li>
<code>/fatalities</code>
</li>
<li>
<code>/provinces</code>
</li>
</ul>
<p>
Endpoints can also allow multiple sub-resources; these are variables and take the form <code>:var_name</code>. For example, the <code>/reports/province</code> endpoint allows the province to be specified as a sub-resource. It is documented as <code>/reports/province/:code</code>, so we would specify endpoints such as <code>/reports/province/SK</code>, where we are setting <code>:code</code> to <code>SK</code>, as shown in the sketch below.
</p>
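<p>
For example, a small (hypothetical) helper that fills in the <code>:code</code> sub-resource might look like
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## hypothetical helper: substitute a province code into the endpoint
province_endpoint <- function(code) {
    paste0("/reports/province/", tolower(code))
}
province_endpoint("SK") # "/reports/province/sk"</code></pre>
</figure>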
<p>
The final part of the URI is the query parameters, which allow some fine control over what is requested from the endpoint. These are added as key-value pairs following a <code>?</code>, with pairs separated by <code>&</code>. The key is the name of the parameter, and the value is what you want to pass to that parameter. For example, when querying cases, we can specify the province and how many cases are returned per page using
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">/</span><span class="n">cases</span><span class="o">?</span><span class="n">province</span><span class="o">=</span><span class="n">ON</span><span class="o">&</span><span class="n">per_page</span><span class="o">=</span><span class="m">50</span></code></pre>
</figure>
<p>
Which endpoints and query parameters are supported is documented by the specific API you are trying to access, so always take some time to familiarise yourself with the API itself. For the Covid-19 Tracker Canada, the documentation is also at <a href="https://api.covid19tracker.ca">api.covid19tracker.ca</a>.
</p>
<p>
It's usually best to build the URI up from these parts, stored as separate objects within R
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">base</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"https://api.covid19tracker.ca"</span><span class="w">
</span><span class="n">ep</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"/reports/province/sk"</span><span class="w">
</span><span class="n">query</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"?date=2021-01-31"</span><span class="w">
</span><span class="n">req</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="n">base</span><span class="p">,</span><span class="w"> </span><span class="n">ep</span><span class="p">,</span><span class="w"> </span><span class="n">query</span><span class="p">)</span><span class="w">
</span><span class="n">req</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] "https://api.covid19tracker.ca/reports/province/sk?date=2021-01-31"</code></pre>
</figure>
<p>
The HTTP request involves using a <em>verb</em> and the URI – here we will use the <code>GET</code> verb. In the <strong>httr</strong> 📦 the <code>GET</code> verb is provided by the <code>GET()</code> function
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">response</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">GET</span><span class="p">(</span><span class="n">req</span><span class="p">)</span></code></pre>
</figure>
<p>
The response consists of two parts
</p>
<ol type="1">
<li>
the headers
</li>
<li>
the body
</li>
</ol>
<p>
The headers contain information about the request and response, while the body contains the result of the query. You can access these components of the response using <code>headers()</code> and <code>content()</code> respectively.
</p>
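<p>
For example, with the <code>response</code> object created above:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">headers(response)[["content-type"]] # "application/json"
## let httr parse the JSON body into an R list
str(content(response, as = "parsed"), max.level = 1)</code></pre>
</figure>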
<p>
When you print <code>response</code> you'll see a brief summary of the response metadata
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">response</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Response [https://api.covid19tracker.ca/reports/province/sk?date=2021-01-31]
Date: 2021-01-31 22:42
Status: 200
Content-Type: application/json
Size: 488 B</code></pre>
</figure>
<p>
The status code is important; 200 means <strong>success</strong> and anything else likely indicates some form of failure. Keep an eye on the status code of your queries. If you're wrapping these calls in a function, the <code>warn_for_status()</code> and <code>stop_for_status()</code> functions query the status and throw a warning or an error, respectively, if the request failed.
</p>
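<p>
A minimal sketch of how you might use these helpers when wrapping a request in a function of your own (the function name is just for illustration):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">get_tracker <- function(uri) {
    response <- GET(uri)
    ## throws an informative error unless the status indicates success
    stop_for_status(response)
    response
}</code></pre>
</figure>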
<p>
The body of the response can be accessed as a generic R list, as the raw bytes of the response, or as plain text. When viewed as text, we see that the text format is JSON
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">jsonlite</span><span class="o">::</span><span class="n">prettify</span><span class="p">(</span><span class="n">content</span><span class="p">(</span><span class="n">response</span><span class="p">,</span><span class="w"> </span><span class="s1">'text'</span><span class="p">,</span><span class="w"> </span><span class="n">encoding</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'UTF-8'</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">{
"province": "sk",
"data": [
{
"date": "2021-01-31",
"change_cases": 238,
"change_fatalities": 4,
"change_tests": 2459,
"change_hospitalizations": -3,
"change_criticals": 3,
"change_recoveries": 223,
"change_vaccinations": 120,
"change_vaccines_distributed": 0,
"change_vaccinated": 0,
"total_cases": 23863,
"total_fatalities": 304,
"total_tests": 508638,
"total_hospitalizations": 203,
"total_criticals": 31,
"total_recoveries": 21026,
"total_vaccinations": 35359,
"total_vaccines_distributed": 32725,
"total_vaccinated": 4637
}
]
}
</code></pre>
</figure>
<p>
Above, I used the <code>prettify()</code> function to display the JSON in a human-readable format. Note also that I'm specifying the encoding explicitly to be UTF-8 as that's what my Linux system uses. If you're not sure about the encoding for your system, just leave the <code>encoding</code> argument off and you'll see a message indicating what encoding was used.
</p>
<p>
To actually parse the JSON into a similar R object we use <code>jsonlite::fromJSON()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">parsed</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fromJSON</span><span class="p">(</span><span class="n">content</span><span class="p">(</span><span class="n">response</span><span class="p">,</span><span class="w"> </span><span class="s1">'text'</span><span class="p">,</span><span class="w"> </span><span class="n">encoding</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'UTF-8'</span><span class="p">))</span><span class="w">
</span><span class="n">str</span><span class="p">(</span><span class="n">parsed</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">List of 2
$ province: chr "sk"
$ data :'data.frame': 1 obs. of 19 variables:
..$ date : chr "2021-01-31"
..$ change_cases : int 238
..$ change_fatalities : int 4
..$ change_tests : int 2459
..$ change_hospitalizations : int -3
..$ change_criticals : int 3
..$ change_recoveries : int 223
..$ change_vaccinations : int 120
..$ change_vaccines_distributed: int 0
..$ change_vaccinated : int 0
..$ total_cases : int 23863
..$ total_fatalities : int 304
..$ total_tests : int 508638
..$ total_hospitalizations : int 203
..$ total_criticals : int 31
..$ total_recoveries : int 21026
..$ total_vaccinations : int 35359
..$ total_vaccines_distributed : int 32725
..$ total_vaccinated : int 4637</code></pre>
</figure>
<p>
What we're most interested in is the <code>$data</code> component, but you can see that <strong>jsonlite</strong> 📦 has converted the JSON to an R list and, where appropriate, has converted arrays to data frames, as for <code>$data</code> here. Exactly what is returned by the API will be specific to each API, so read the documentation for the API you want and look at the structure of what is returned to identify the names of relevant components etc.
</p>
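<p>
For example, to pull out just the data frame of results (<code>sk_report</code> is just an illustrative name):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## the data component is already a data frame thanks to jsonlite
sk_report <- parsed$data
sk_report$total_cases</code></pre>
</figure>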
<h2 id="covid-19-cases-per-day">
Covid-19 cases per day
</h2>
<p>
Now that we've had a crash course in querying an API, let's do something substantive and query the Covid-19 case data for my adopted home province of Saskatchewan. For this we want the <code>/reports</code> endpoint and we can specify the province as a sub-resource.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">base</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'https://api.covid19tracker.ca'</span><span class="w">
</span><span class="n">ep</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'/reports/province/sk'</span><span class="w">
</span><span class="n">req</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="n">base</span><span class="p">,</span><span class="w"> </span><span class="n">ep</span><span class="p">)</span><span class="w">
</span><span class="n">response</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">GET</span><span class="p">(</span><span class="n">req</span><span class="p">)</span><span class="w">
</span><span class="n">cases</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">response</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">content</span><span class="p">(</span><span class="n">as</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'text'</span><span class="p">,</span><span class="w"> </span><span class="n">encoding</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'UTF-8'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">fromJSON</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pluck</span><span class="p">(</span><span class="s1">'data'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">as_tibble</span><span class="p">()</span><span class="w">
</span><span class="n">cases</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 373 x 19
date change_cases change_fataliti… change_tests change_hospital…
<chr> <int> <int> <int> <int>
1 2020… NA NA 0 0
2 2020… NA NA 0 0
3 2020… NA NA 0 0
4 2020… NA NA 0 0
5 2020… NA NA 0 0
6 2020… NA NA 0 0
7 2020… NA NA 0 0
8 2020… NA NA 0 0
9 2020… NA NA 0 0
10 2020… NA NA 0 0
# … with 363 more rows, and 14 more variables: change_criticals <int>,
# change_recoveries <int>, change_vaccinations <int>,
# change_vaccines_distributed <int>, change_vaccinated <int>,
# total_cases <int>, total_fatalities <int>, total_tests <int>,
# total_hospitalizations <int>, total_criticals <int>,
# total_recoveries <int>, total_vaccinations <int>,
# total_vaccines_distributed <int>, total_vaccinated <int></code></pre>
</figure>
<p>
At the moment the <code>date</code> variable is stored as a simple character vector. If we convert that to a <code>"Date"</code> object, <strong>ggplot2</strong> 📦 will draw a nicely formatted time axis for us
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">cases</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.Date</span><span class="p">(</span><span class="n">date</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">date</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">change_cases</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Cases'</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Daily Covid-19 cases in Saskatchewan'</span><span class="p">,</span><span class="w">
</span><span class="n">caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Source: N. Little. COVID-19 Tracker Canada (2021), COVID19tracker.ca'</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning: Removed 47 row(s) containing missing values (geom_path).</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/getting-data-from-canada-covid-19-tracker-using-r-plot-cases-1.png" alt="Daily Covid-19 Cases in Saskatchewan" />
<figcaption>
Daily Covid-19 Cases in Saskatchewan
</figcaption>
</figure>
<p>
Yeah, we're not doing very well in this province 😞🤬
</p>
<p>
Hope you enjoyed the post; if you have comments or questions, ask them in the comments section below.
</p>
<h3 id="references">
References
</h3>
Two new versions of gratia released
Gavin L. Simpson
2021-01-30T15:00:00-06:00
2021-01-30T15:00:00-06:00
https://www.fromthebottomoftheheap.net/2021/01/30/two-new-versions-of-gratia-on-cran/
<p>
While the Covid-19 pandemic and teaching a new course in the fall put paid to most of my development time last year, some time off work this January allowed me time to work on <strong>gratia</strong> 📦 again. I released 0.5.0 to CRAN in part to fix an issue with tests not running on the new M1 chips from Apple because I wasn't using <strong>vdiffr</strong> 📦 conditionally. Version 0.5.1 followed shortly thereafter as I'd messed up an argument name in <code>smooth_estimates()</code>, a new function that I hope will allow development to proceed more quickly and make it easier to maintain code and extend functionality to cover a greater range of model types. Read on to find out more about <code>smooth_estimates()</code> and what else was in these two releases.
</p>
<h2 id="evaluating-smooths-with-smooth_estimates">
Evaluating smooths with <code>smooth_estimates()</code>
</h2>
<p>
For a while now I've realised that the way I'd implemented <code>evaluate_smooth()</code> wasn't great. Some design decisions I took early on added a lot of unnecessary complexity to the function through the handling of factor <code>by</code> smooths, which didn't really work properly in the context of a GAM where the same variable could appear in multiple smooth terms.
</p>
<p>
My original plan was to use a facetted plot for factor <code>by</code> variable smooths, and so when you selected a model term (more on that later), if that term was a factor <code>by</code> smooth, instead of just pulling in a single smooth, I would pull in all of the smooths associated with the factor <code>by</code>. Handling this got complicated and resulted in some kludgy, messy code that was prone to failure when used with a more specialised smooth or a more complex model.
</p>
<p>
Additionally, how I initially implemented selection of model terms was a bit silly; a user could pass a string for a variable that would be matched against the labels that <strong>mgcv</strong> 📦 uses for smooths. Any instance of the term in any smooth would then get selected, which is not usually what is wanted when working with complex models with multiple smooths, some of which might contain the same variable.
</p>
<p>
Because of this, in the summer I decided to completely rewrite <code>evaluate_smooth()</code>. Then I realised this would not be a good idea, as I was going to break a lot of existing code, including code we'd written in support of papers that had been published and which used <code>evaluate_smooth()</code>. Instead, I decided to start from a clean slate with a new function that didn't repeat any of the silly things I'd messed up <code>evaluate_smooth()</code> with, and which would be much simpler to maintain and develop for a wider range of complex distributional models.
</p>
<p>
In writing <code>smooth_estimates()</code> I also came up with a standard way to represent all evaluations of a smooth, regardless of type. The nice thing about this is that it's easy to return a tibble containing all the values of the evaluated smooth for many smooths at once, something you couldn't do with <code>evaluate_smooth()</code>.
</p>
<p>
The idea behind <code>evaluate_smooth()</code> and <code>smooth_estimates()</code> is to return a tibble of values of the smooth evaluated at a grid of <code>n</code> points over each of the covariates involved in that smooth.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'mgcv'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'gratia'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'tidyr'</span><span class="p">)</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_sim</span><span class="p">(</span><span class="s2">"eg1"</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="n">gam_model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x0</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cr"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x2</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x3</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ps"</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span><span class="w">
</span><span class="n">smooth_estimates</span><span class="p">(</span><span class="n">gam_model</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 400 x 9
smooth type by est se x0 x1 x2 x3
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 s(x0) TPRS NA -1.34 0.392 0.000239 NA NA NA
2 s(x0) TPRS NA -1.26 0.366 0.0103 NA NA NA
3 s(x0) TPRS NA -1.19 0.342 0.0204 NA NA NA
4 s(x0) TPRS NA -1.11 0.319 0.0304 NA NA NA
5 s(x0) TPRS NA -1.03 0.298 0.0405 NA NA NA
6 s(x0) TPRS NA -0.956 0.280 0.0506 NA NA NA
7 s(x0) TPRS NA -0.881 0.264 0.0606 NA NA NA
8 s(x0) TPRS NA -0.806 0.250 0.0707 NA NA NA
9 s(x0) TPRS NA -0.733 0.238 0.0807 NA NA NA
10 s(x0) TPRS NA -0.661 0.229 0.0908 NA NA NA
# … with 390 more rows
</figure>
<p>
This seems a little wasteful (all those <code>NA</code> columns 😱) but the output is a consistent way to represent smooths, regardless of the number of covariates etc.
</p>
<p>
I'm toying with returning the tibble in a nested fashion with <code>nest()</code>, something like
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">sm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">smooth_estimates</span><span class="p">(</span><span class="n">gam_model</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">nest</span><span class="p">(</span><span class="n">values</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">est</span><span class="p">,</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">starts_with</span><span class="p">(</span><span class="s1">'x'</span><span class="p">))</span><span class="w">
</span><span class="n">sm</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 4 x 5
smooth type by values data
<chr> <chr> <chr> <list> <list>
1 s(x0) TPRS NA <tibble [100 × 2]> <tibble [100 × 4]>
2 s(x1) CRS NA <tibble [100 × 2]> <tibble [100 × 4]>
3 s(x2) B spline NA <tibble [100 × 2]> <tibble [100 × 4]>
4 s(x3) P spline NA <tibble [100 × 2]> <tibble [100 × 4]>
</figure>
<p>
which I think is much neater, but does require extra steps from the user just to use the output
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">sm</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">unnest</span><span class="p">(</span><span class="n">cols</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">values</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="p">))</span><span class="w"> </span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 400 x 9
smooth type by est se x0 x1 x2 x3
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 s(x0) TPRS NA -1.34 0.392 0.000239 NA NA NA
2 s(x0) TPRS NA -1.26 0.366 0.0103 NA NA NA
3 s(x0) TPRS NA -1.19 0.342 0.0204 NA NA NA
4 s(x0) TPRS NA -1.11 0.319 0.0304 NA NA NA
5 s(x0) TPRS NA -1.03 0.298 0.0405 NA NA NA
6 s(x0) TPRS NA -0.956 0.280 0.0506 NA NA NA
7 s(x0) TPRS NA -0.881 0.264 0.0606 NA NA NA
8 s(x0) TPRS NA -0.806 0.250 0.0707 NA NA NA
9 s(x0) TPRS NA -0.733 0.238 0.0807 NA NA NA
10 s(x0) TPRS NA -0.661 0.229 0.0908 NA NA NA
# … with 390 more rows
</figure>
<p>
Internally, the individual smooths are nested by default, as that makes it easy to join the tibbles for multiple smooths together. As such, the <em>un</em>nested form of the current behaviour requires an explicit extra step within <code>smooth_estimates()</code>.
</p>
<p>
If you have thoughts about this, let me know in the comments below.
</p>
<p>
<code>smooth_estimates()</code> is going to supersede <code>evaluate_smooth()</code>, and currently it can handle pretty much everything that <code>evaluate_smooth()</code> can do. That doesn't mean <code>evaluate_smooth()</code> is going anywhere; as I mentioned above, I don't want to break old code, so as long as it doesn't take too much time to maintain, <code>evaluate_smooth()</code> isn't hurting anyone and there's no rush to put it out to pasture.
</p>
<p>
Version 0.5.0 introduced <code>smooth_estimates()</code>, which could only handle very simple univariate smooths; version 0.5.1 expanded those capabilities. There are a few special smooths that I haven't yet added support for, including Markov random field smooths and soap film smooths. Support for those will be added by the time version 0.6.0 hits CRAN later this year.
</p>
<h2 id="partial-residuals">
Partial residuals
</h2>
<p>
Version 0.4.0 introduced the ability to add partial residuals to plots of smooths. Version 0.5.0 exposes this functionality for computing partial residuals via the new function <code>partial_residuals()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">partial_residuals</span><span class="p">(</span><span class="n">gam_model</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 400 x 4
`s(x0)` `s(x1)` `s(x2)` `s(x3)`
<dbl> <dbl> <dbl> <dbl>
1 -0.236 -1.20 -2.19 0.730
2 0.00545 0.640 -1.79 1.10
3 1.58 1.66 5.59 1.13
4 -1.24 -1.83 -0.892 -0.783
5 -2.21 -0.100 -2.71 -3.10
6 1.27 -1.20 3.93 0.0835
7 -0.599 2.94 -0.793 -1.10
8 1.59 0.402 7.04 2.09
9 2.74 0.449 7.33 2.45
10 1.11 -0.263 0.730 0.703
# … with 390 more rows
</figure>
<p>
The names are currently non-syntactic (hence all the backticks) and I might change that if I can think of a shorthand way to refer to smooths that still allows referencing them uniquely when there are things like factor <code>by</code> smooths involved.
</p>
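<p>
In the meantime, the backticks let you refer to the columns as usual; for example:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">pr <- partial_residuals(gam_model)
## non-syntactic column names need backticks
pr$`s(x0)`
## or, using dplyr
pull(pr, `s(x0)`)</code></pre>
</figure>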
<p>
I also added <code>add_partial_residuals()</code>, which adds the partial residuals to an existing data frame
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">dat</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">add_partial_residuals</span><span class="p">(</span><span class="n">model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gam_model</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 400 x 14
y x0 x1 x2 x3 f f0 f1 f2 f3 `s(x0)`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2.99 0.915 0.0227 0.909 0.402 1.62 0.529 1.05 0.0397 0 -0.236
2 4.70 0.937 0.513 0.900 0.432 3.25 0.393 2.79 0.0630 0 0.00545
3 13.9 0.286 0.631 0.192 0.664 13.5 1.57 3.53 8.41 0 1.58
4 5.71 0.830 0.419 0.532 0.182 6.12 1.02 2.31 2.79 0 -1.24
5 7.63 0.642 0.879 0.522 0.838 10.4 1.80 5.80 2.76 0 -2.21
6 9.80 0.519 0.108 0.160 0.917 10.4 2.00 1.24 7.18 0 1.27
7 10.4 0.737 0.980 0.520 0.798 11.3 1.47 7.10 2.75 0 -0.599
8 12.8 0.135 0.265 0.225 0.503 11.4 0.821 1.70 8.90 0 1.59
9 13.8 0.657 0.0843 0.282 0.254 11.1 1.76 1.18 8.20 0 2.74
10 7.51 0.705 0.386 0.504 0.667 6.50 1.60 2.16 2.74 0 1.11
# … with 390 more rows, and 3 more variables: `s(x1)` <dbl>, `s(x2)` <dbl>,
# `s(x3)` <dbl></code></pre>
</figure>
<p>
but since implementing this I have been questioning whether the implementation is a good thing; there's nothing in the code currently to ensure that the data you provide match the order of the data used to fit the model, so <em>caveat emptor</em>!
</p>
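<p>
Until that's addressed, a cheap sanity check is to compare your data with the model frame stored in the fitted model. This is only a sketch, and it only catches mismatched sizes, not reordered rows:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## the model frame used during fitting is stored in the fitted object
stopifnot(nrow(dat) == nrow(gam_model$model))</code></pre>
</figure>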
<h2 id="penalty-matrices">
Penalty matrices
</h2>
<p>
I've been adding functions to <strong>gratia</strong> that will be helpful when teaching GAMs; I added <code>basis()</code> a while back, and in the 0.5.1 release I added <code>penalty()</code>, for extracting and tidying the penalty matrices of smooths from fitted GAM models.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">penalty</span><span class="p">(</span><span class="n">gam_model</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 324 x 6
smooth type penalty row col value
<chr> <chr> <chr> <chr> <chr> <dbl>
1 s(x0) TPRS s(x0) f1 f1 9.81
2 s(x0) TPRS s(x0) f1 f2 -1.45
3 s(x0) TPRS s(x0) f1 f3 -5.00
4 s(x0) TPRS s(x0) f1 f4 -1.34
5 s(x0) TPRS s(x0) f1 f5 -6.24
6 s(x0) TPRS s(x0) f1 f6 3.90
7 s(x0) TPRS s(x0) f1 f7 -7.74
8 s(x0) TPRS s(x0) f1 f8 -1.79
9 s(x0) TPRS s(x0) f1 f9 0
10 s(x0) TPRS s(x0) f2 f1 -1.45
# … with 314 more rows
</figure>
<p>
There is also a <code>draw()</code> method, to plot the penalty matrix
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">gam_model</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">penalty</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">draw</span><span class="p">()</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/two-new-versions-of-gratia-draw-penalty-1.png" alt="Penalty matrices for smooths from the fitted GAM. Note that in the released version you need to visually flip the y-axis so that diagonal runs top-left to bottom-right to match with how the matrix is actually arranged; this is fixed in the GitHub version." />
<figcaption>
Penalty matrices for smooths from the fitted GAM. Note that in the released version you need to visually flip the y-axis so that the diagonal runs top-left to bottom-right to match how the matrix is actually arranged; this is fixed in the GitHub version.
</figcaption>
</figure>
<p>
It was pointed out that the way this is plotted is not very intuitive if you're trying to map the way the penalty matrix is written to what's shown in the plot; you have to flip the y-axis. This is due to how <code>geom_raster()</code> draws things. I have fixed this, but it's only fixed in the GitHub version of the package, not in a current release.
</p>
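<p>
If you're on the release version, one workaround is to plot the tidy output from <code>penalty()</code> yourself and reverse the y axis; a sketch, assuming the tidy format shown above:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library('ggplot2')
penalty(gam_model) %>%
    ggplot(aes(x = col, y = row, fill = value)) +
    geom_tile() +
    ## reversing the y axis puts the diagonal top-left to bottom-right
    scale_y_discrete(limits = rev) +
    facet_wrap(~ smooth)</code></pre>
</figure>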
<h2 id="colour-scales">
Colour scales
</h2>
<p>
<code>draw.gam()</code> and some related <code>draw()</code> methods now allow you to configure the colour scales used to plot GAMs. Available options include <code>discrete_colour</code>, <code>continuous_colour</code>, and <code>continuous_fill</code>, which take a suitable scale, allowing you to change the colour scheme used:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">dat2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_sim</span><span class="p">(</span><span class="s2">"eg2"</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="n">dist</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"normal"</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">gam_model2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">40</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat2</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span><span class="w">
</span><span class="n">draw</span><span class="p">(</span><span class="n">gam_model2</span><span class="p">,</span><span class="w"> </span><span class="n">n_contour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w">
</span><span class="n">continuous_fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ggplot2</span><span class="o">::</span><span class="n">scale_fill_distiller</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Spectral"</span><span class="p">,</span><span class="w">
</span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"div"</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/two-new-versions-of-gratia-colour-scales-1.png" alt="Changing the fill scale used by draw()" />
<figcaption>
Changing the fill scale used by <code>draw()</code>
</figcaption>
</figure>
<h2 id="constant-and-fun">
<code>constant</code> and <code>fun</code>
</h2>
<p>
<code>draw.gam()</code> can now plot smooths after addition of a constant and transformation via a function. This can be used to put smooths (sort of) on the response scale. For example, in the code below, I add the model intercept to each smooth when plotting
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">b0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">gam_model</span><span class="p">)[</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">draw</span><span class="p">(</span><span class="n">gam_model</span><span class="p">,</span><span class="w"> </span><span class="n">constant</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b0</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/two-new-versions-of-gratia-constant-draw-gam-1.png" alt="Plotting smooths, rescaling the y-axis to include the model intercept term in the scale." />
<figcaption>
Plotting smooths, rescaling the y-axis to include the model intercept term in the scale.
</figcaption>
</figure>
<p>
I plan to add an argument <code>response</code>, which would take a logical to indicate if you want to plot on the response scale. If <code>response = TRUE</code>, it would override anything passed to <code>constant</code> and <code>fun</code>, such that <code>draw.gam()</code> would just do the right thing and figure out from the model what constant and inverse link function to use. Watch out for that in 0.6.0.
</p>
<h2 id="excluding-or-selecting-terms-to-include-in-model-predictions">
Excluding or selecting terms to include in model predictions
</h2>
<p>
<code>predict.gam()</code> allows the user to either exclude or specifically include only selected terms in model predictions. Version 0.5.0 added the same functionality to <code>simulate.gam()</code> and <code>predicted_samples()</code>, allowing you to pass along an <code>exclude</code> or <code>terms</code> argument to <code>predict.gam()</code>, which is used in both of these functions.
</p>
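<p>
For example, to simulate new response data while ignoring the contribution of one smooth, something like this should work (a sketch; the <code>exclude</code> argument is simply passed through to <code>predict.gam()</code>):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## five simulated draws from the model, excluding the effect of s(x3)
sims <- simulate(gam_model, nsim = 5, seed = 42, exclude = "s(x3)")</code></pre>
</figure>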
<h2 id="summary">
Summary
</h2>
<p>
All in all, these are not major changes to the functionality of <strong>gratia</strong>, but the ground work laid in <code>smooth_estimates()</code> should allow me to address lots of the outstanding bugs related to handling complex models and some complex smooth types, and I'm pretty excited about that.
</p>
Extrapolating with B splines and GAMs
Gavin L. Simpson
2020-06-03T13:00:00-06:00
2020-06-03T13:00:00-06:00
https://www.fromthebottomoftheheap.net/2020/06/03/extrapolating-with-gams/
<p>
An issue that often crops up when modelling with generalized additive models (GAMs), especially with time series or spatial data, is how to extrapolate beyond the range of the data used to train the model. The issue arises because GAMs use splines to learn from the data, and the splines themselves are built from basis functions that are typically set up in terms of the data used to fit the model. If there are no basis functions beyond the range of the input data, what exactly is being used if we want to extrapolate? A related issue is that of the wiggliness penalty; depending on the type of basis used, the penalty could extend over the entire real line (-∞, ∞) or only over the range of the input data. In this post I want to take a practical look at the extrapolation behaviour of splines in GAMs fitted with the <strong>mgcv</strong> package for R. In particular, I want to illustrate how flexible the B spline basis is.
</p>
<p>
A lot of what I discuss in this post draws heavily on the help page in <strong>mgcv</strong> for the B spline basis (<code>?mgcv::b.spline</code>) and on a recent email discussion with Alex Hayes, Dave Miller, and Eric Pedersen, though what I write here reflects my own input to that discussion.
</p>
<p>
I was initially minded to look into this again after reading a <a href="https://arxiv.org/abs/2004.11408">new preprint</a> on low-rank approximations to a Gaussian process <span class="citation" data-cites="Riutort-Mayol2020-ih">(GP; Riutort-Mayol et al., 2020)</span>, where, among other things, the authors compare the behaviour of the exact GP model with their low-rank version and with a thin plate regression spline (TPRS). The TPRS is the sort of thing you'd get by default with <strong>mgcv</strong> and <code>s()</code>, but as the other models were all fully Bayesian, the TPRS model was fitted using <code>brm()</code> from the <strong>brms</strong> package so that all the models were comparable, ultimately being fitted in <strong>Stan</strong>. The TPRS model didn't do a very good job of fitting the test observations when extrapolating beyond the limits of the data. I wondered if we could do any better with the B spline basis in <strong>mgcv</strong>, as I knew it had extra flexibility for short extrapolation beyond the data, but I'd never really looked into how it worked or what the respective behaviour was.
</p>
<p>
If you want to recreate elements of the rest of the post, you'll need the following packages installed:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## Packages</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'ggplot2'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'tibble'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'tidyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'mgcv'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'gratia'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'patchwork'</span><span class="p">)</span><span class="w">
</span><span class="c1">## remotes::install_github("clauswilke/colorblindr")</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'colorblindr'</span><span class="p">)</span><span class="w">
</span><span class="c1">## remotes::install_github("clauswilke/relayer")</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'relayer'</span><span class="p">)</span></code></pre>
</figure>
<p>
The last two are used for plotting; the <strong>relayer</strong> package in particular is needed as I'm going to be using two separate colour scales on the plots. If you don't have these installed, you can install them using the <strong>remotes</strong> package and the code in the commented lines above.
</p>
<p>
The example data set used in the comparison had been posted to the preprint's GitHub repo, so it was easy to grab it and start playing. To load the data into R we can use
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">load</span><span class="p">(</span><span class="n">url</span><span class="p">(</span><span class="s2">"https://bit.ly/gprocdata"</span><span class="p">))</span><span class="w">
</span><span class="n">ls</span><span class="p">()</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] "f_true"</code></pre>
</figure>
<p>
where the Bitly short link just links to the <code>.Rdata</code> file stored on GitHub. This creates an object, <code>f_true</code>, in the workspace. We'll look at the true function in a minute. Following the preprint, a data set of noisy observations is simulated from the true function by adding Gaussian noise (μ = 0, σ = 0.2)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">seed</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1234</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span><span class="w">
</span><span class="n">gp_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">truth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unname</span><span class="p">(</span><span class="n">f_true</span><span class="p">),</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.002</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">truth</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">truth</span><span class="p">),</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">))</span></code></pre>
</figure>
<p>
From that noisy set, we sample 250 observations at random, and indicate some of the observations as being in a test set that we won't use when fitting GAMs
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span><span class="w">
</span><span class="n">r_samp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample_n</span><span class="p">(</span><span class="n">gp_data</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">250</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">data_set</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">-0.8</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"test"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">0.8</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"test"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">-0.45</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">-0.36</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"test"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">-0.05</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">0.05</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"test"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">0.45</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">0.6</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"test"</span><span class="p">,</span><span class="w">
</span><span class="kc">TRUE</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"train"</span><span class="p">))</span></code></pre>
</figure>
<p>
Finally we visualize the true function and the noisy observations we sampled from it
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_set</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">truth</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">),</span><span class="w"> </span><span class="n">show.legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Data set"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/extrapolating-with-b-splines-and-gams-plot-noisy-data-1.svg" alt="The true function and noisy observations drawn from it. The blue dots are the training observations that weāll use to fit models, while the red dots are test observations used to investigate how the models interpolate and extrapolate." />
<figcaption>
The true function and noisy observations drawn from it. The blue dots are the training observations that we'll use to fit models, while the red dots are test observations used to investigate how the models interpolate and extrapolate.
</figcaption>
</figure>
<p>
The red points are the test observations and will be used to look at the behaviour of the splines under interpolating and extrapolating conditions.
</p>
<h2 id="thin-plate-splines">
Thin Plate splines
</h2>
<p>
Firstly, we'll look at how the thin plate splines behave under extrapolation, recreating the behaviour from the preprint. I start by fitting two GAMs where we use 50 basis functions (<code>k = 50</code>) from the TPRS basis (<code>bs = "tp"</code>). The argument <code>m</code> controls the order of the derivative penalty; the default is <code>m = 2</code>, for a second derivative penalty (penalising the curvature of the spline). For the second model, we use <code>m = 1</code>, indicating a penalty on the first derivative of the TPRS, which penalises deviations from a flat function. Note that we filter the sample of noisy data to include only the training observations.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m_tprs2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tp"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span><span class="w">
</span><span class="c1">## first order penalty</span><span class="w">
</span><span class="n">m_tprs1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"tp"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span></code></pre>
</figure>
<p>
I won't worry about looking at model diagnostics in this post, and will instead skip to looking at how these two models behave when we predict beyond the limits of the training data.
</p>
<p>
Next I define some new observations to predict at from the two models
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">new_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-1.5</span><span class="p">,</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.002</span><span class="p">))</span></code></pre>
</figure>
<p>
Remember, the training data covered the interval -0.8 to 0.8, so we're extrapolating quite far, proportionally, from the support of the training data. Now we can predict from the two models
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p_tprs2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_tprs2</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_tprs_2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_tprs_2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">p_tprs1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_tprs1</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_tprs_1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_tprs_1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span></code></pre>
</figure>
<p>
Note that we have named the two columns of data with some information that we'll need for plotting, so the underscores are important.
</p>
<p>
Next we do some data wrangling to get the predictions into a tidy format suitable for plotting
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">crit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">qnorm</span><span class="p">((</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">0.89</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">new_data_tprs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bind_cols</span><span class="p">(</span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">p_tprs2</span><span class="p">,</span><span class="w"> </span><span class="n">p_tprs1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_longer</span><span class="p">(</span><span class="n">fit_tprs_2</span><span class="o">:</span><span class="n">se_tprs_1</span><span class="p">,</span><span class="w"> </span><span class="n">names_sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'_'</span><span class="p">,</span><span class="w">
</span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'variable'</span><span class="p">,</span><span class="w"> </span><span class="s1">'spline'</span><span class="p">,</span><span class="w"> </span><span class="s1">'order'</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_wider</span><span class="p">(</span><span class="n">names_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">values_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">upr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w"> </span><span class="n">lwr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">))</span></code></pre>
</figure>
<p>
The basic idea here is that we cast the data to a very general long-and-thin version and pull out variables indicating the type of value (<code>fit</code> = fitted and <code>se</code> = standard error), the type of spline, and the order of the penalty, by splitting on the underscores in each of the input column names. Then we cast the long-and-thin data frame to a slightly wider version where we have access to the <code>fit</code> and <code>se</code> variables, before calculating an 89% credible interval on the predicted values.
</p>
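<p>
As a quick sanity check on the interval width (this snippet is just illustrative), the critical value computed above is the 0.945 quantile of the standard normal, roughly 1.598, so these 89% intervals are a little narrower than the familiar ±1.96 standard error intervals
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"># half-width of an 89% interval in standard error units
qnorm((1 - 0.89) / 2, lower.tail = FALSE) # approximately 1.598</code></pre>
</figure>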
<p>
Now we can plot the data plus the predicted values
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_tprs</span><span class="p">,</span><span class="w">
</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w">
</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">order</span><span class="p">),</span><span class="w">
</span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_set</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_tprs</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">colour2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">order</span><span class="p">),</span><span class="w">
</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename_geom_aes</span><span class="p">(</span><span class="n">new_aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"colour"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">,</span><span class="w"> </span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour"</span><span class="p">,</span><span class="w">
</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Data set"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_OkabeIto</span><span class="p">(</span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Penalty"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_OkabeIto</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Penalty"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_cartesian</span><span class="p">(</span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Extrapolating with thin plate splines"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"How behaviour varies with derivative penalties of different order"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/extrapolating-with-b-splines-and-gams-plot-tprs-predictions-1.svg" alt="Posterior predictive means for the two thin plate regression spline models showing the interpolation and extrapolation behaviour with first and second derivative penalties." />
<figcaption>
Posterior predictive means for the two thin plate regression spline models showing the interpolation and extrapolation behaviour with first and second derivative penalties.
</figcaption>
</figure>
<p>
With the default, second derivative penalty we see that under extrapolation the spline exhibits linear behaviour. For the first derivative penalty model, the behaviour is to predict a constant value. The credible intervals are also unrealistically narrow in the case of the TPRS model with the first derivative penalty. Neither model does a particularly good job of estimating any of the test samples outside the range of <em>x</em> in the training data. The models do better when interpolating, except for the section around <em>x</em> = 0.5.
</p>
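<p>
As a reminder of what's being compared (the two TPRS models were fitted earlier in the post), a sketch of those fits would look something like the following, where I'm assuming <code>k = 50</code> to match the later models; for <code>bs = "tp"</code> the argument <code>m</code> directly sets the order of the derivative penalty
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"># second derivative penalty; the default for a 1-D thin plate spline
m_tprs2 <- gam(y ~ s(x, k = 50, bs = "tp", m = 2),
               data = filter(r_samp, data_set == "train"), method = "REML")
# first derivative penalty
m_tprs1 <- gam(y ~ s(x, k = 50, bs = "tp", m = 1),
               data = filter(r_samp, data_set == "train"), method = "REML")</code></pre>
</figure>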
<h2 id="b-splines">
B splines
</h2>
<p>
OK. What about B splines? With the B spline constructor in <strong>mgcv</strong> we have a lot of control over how we set up the basis and the wiggliness penalty. We'll explore more of these options later, but first let's look at the default behaviour, where the penalty only operates over the range of the training observations.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m_bs_default</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning in smooth.construct.bs.smooth.spec(object, dk$data, dk$knots): there is
*no* information about some basis coefficients</code></pre>
</figure>
<p>
Here we asked for a cubic B spline with a second order penalty; this is your common or garden cubic B spline, where the wiggliness penalty only covers the range of <em>x</em> in the training data. Ignore the warning; it arises because we have many basis functions and some aren't supported by any of the data, owing to the holes left by the test observations.
</p>
<p>
If we want the penalty to extend some way beyond the range of <em>x</em>, we need to pass in a set of end points over which knots will be defined: the two extreme end points that enclose the region we want to predict over, and two interior knots that cover the range of the data, plus a little. We specify these knots below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">knots</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">-0.9</span><span class="p">,</span><span class="w"> </span><span class="m">0.9</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span></code></pre>
</figure>
<p>
and then pass <code>knots</code> to the <code>knots</code> argument when fitting the model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m_bs_extrap</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning in smooth.construct.bs.smooth.spec(object, dk$data, dk$knots): there is
*no* information about some basis coefficients</code></pre>
</figure>
<p>
The only difference here is that we have specified that the penalty should extend away from the limits of the training observations. You'll get another warning here; this will always happen when you set outer knots beyond the range of the data, and it is harmless.
</p>
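<p>
If you want to check where the knots actually ended up (a quick, optional check; as far as I'm aware the <code>knots</code> slot is where <strong>mgcv</strong>'s B spline constructor stores the full knot sequence), you can inspect the smooth object directly
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"># the full knot sequence for the B spline basis, including the outer
# knots at -2 and 2 that extend the penalty beyond the data
m_bs_extrap$smooth[[1]]$knots</code></pre>
</figure>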
<p>
We can visualize the differences in the bases using <code>basis()</code> from the <strong>gratia</strong> package
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">bs_default</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">basis</span><span class="p">(</span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)),</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="m">-0.8</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="m">0.8</span><span class="p">))</span><span class="w">
</span><span class="n">bs_extrap</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">basis</span><span class="p">(</span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)),</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data</span><span class="p">)</span><span class="w">
</span><span class="n">lims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lims</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-1.5</span><span class="p">,</span><span class="w"> </span><span class="m">1.5</span><span class="p">))</span><span class="w">
</span><span class="n">vlines</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">geom_vline</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-0.8</span><span class="p">,</span><span class="w"> </span><span class="m">0.8</span><span class="p">)),</span><span class="w">
</span><span class="n">aes</span><span class="p">(</span><span class="n">xintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">),</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"dashed"</span><span class="p">)</span><span class="w">
</span><span class="p">(</span><span class="n">draw</span><span class="p">(</span><span class="n">bs_default</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">lims</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">vlines</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">draw</span><span class="p">(</span><span class="n">bs_extrap</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">lims</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">vlines</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">plot_annotation</span><span class="p">(</span><span class="n">tag_levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'A'</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/extrapolating-with-b-splines-and-gams-visualize-bases-1.svg" alt="Cubic B spline bases with knots covering the range of training observations (A) and with outer knots covering the range of the training data plus the region where we want to extrapolate (B). Using the outer knots has the effect of extending the wiggliness penalty over the region we want to predict for. The dashed lines are drawn at x = -0.8 and x = 0.8, the limits of the training observations." />
<figcaption>
Cubic B spline bases with knots covering the range of training observations (A) and with outer knots covering the range of the training data plus the region where we want to extrapolate (B). Using the outer knots has the effect of extending the wiggliness penalty over the region we want to predict for. The dashed lines are drawn at <em>x</em> = -0.8 and <em>x</em> = 0.8, the limits of the training observations.
</figcaption>
</figcaption>
</figure>
<p>
Technically, the basis functions in the top panel would extend a little into the prediction region, but <code>basis()</code> can't yet handle using one data set to set up the basis and another at which to evaluate it. Because we have basis functions extending over the interval for prediction, the wiggliness penalty can apply in this region too.
</p>
<p>
Now we predict from both the models as before and repeat the data wrangling
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p_bs_default</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_bs_default</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_bs_default</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_bs_default</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">p_bs_extrap</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_bs_extrap</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_bs_extrap</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_bs_extrap</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">new_data_bs_eg</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bind_cols</span><span class="p">(</span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs_default</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs_extrap</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_longer</span><span class="p">(</span><span class="n">fit_bs_default</span><span class="o">:</span><span class="n">se_bs_extrap</span><span class="p">,</span><span class="w"> </span><span class="n">names_sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'_'</span><span class="p">,</span><span class="w">
</span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'variable'</span><span class="p">,</span><span class="w"> </span><span class="s1">'spline'</span><span class="p">,</span><span class="w"> </span><span class="s1">'penalty'</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_wider</span><span class="p">(</span><span class="n">names_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">values_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">upr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w"> </span><span class="n">lwr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">))</span></code></pre>
</figure>
<p>
The only difference here is that I encoded in the variable names whether we used the default penalty or the one extended beyond the limits of the data. We plot the fits with
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_bs_eg</span><span class="p">,</span><span class="w">
</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">penalty</span><span class="p">),</span><span class="w">
</span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_set</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_bs_eg</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">colour2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">penalty</span><span class="p">),</span><span class="w">
</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename_geom_aes</span><span class="p">(</span><span class="n">new_aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"colour"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">,</span><span class="w"> </span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Data set"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_OkabeIto</span><span class="p">(</span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Penalty"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_OkabeIto</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Penalty"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_cartesian</span><span class="p">(</span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Extrapolating with B splines"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"How behaviour varies when the penalty extends beyond the data"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/extrapolating-with-b-splines-and-gams-plot-bs-eg-predictions-1.svg" alt="Posterior predictive means for the two B spline models showing the interpolation and extrapolation behaviour when the penalty only covers the range of the data and when it extends beyond that range." />
<figcaption>
Posterior predictive means for the two B spline models showing the interpolation and extrapolation behaviour when the penalty only covers the range of the data and when it extends beyond that range.
</figcaption>
</figcaption>
</figure>
<p>
As both these models used second derivative penalties, they both extrapolate linearly beyond the range of the training observations. Importantly, however, we get very different behaviour of the credible intervals, especially at the low end of <em>x</em>, where the wide interval is a better representation of the uncertainty we have in the extrapolated predictions. This is better behaviour; at least we're being honest about the uncertainty when extrapolating.
</p>
<h2 id="comparing-different-bases">
Comparing different bases
</h2>
<p>
So far, so uninteresting. Before we get to the good stuff and demonstrate other features of the B spline basis in <strong>mgcv</strong>, let's quickly compare the TPRS and B spline models with a Gaussian process smooth that is designed to closely match the data generating function. Note that this GP is fitted using <strong>mgcv</strong>, where we have to specify the length scale, and as such isn't meant to be directly comparable with either the exact or the low-rank GP models of <span class="citation" data-cites="Riutort-Mayol2020-ih">Riutort-Mayol et al. (2020)</span>.
</p>
<p>
In <strong>mgcv</strong> a GP can be fitted using <code>bs = "gp"</code>. When we do this, the meaning of the <code>m</code> argument changes. Here we are asking for a Matérn covariance function with ν = 3/2 and a length scale of 0.15. These values were chosen to match those of the true function.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m_gp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gp"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">0.15</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span></code></pre>
</figure>
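<p>
Other values of the first element of <code>m</code> select other correlation functions; see <code>?mgcv::smooth.construct.gp.smooth.spec</code> for the full list. As a hypothetical variant, not fitted in this post, a Matérn with ν = 5/2 and the same length scale would be
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"># m[1] = 4 selects a Matérn covariance with nu = 5/2;
# m[2] is again the length scale (range) parameter
m_gp_52 <- gam(y ~ s(x, k = 50, bs = "gp", m = c(4, 0.15)),
               data = filter(r_samp, data_set == "train"), method = "REML")</code></pre>
</figure>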
<p>
Again we have some wrangling to do to pull all these together into an object we can plot easily
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p_bs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_bs_extrap</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">p_tprs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_tprs2</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_tprs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_tprs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">p_gp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_gp</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_gp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_gp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">new_data_bases</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bind_cols</span><span class="p">(</span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">p_tprs</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs</span><span class="p">,</span><span class="w"> </span><span class="n">p_gp</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_longer</span><span class="p">(</span><span class="n">fit_tprs</span><span class="o">:</span><span class="n">se_gp</span><span class="p">,</span><span class="w"> </span><span class="n">names_sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'_'</span><span class="p">,</span><span class="w">
</span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'variable'</span><span class="p">,</span><span class="w"> </span><span class="s1">'spline'</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_wider</span><span class="p">(</span><span class="n">names_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">values_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">upr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w"> </span><span class="n">lwr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">))</span></code></pre>
</figure>
<p>
And finally we plot using
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_bases</span><span class="p">,</span><span class="w">
</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">spline</span><span class="p">),</span><span class="w">
</span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_set</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_bases</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">colour2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">spline</span><span class="p">),</span><span class="w">
</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename_geom_aes</span><span class="p">(</span><span class="n">new_aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"colour"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">,</span><span class="w"> </span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Data set"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_OkabeIto</span><span class="p">(</span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Basis"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_OkabeIto</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Basis"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_cartesian</span><span class="p">(</span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Extrapolating with splines"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"How behaviour varies with different basis types"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning: Ignoring unknown aesthetics: colour2</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/extrapolating-with-b-splines-and-gams-plot-mixed-spline-predictions-1.svg" alt="Posterior predictive means for three GAMs; a thin plate spline with 2nd derivative penalty, a B spline with 2nd derivative penalty extended over the interval for prediction, and a Gaussian process with a Matérn(ν = 3/2) covariance function with length scale = 0.15" />
<figcaption>
Posterior predictive means for three GAMs; a thin plate spline with 2nd derivative penalty, a B spline with 2nd derivative penalty extended over the interval for prediction, and a Gaussian process with a Matérn(ν = 3/2) covariance function with length scale = 0.15
</figcaption>
</figcaption>
</figure>
<p>
Clearly the GP gets closer to the test data when extrapolating, but that's not really a fair comparison as I told the model what the correct length scale was! We could try to estimate the length scale from the data by fitting models over a grid of likely values for the length scale parameter and using the model with the lowest REML score; a minimal sketch of the idea is shown below, and I have fuller example code in the supplements for <span class="citation" data-cites="Simpson2018-frontiers">Simpson (2018)</span> if you're keen.
</p>
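<p>
Here is that sketch (my illustration, not the code from those supplements; the grid of candidate length scales is arbitrary)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"># fit the GP smooth over a grid of candidate length scales and keep
# the fit with the smallest REML score
ls_grid <- seq(0.05, 0.5, by = 0.05)
fits <- lapply(ls_grid, function(ls) {
    gam(y ~ s(x, k = 50, bs = "gp", m = c(3, ls)),
        data = filter(r_samp, data_set == "train"), method = "REML")
})
# with method = "REML" the score is stored in the gcv.ubre component
reml <- vapply(fits, function(fit) as.numeric(fit$gcv.ubre), numeric(1))
m_gp_best <- fits[[which.min(reml)]]</code></pre>
</figure>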
<h2 id="more-with-b-splines">
More with B splines
</h2>
<p>
We're not restricted to using the second derivative penalty with B splines; we can use third, second, first, or even zeroth order penalties with cubic B splines. How does their behaviour vary when interpolating and extrapolating?
</p>
<p>
For convenience I'll just fit all three models with a common format, even though we've already seen and fitted the first model with the second derivative penalty. Notice how we specify the order of the derivative penalty via the second value passed to the argument <code>m</code>; <code>m = c(3, 1)</code> gives a first derivative penalty, <code>m = c(3, 0)</code> a zeroth derivative penalty, etc.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m_bs_2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span><span class="w">
</span><span class="n">m_bs_1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span><span class="w">
</span><span class="n">m_bs_0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span></code></pre>
</figure>
<p>
Again we repeat the data wrangling needed to get something we can plot
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p_bs_2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_bs_2</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_bs_2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_bs_2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">p_bs_1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_bs_1</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_bs_1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_bs_1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">p_bs_0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_bs_0</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_bs_0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_bs_0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">new_data_order</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bind_cols</span><span class="p">(</span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs_2</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs_1</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs_0</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_longer</span><span class="p">(</span><span class="n">fit_bs_2</span><span class="o">:</span><span class="n">se_bs_0</span><span class="p">,</span><span class="w"> </span><span class="n">names_sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'_'</span><span class="p">,</span><span class="w">
</span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'variable'</span><span class="p">,</span><span class="w"> </span><span class="s1">'spline'</span><span class="p">,</span><span class="w"> </span><span class="s1">'order'</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_wider</span><span class="p">(</span><span class="n">names_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">values_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">upr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w"> </span><span class="n">lwr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">))</span></code></pre>
</figure>
<p>
Note again how I'm defining the names of the columns containing fitted values and their standard errors to make it easy to pull out this data during the <code>pivot_longer()</code> step.
</p>
<p>
We plot the predicted values with
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_order</span><span class="p">,</span><span class="w">
</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">order</span><span class="p">),</span><span class="w">
</span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_set</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_order</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">colour2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">order</span><span class="p">),</span><span class="w">
</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename_geom_aes</span><span class="p">(</span><span class="n">new_aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"colour"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">,</span><span class="w"> </span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Data set"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_OkabeIto</span><span class="p">(</span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Penalty"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_OkabeIto</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Penalty"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_cartesian</span><span class="p">(</span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Extrapolating with B splines"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"How behaviour varies with penalties of different order"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/extrapolating-with-b-splines-and-gams-plot-b-spline-diff-penalties-1.svg" alt="Posterior predictive means for three GAMs using B splines with different orders of derivative penalty, all covering the region where we want to predict for the test samples; a B spline with 2nd derivative penalty, a B spline with 1st derivative penalty, and a B spline with zeroth derivative penalty." />
<figcaption>
Posterior predictive means for three GAMs using B splines with different orders of derivative penalty, all covering the region where we want to predict for the test samples; a B spline with 2nd derivative penalty, a B spline with 1st derivative penalty, and a B spline with zeroth derivative penalty.
</figcaption>
</figure>
<p>
The plot shows the different penalties leading to quite a wide range of behaviour. The spline with the zeroth order penalty interpolates poorly, seemingly heading towards the overall mean of the data during each of the test sections within the range of <em>x</em>. When extrapolating, we again see this 'mean reversion' behaviour, which means it does well when extrapolating for large values of <em>x</em>, but it does extremely poorly at the low end of <em>x</em>. The credible intervals for this model are also unrealistically narrow, like those of the TPRS model with 1st derivative penalty that we saw earlier on.
</p>
<p>
The model with the first derivative penalty has reasonable behaviour; it extrapolates as a largely flat function continuing from the minimum and maximum values of <em>x</em>, as with the TPRS fit with a first derivative penalty we saw above, but the credible intervals are much more realistic for the B spline than for the TPRS. Note also that the intervals for the B spline with the first derivative penalty don't explode as quickly as those for the B spline fit with the second derivative penalty.
</p>
<h2 id="multiple-penalties">
Multiple penalties
</h2>
<p>
One final trick that the B spline basis in <strong>mgcv</strong> has up its sleeve is that you can combine multiple penalties in a single spline. We could fit cubic B splines with one, two, three, or even four penalties. The additional penalties are specified by passing more values to <code>m</code>: <code>m = c(3, 2, 1)</code> would give a cubic B spline with both a second derivative and a first derivative penalty, while <code>m = c(3, 2, 1, 0)</code> would get you a cubic spline with all three penalties. You can mix and match as much as you like, with a couple of exceptions:
</p>
<ul>
<li>
you can only have one penalty for each order, so no, you can't penalise one of the derivatives more strongly by adding more than one penalty for it; <code>m = c(3, 2, 2, 1)</code>, for example, <em>isn't</em> allowed, and
</li>
<li>
you can only have values for <code>m[i]</code> (where <code>i</code> > 1) that exist for the given order of B spline, i.e. where <code>m[i] ≤ m[1]</code>.
</li>
</ul>
<p>
In the code below I fit two additional models with mixtures of penalties, and then compare these with the default second derivative penalty (fitted earlier). In each case, I'm again using the <code>knots</code> argument to extend the penalties over the range we might want to predict over.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m_bs_21</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning in smooth.construct.bs.smooth.spec(object, dk$data, dk$knots): there is
*no* information about some basis coefficients</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m_bs_210</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)),</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">data_set</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"train"</span><span class="p">),</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning in smooth.construct.bs.smooth.spec(object, dk$data, dk$knots): there is
*no* information about some basis coefficients</code></pre>
</figure>
<p>
Again, we do the same wrangling, this time encoding the mixtures of orders in the column names
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p_bs_21</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_bs_21</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_bs_21</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_bs_21</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">p_bs_210</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_tibble</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m_bs_210</span><span class="p">,</span><span class="w"> </span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="n">fit_bs_210</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se_bs_210</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="n">new_data_multi</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bind_cols</span><span class="p">(</span><span class="n">new_data</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs_2</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs_21</span><span class="p">,</span><span class="w"> </span><span class="n">p_bs_210</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_longer</span><span class="p">(</span><span class="n">fit_bs_2</span><span class="o">:</span><span class="n">se_bs_210</span><span class="p">,</span><span class="w"> </span><span class="n">names_sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'_'</span><span class="p">,</span><span class="w">
</span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'variable'</span><span class="p">,</span><span class="w"> </span><span class="s1">'spline'</span><span class="p">,</span><span class="w"> </span><span class="s1">'order'</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_wider</span><span class="p">(</span><span class="n">names_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">values_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">upr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w"> </span><span class="n">lwr_ci</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w">
</span><span class="n">penalty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="n">order</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"2"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"2"</span><span class="p">,</span><span class="w">
</span><span class="n">order</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"21"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"2, 1"</span><span class="p">,</span><span class="w">
</span><span class="n">order</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"210"</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"2, 1, 0"</span><span class="p">))</span></code></pre>
</figure>
<p>
The last step here uses <code>case_when()</code> to write out nicer labels for the penalties, giving a cleaner legend on the plot, which we produce with
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_multi</span><span class="p">,</span><span class="w">
</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upr_ci</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">penalty</span><span class="p">),</span><span class="w">
</span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r_samp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data_set</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">new_data_multi</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">colour2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">penalty</span><span class="p">),</span><span class="w">
</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename_geom_aes</span><span class="p">(</span><span class="n">new_aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"colour"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_brewer</span><span class="p">(</span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Set1"</span><span class="p">,</span><span class="w"> </span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Data set"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_OkabeIto</span><span class="p">(</span><span class="n">aesthetics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"colour2"</span><span class="p">,</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Penalty"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_OkabeIto</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Penalty"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_cartesian</span><span class="p">(</span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Extrapolating with B splines"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"How behaviour changes when combining multiple penalties"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/extrapolating-with-b-splines-and-gams-plot-b-spline-with-mixed-penalties-1.svg" alt="Posterior predictive means for three GAMs using B splines with mixtures of derivative penalties, all covering the region where we want to predict for the test samples; a B spline with single 2nd derivative penalty, a B spline with a 2nd and 1st derivative penalties, and a B spline with 2nd, 1st and 0th derivative penalties." />
<figcaption>
Posterior predictive means for three GAMs using B splines with mixtures of derivative penalties, all covering the region where we want to predict for the test samples; a B spline with single 2nd derivative penalty, a B spline with a 2nd and 1st derivative penalties, and a B spline with 2<sup>nd</sup>, 1<sup>st</sup> and 0<sup>th</sup> derivative penalties.
</figcaption>
</figure>
<p>
By mixing the penalties, we blend some of the behaviours of the individual penalties. For example, the weird interpolation behaviour of the B spline with the zeroth derivative penalty is essentially removed when it is combined with the second and first derivative penalties.
</p>
<p>
Given the data, the fits that essentially predict constant functions beyond the range of the data, but with wide credible intervals, are probably the most realistic; in each case where we used a B spline that included a first derivative penalty, the fit has at least covered most of the test observations beyond the range of <em>x</em>.
</p>
<p>
However, in none of the fits do we get behaviour that gets close to fitting the test observations beyond the range of <em>x</em> in the training data, even when using a Gaussian process that supposedly matches at least the general form of the true function.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Riutort-Mayol2020-ih">
<p>
Riutort-Mayol, G., Bürkner, P.-C., Andersen, M. R., Solin, A., and Vehtari, A. (2020). Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming. Available at: <a href="http://arxiv.org/abs/2004.11408">http://arxiv.org/abs/2004.11408</a>.
</p>
</div>
<div id="ref-Simpson2018-frontiers">
<p>
Simpson, G. L. (2018). Modelling palaeoecological time series using generalised additive models. <em>Frontiers in Ecology and Evolution</em> 6, 149. doi:<a href="https://doi.org/10.3389/fevo.2018.00149">10.3389/fevo.2018.00149</a>.
</p>
</div>
</div>
gratia 0.4.1 released
Gavin L. Simpson
2020-05-31T07:00:00-06:00
2020-05-31T07:00:00-06:00
https://www.fromthebottomoftheheap.net/2020/05/31/new-gratia-release/
<p>
After a slight snafu related to the 1.0.0 release of <strong>dplyr</strong>, a new version of <strong>gratia</strong> is out and available on CRAN. This release brings a number of new features, including differences of smooths, partial residuals on partial plots of univariate smooths, and a number of utility functions, while under the hood <strong>gratia</strong> works for a wider range of models that can be fitted by <strong>mgcv</strong>.
</p>
<h3 id="partial-residuals">
Partial residuals
</h3>
<p>
The <code>draw()</code> method for <code>gam()</code> and related models produces partial effects plots. <code>plot.gam()</code> has long had the ability to add partial residuals to partial plots of univariate smooths, and with the latest release <code>draw()</code> can now do so too.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">df1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_sim</span><span class="p">(</span><span class="s2">"eg1"</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">400</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="n">m1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x0</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x3</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df1</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span><span class="w">
</span><span class="n">draw</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">residuals</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/gratia-0-4-1-release-partial-residuals-1.png" alt="Partial plots of estimated smooth functions with partial residuals" />
<figcaption>
Partial plots of estimated smooth functions with partial residuals
</figcaption>
</figure>
<p>
If the estimated functions have the correct degree of wiggliness, the partial residuals should be approximately uniformly distributed about the estimated smooth.
</p>
<h3 id="simulating-data">
Simulating data
</h3>
<p>
The previous example demonstrated another new feature of the latest release: <code>data_sim()</code>. This is a reimplementation of <code>mgcv::gamSim()</code>, which is used to simulate data for testing GAMs. Data can be simulated from several widely-used functions that illustrate the power and capabilities of estimating smooth functions using penalised splines.
</p>
<p>
<code>data_sim()</code> returns simulated data in a tidy fashion and all the various example test data sets are returned in a consistent format. Also, data from the example functions can be simulated from a number of probability distributions. Currently the Gaussian, Poisson, and Bernoulli distributions are supported, but future versions will offer a wider range to simulate from.
</p>
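<p>
To simulate Poisson rather than Gaussian responses, say, we change the distribution via the <code>dist</code> argument. A minimal sketch (check <code>?data_sim</code> for the distributions your installed version supports):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("mgcv")
library("gratia")
## simulate Poisson responses from the Gu and Wahba example functions
df_pois <- data_sim("eg1", n = 400, dist = "poisson", seed = 42)
## fit the same four-smooth model with the matching family
m_pois <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = df_pois,
    family = poisson(), method = "REML")</code></pre>
</figure>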
<p>
For example, the response data modelled above came from the following four functions used by Gu and Wahba
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">df1</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">df1</span><span class="p">)))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">x0</span><span class="o">:</span><span class="n">x3</span><span class="p">,</span><span class="w"> </span><span class="n">f0</span><span class="o">:</span><span class="n">f3</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_longer</span><span class="p">(</span><span class="n">x0</span><span class="o">:</span><span class="n">f3</span><span class="p">,</span><span class="w"> </span><span class="n">names_sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"var"</span><span class="p">,</span><span class="w"> </span><span class="s2">"fun"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">pivot_wider</span><span class="p">(</span><span class="n">names_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">var</span><span class="p">,</span><span class="w"> </span><span class="n">values_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">f</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">fun</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/gratia-0-4-1-release-data-sim-1.png" alt="Gu and Wahba four term additive example functions" />
<figcaption>
Gu and Wahba four term additive example functions
</figcaption>
</figure>
<h3 id="difference-smooths">
Difference smooths
</h3>
<p>
When GAMs contain smooth-factor interactions, we often want to compare smooths between levels of the factor to determine how the smooth effects vary between groups. The new release contains a function <code>difference_smooths()</code> that implements this idea.
</p>
<p>
The <strong>mgcv</strong> example for factor-smooth interactions using the <code>by</code> mechanism can be simulated from using <code>data_sim()</code>. The model fitted to the data contains a smooth of covariate <code>x0</code> and a smooth of <code>x2</code> for each level of the factor <code>fac</code>. Note that we need the parametric effect for <code>fac</code> as the <code>by</code> smooths are all centred about 0; the parametric term models the different group means.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_sim</span><span class="p">(</span><span class="s2">"eg4"</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="n">m2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">fac</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x2</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fac</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x0</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span></code></pre>
</figure>
<p>
<code>difference_smooths()</code> returns differences between the smooth functions for all pairs of the levels of <code>fac</code>, plus a credible interval for the difference.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">sm_diffs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">difference_smooths</span><span class="p">(</span><span class="n">m2</span><span class="p">,</span><span class="w"> </span><span class="n">smooth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"s(x2)"</span><span class="p">)</span><span class="w">
</span><span class="n">sm_diffs</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 300 x 9
smooth by level_1 level_2 diff se lower upper x2
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 s(x2) fac 1 2 0.797 0.536 -0.253 1.85 0.00170
2 s(x2) fac 1 2 0.846 0.500 -0.135 1.83 0.0118
3 s(x2) fac 1 2 0.896 0.467 -0.0190 1.81 0.0219
4 s(x2) fac 1 2 0.945 0.435 0.0929 1.80 0.0319
5 s(x2) fac 1 2 0.994 0.405 0.200 1.79 0.0420
6 s(x2) fac 1 2 1.04 0.378 0.302 1.78 0.0521
7 s(x2) fac 1 2 1.09 0.354 0.397 1.78 0.0622
8 s(x2) fac 1 2 1.14 0.332 0.485 1.79 0.0722
9 s(x2) fac 1 2 1.18 0.314 0.566 1.80 0.0823
10 s(x2) fac 1 2 1.22 0.298 0.641 1.81 0.0924
# … with 290 more rows</code></pre>
</figure>
<p>
There is a <code>draw()</code> method for objects returned by <code>difference_smooths()</code>, which will plot the pairwise differences
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">draw</span><span class="p">(</span><span class="n">sm_diffs</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/gratia-0-4-1-release-plot-difference-smooths-1.png" alt="Differences between estimated smooth functions" />
<figcaption>
Differences between estimated smooth functions
</figcaption>
</figure>
<p>
Note that these differences exclude differences in the group means and the differences between smooths are computed on the scale of the link function. A future version will allow for differences that include the group means.
</p>
<h3 id="fitted-values-and-residuals-utility-functions">
Fitted values and residuals utility functions
</h3>
<p>
Two new utility functions are in the current release: <code>add_fitted()</code> and <code>add_residuals()</code>, which add fitted values and residuals, respectively, to a data frame of observations used to fit a model.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">df1</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">add_fitted</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">".fitted"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">add_residuals</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">".resid"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 400 x 12
y x0 x1 x2 x3 f f0 f1 f2 f3 .fitted .resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2.99 0.915 0.0227 0.909 0.402 1.62 0.529 1.05 0.0397 0 2.57 0.419
2 4.70 0.937 0.513 0.900 0.432 3.25 0.393 2.79 0.0630 0 3.91 0.788
3 13.9 0.286 0.631 0.192 0.664 13.5 1.57 3.53 8.41 0 12.9 1.03
4 5.71 0.830 0.419 0.532 0.182 6.12 1.02 2.31 2.79 0 6.57 -0.859
5 7.63 0.642 0.879 0.522 0.838 10.4 1.80 5.80 2.76 0 10.3 -2.67
6 9.80 0.519 0.108 0.160 0.917 10.4 2.00 1.24 7.18 0 9.23 0.571
7 10.4 0.737 0.980 0.520 0.798 11.3 1.47 7.10 2.75 0 11.2 -0.754
8 12.8 0.135 0.265 0.225 0.503 11.4 0.821 1.70 8.90 0 11.0 1.77
9 13.8 0.657 0.0843 0.282 0.254 11.1 1.76 1.18 8.20 0 11.5 2.28
10 7.51 0.705 0.386 0.504 0.667 6.50 1.60 2.16 2.74 0 6.71 0.792
# … with 390 more rows</code></pre>
</figure>
<h3 id="other-changes">
Other changes
</h3>
<p>
This release contains a number of other less-visible changes. <strong>gratia</strong> now handles models fitted by <code>gamm4::gamm4()</code> in more functions than before, while the utility functions <code>link()</code> and <code>inv_link()</code> now work for all families in <strong>mgcv</strong>, including the general family functions and those used for fitting location scale models.
</p>
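<p>
As a quick illustration of the two extractors, here is a minimal sketch using the Gaussian model <code>m1</code> from earlier, for which both functions are the identity
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">lf  <- link(m1)     # extract the link function from the model's family
ilf <- inv_link(m1) # ...and its inverse
ilf(lf(10))         # identity link for a Gaussian model, so returns 10</code></pre>
</figure>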
Rendering your README with GitHub Actions
Gavin L. Simpson
2020-04-30T14:30:00-06:00
2020-04-30T14:30:00-06:00
https://www.fromthebottomoftheheap.net/2020/04/30/rendering-your-readme-with-github-actions/
<p>
There's one thing that has bugged me for a while about developing R packages. We have all these nice, modern tools for tracking our code, producing web sites from the <strong>roxygen</strong> documentation, and so on. Yet for every code commit I make to the master branch of a package repo, there's often two or more additional steps I need to take to keep the package <code>README.md</code> and <em>pkgdown</em> site in sync with the code. Don't get me wrong; it's amazing that we have these tools available to help users get to grips with our R packages. It's just that there's a lot of extra things to remember to do to keep everything up to date. The development of free-to-use services such as Travis CI or Appveyor has been very useful as they can automate many of these repetitive tasks. A more recent newcomer to the field is <a href="https://github.com/features/actions">GitHub Actions</a>. The other day I was grappling with getting a GitHub Actions workflow to render a <code>README.Rmd</code> file to <code>README.md</code> on GitHub, so that I didn't have to do it locally all the time. After a lot of trial and error, this is how I got it working.
</p>
<p>
The general use case I am imagining here is the package author who has a <code>README.Rmd</code> file that contains R code chunks, which they want to render to <code>README.md</code> so it will get displayed nicely on GitHub. You might want to do this to provide a simple overview of how to use some key functionality of your package or show off a plot or two that can be generated by the package. It's pretty easy to render this locally with a <code>Makefile</code> or by simply invoking the correct R incantation directly in the terminal. However, wouldn't it be great if we could automate this!
</p>
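<p>
For reference, the local incantation I have in mind is a one-liner like the following, a sketch that assumes <strong>rmarkdown</strong> is installed and that <code>README.Rmd</code> declares a GitHub-friendly output format (such as <code>github_document</code>) in its YAML header
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## render README.Rmd to README.md using the output format in its YAML header
rmarkdown::render("README.Rmd")</code></pre>
</figure>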
<p>
The first step in getting this working was to recognise that the R Infrastructure organisation has been working to make R-related GitHub Actions workflows available to users. This effort has been led by Jim Hester, and Jim has very helpfully provided a workflow example YAML file showing how one might go about rendering a <code>README.Rmd</code> file to <code>README.md</code> using the <strong>rmarkdown</strong> package.
</p>
<p>
Also, the <strong>usethis</strong> package has made it incredibly easy to get started using GitHub Actions; <strong>usethis</strong> provides <code>use_github_actions()</code> to set your package up to start using GitHub Actions to check your package builds without errors. There's also a <code>use_github_action()</code> function that can add individual workflows from the <code>r-lib/actions</code> repo to your package.
</p>
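<p>
For example, to copy one of the example workflows into your package by name (the workflow file name here is illustrative; check the examples directory of <code>r-lib/actions</code> for the current names)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## copies the named workflow file into .github/workflows/
usethis::use_github_action("render-rmarkdown.yaml")</code></pre>
</figure>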
<p>
If you don't have <strong>usethis</strong> installed, install it (<code>install.packages("usethis")</code>), then you can set your R package repo up to run <code>R CMD check</code> on your package on GitHub's servers by running
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">usethis</span><span class="o">::</span><span class="n">use_github_actions</span><span class="p">()</span></code></pre>
</figure>
<p>
in an R session in the package root folder. Running this will produce something like the following
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">usethis</span><span class="o">::</span><span class="n">use_github_actions</span><span class="p">()</span><span class="w">
</span><span class="err">✔</span><span class="w"> </span><span class="n">Setting</span><span class="w"> </span><span class="n">active</span><span class="w"> </span><span class="n">project</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="s1">'/home/gavin/work/git/gratia/gratia'</span><span class="w">
</span><span class="err">✔</span><span class="w"> </span><span class="n">Creating</span><span class="w"> </span><span class="s1">'.github/'</span><span class="w">
</span><span class="err">✔</span><span class="w"> </span><span class="n">Adding</span><span class="w"> </span><span class="s1">'*.html'</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="s1">'.github/.gitignore'</span><span class="w">
</span><span class="err">✔</span><span class="w"> </span><span class="n">Creating</span><span class="w"> </span><span class="s1">'.github/workflows/'</span><span class="w">
</span><span class="err">✔</span><span class="w"> </span><span class="n">Writing</span><span class="w"> </span><span class="s1">'.github/workflows/R-CMD-check.yaml'</span><span class="w">
</span><span class="err">✔</span><span class="w"> </span><span class="n">Copy</span><span class="w"> </span><span class="n">and</span><span class="w"> </span><span class="n">paste</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">following</span><span class="w"> </span><span class="n">lines</span><span class="w"> </span><span class="n">into</span><span class="w"> </span><span class="s1">'/home/gavin/work/git/gratia/gratia/README.md'</span><span class="o">:</span><span class="w">
</span><span class="o"><!--</span><span class="w"> </span><span class="n">badges</span><span class="o">:</span><span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="o">--></span><span class="w">
</span><span class="p">[</span><span class="o">!</span><span class="p">[</span><span class="n">R</span><span class="w"> </span><span class="n">build</span><span class="w"> </span><span class="n">status</span><span class="p">](</span><span class="n">https</span><span class="o">://</span><span class="n">github.com</span><span class="o">/</span><span class="n">gavinsimpson</span><span class="o">/</span><span class="n">gratia</span><span class="o">/</span><span class="n">workflows</span><span class="o">/</span><span class="n">R</span><span class="o">-</span><span class="n">CMD</span><span class="o">-</span><span class="n">check</span><span class="o">/</span><span class="n">badge.svg</span><span class="p">)](</span><span class="n">https</span><span class="o">://</span><span class="n">github.com</span><span class="o">/</span><span class="n">gavinsimpson</span><span class="o">/</span><span class="n">gratia</span><span class="o">/</span><span class="n">actions</span><span class="p">)</span><span class="w">
</span><span class="o"><!--</span><span class="w"> </span><span class="n">badges</span><span class="o">:</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="o">--></span></code></pre>
</figure>
<p>
which outlines the steps <strong>usethis</strong> has taken on your behalf. The final lines print out some text that you can paste into the <code>README.Rmd</code> to show a status badge for the GitHub Action; in this case it will show whether or not your package passed <code>R CMD check</code> without error.
</p>
<p>
This also nicely illustrates how you might set things up by hand, of course, especially if you don't want to run <code>R CMD check</code> on each push.
</p>
<p>
GitHub Actions workflows are described by YAML configuration files that list the steps in the workflow. These files should be located in a <code>.github/workflows</code> folder in the package root. If all you want to do is render a <code>README.Rmd</code> to <code>README.md</code> you could just as easily create this folder yourself. I'm not sure why <strong>usethis</strong> also creates a <code>.gitignore</code> containing <code>*.html</code> in the <code>.github</code> folder, but if this is needed for what you're doing, go ahead and create it too.
</p>
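<p>
If you do go the manual route, creating the folder from R is a one-liner (a trivial sketch of what <strong>usethis</strong> does for you):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## create the nested workflows folder in the package root
dir.create(".github/workflows", recursive = TRUE)</code></pre>
</figure>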
<p>
To get set up quickly to render <code>README.Rmd</code> to markdown, you can now use <code>use_github_action("render-readme.yaml")</code>. This will copy the <code>render-readme.yaml</code> file from <a href="https://github.com/r-lib/actions/tree/master/examples">r-lib/actions/examples</a> to <code>.github/workflows/render-readme.yaml</code>. Alternatively, you can <code>touch .github/workflows/render-readme.yaml</code> and add what you need by hand.
</p>
<p>
This is what the contents of <code>render-readme.yaml</code> look like, at the time of writing, if you used <strong>usethis</strong> to create it:
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"><span class="na">on</span><span class="pi">:</span>
<span class="na">push</span><span class="pi">:</span>
<span class="na">paths</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">README.Rmd</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">Render README</span>
<span class="na">jobs</span><span class="pi">:</span>
<span class="na">render</span><span class="pi">:</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">Render README</span>
<span class="na">runs-on</span><span class="pi">:</span> <span class="s">macOS-latest</span>
<span class="na">steps</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v2</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">r-lib/actions/setup-r@v1</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">r-lib/actions/setup-pandoc@v1</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install rmarkdown</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e 'install.packages("rmarkdown")'</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Render README</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e 'rmarkdown::render("README.Rmd")'</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Commit results</span>
<span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">git commit README.md -m 'Re-build README.Rmd' || echo "No changes to commit"</span>
<span class="s">git push origin || echo "No changes to commit"</span></code></pre>
</figure>
<p>
The first bit under <code>on:</code> controls when the workflow is triggered. The way the example workflow is set up means it will only be triggered <em>if</em> a file matching the path <code>README.Rmd</code> is included in the commit when pushed to the repo. It's also worth noting that <em>until the workflow is actually triggered</em>, it won't show up in the <em>Actions</em> tab in your repo on GitHub; this caused me no end of grief until I figured out this GitHub Actions <em>feature</em>. To trigger this workflow, you need to edit <code>README.Rmd</code>, add and commit those changes using <code>git</code>, and then push the changes to GitHub.
</p>
<p>
That didn't suit my use case however; what if I change the package code in such a way that any output or plots produced by code in the <code>README.Rmd</code> would also change? In this case, I would have to needlessly tweak something in <code>README.Rmd</code> and push that change just to trigger rendering.
</p>
<p>
There's probably a better way to do this, such as setting <code>paths:</code> to a wildcard that would match <em>any</em> <code>.R</code> file in the <code>R</code> folder so the workflow would be triggered on any change to the package code (I sketch this below), but to just get something up and running I changed the <code>on:</code> part to read:
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"><span class="na">on</span><span class="pi">:</span>
<span class="na">push</span><span class="pi">:</span>
<span class="na">branches</span><span class="pi">:</span> <span class="s">master</span></code></pre>
</figure>
<p>
which indicates that the workflow should run for any push to the <em>master</em> branch of the repo.
</p>
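<p>
If you did want the wildcard approach I mentioned above instead, something like the following might work; this is a sketch I haven't tested, assuming GitHub's glob syntax where <code>R/**</code> matches any file under the <code>R</code> folder:
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"># untested sketch: trigger when package code or the README source changes
on:
  push:
    paths:
      - 'README.Rmd'
      - 'R/**'</code></pre>
</figure>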
<p>
The top-level <code>name:</code> element is how your workflow will be listed in the Actions tab in your repo. Set this to something short but descriptive so it is easy to filter the various outputs from workflows that are run on the GitHub Actions service.
</p>
<p>
All workflows contain one or more <em>jobs</em>, listed under the <code>jobs:</code> element. In the example YAML file, there is a single job listed as <code>render:</code>, which has a name, <code>Render README</code>.
</p>
<p>
The <code>runs-on</code> element indicates what system the job will be run on; here it is a macOS system. I'm not sure why the <em>r-lib/actions</em> example workflows all run on macOS systems. Anyway, they work, so no need to change that unless you need something specific.
</p>
<p>
The <code>steps:</code> section is where the stages of the job are defined.
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"> <span class="na">steps</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v2</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">r-lib/actions/setup-r@v1</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">r-lib/actions/setup-pandoc@v1</span></code></pre>
</figure>
<p>
Each of the <code>uses:</code> elements pulls in some pre-existing workflow steps that you can build upon to bootstrap the solution you need. For example, the <code>actions/checkout@v2</code> workflow contains everything you need to check out your repo and make it available to the current job. This is pretty fundamental; unless the GitHub Actions service can get at the code in your repo, it won't be able to do anything useful whatsoever.
</p>
<p>
The next two <code>uses:</code> are workflows provided by <em>r-lib/actions</em> that set up a working R installation (<code>r-lib/actions/setup-r@v1</code>) and the <strong>Pandoc</strong> library used by <strong>rmarkdown</strong> (<code>r-lib/actions/setup-pandoc@v1</code>).
</p>
<p>
After the <code>uses:</code> declarations, the YAML file includes a series of steps that describe commands that are run on the service. This is where the real action takes place.
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install rmarkdown</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e 'install.packages("rmarkdown")'</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Render README</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e 'rmarkdown::render("README.Rmd")'</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Commit results</span>
<span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">git commit README.md -m 'Re-build README.Rmd' || echo "No changes to commit"</span>
<span class="s">git push origin || echo "No changes to commit"</span></code></pre>
</figure>
<p>
Here we see three sets of commands that will be run
</p>
<ol type="1">
<li>
the first installs the <strong>rmarkdown</strong> package,
</li>
<li>
the second runs <code>rmarkdown::render()</code> on <code>README.Rmd</code> to render it, and
</li>
<li>
the third commits the rendered <code>README.md</code> file and pushes it to your repo, or echoes a comment if no changes are needed.
</li>
</ol>
<p>
Notice how the <code>run:</code> element for the last step has a <code>|</code> after <code>run:</code>. This indicates that this particular step involves multiple lines of commands to be executed one after another.
</p>
<p>
If you've not come across <code>Rscript</code> before, it's a way to use R like a scripting language, non-interactively. Here we're using the <code>-e</code> flag to tell Rscript what R code to run, rather than passing it a <code>.R</code> file to run.
</p>
<p>
Out of the box, these steps aren't going to be very useful for R package maintainers if the <code>README.Rmd</code> uses anything other than the base R installation and recommended packages. At the very least you are going to want to also install the R package you are documenting in the <code>README.Rmd</code>, plus any other packages you need for the <code>Rmd</code> that might not be dependencies of the package in the repo.
</p>
<p>
In my case, I just needed to install the <strong>gratia</strong> package alongside <strong>rmarkdown</strong>, so I changed that <code>run:</code> element to be
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install rmarkdown</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e 'install.packages(c("rmarkdown", "gratia"))'</span></code></pre>
</figure>
<p>
I also decided to change the <code>rmarkdown::render()</code> call; by default this will generate HTML output by rendering the <code>.Rmd</code> first to <code>.md</code> and thence to <code>.html</code>. As we don't need this latter step, I changed the <code>output_format</code> argument of <code>render()</code> to be <code>"md_document"</code>, so that element now looks like this
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Render README</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e 'rmarkdown::render("README.Rmd", output_format = "md_document")'</span></code></pre>
</figure>
<p>
Doing this means I don't also generate a <code>README.html</code> file (which might be why the <code>.gitignore</code> was created by <strong>usethis</strong> earlier?); keeping the <code>.gitignore</code> can't hurt given that it only excludes any <code>.html</code> files from a commit, so I left it alone.
</p>
<p>
I modified the <em>commit</em> step too. The default assumes you already have a <code>README.md</code> in the repo and that this is the only file you want to add to the commit. If you render any plots in the <code>.Rmd</code>, then you'll also want to add those to the commit. So, I added an explicit <code>git add</code> line prior to the commit, and also simplified the latter
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Commit results</span>
<span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">git add README.md man/figures/README-*</span>
<span class="s">git commit -m 'Re-build README.Rmd' || echo "No changes to commit"</span>
<span class="s">git push origin || echo "No changes to commit"</span></code></pre>
</figure>
<p>
As you can see, I used a wildcard to catch any figures created by the render. In the <code>README.Rmd</code> I used a setup chunk to set the <code>fig.path</code> <strong>knitr</strong> option so that any plots were generated in the <code>man/figures</code> folder and had the prefix <code>README-</code> prepended to the file name:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">knitr</span><span class="o">::</span><span class="n">opts_chunk</span><span class="o">$</span><span class="n">set</span><span class="p">(</span><span class="w">
</span><span class="n">fig.path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"man/figures/README-"</span><span class="w">
</span><span class="p">)</span></code></pre>
</figure>
<p>
The <code>man/figures</code> folder is a useful place to store figures generated like this as they'll be carried along with your R package and available on CRAN, where the <code>README.md</code> file is also displayed if present. This folder is also used if you generate and include figures in the package documentation using <strong>roxygen</strong>, for example.
</p>
<p>
I used the prefix <code>README-</code> so that I could limit what I was adding in the <code>git add</code> step of the workflow. I'm always a bit nervous when staging files for a commit and never use <code>git commit -a</code>, for example. This way I have a reasonable means of only adding plots that were created by rendering <code>README.Rmd</code>.
</p>
<p>
After these changes (and a few others as I was troubleshooting some issues) my workflow to render <code>README.Rmd</code> files looks like this
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"><span class="na">name</span><span class="pi">:</span> <span class="s">render readme</span>
<span class="c1"># Controls when the action will run</span>
<span class="na">on</span><span class="pi">:</span>
<span class="na">push</span><span class="pi">:</span>
<span class="na">branches</span><span class="pi">:</span> <span class="s">master</span>
<span class="na">jobs</span><span class="pi">:</span>
<span class="na">render</span><span class="pi">:</span>
<span class="c1"># The type of runner that the job will run on</span>
<span class="na">runs-on</span><span class="pi">:</span> <span class="s">macOS-latest</span>
<span class="na">steps</span><span class="pi">:</span>
<span class="c1"># Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v2</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">r-lib/actions/setup-r@v1</span>
<span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">r-lib/actions/setup-pandoc@v1</span>
<span class="c1"># install packages needed</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install required packages</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e 'install.packages(c("rmarkdown","gratia"))'</span>
<span class="c1"># Render README.md using rmarkdown</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">render README</span>
<span class="na">run</span><span class="pi">:</span> <span class="s">Rscript -e 'rmarkdown::render("README.Rmd", output_format = "md_document")'</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">commit rendered README</span>
<span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
<span class="s">git add README.md man/figures/README-*</span>
<span class="s">git commit -m "Re-build README.md" || echo "No changes to commit"</span>
<span class="s">git push origin master || echo "No changes to commit"</span></code></pre>
</figure>
<p>
This is a first pass at getting something working; it's just occurred to me that the <code>git add</code> line probably needs to be linked with the <code>git commit</code> line so it only tries to commit if files were staged with <code>git add</code>.
</p>
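<p>
A minimal sketch of that idea, chaining the two commands with <code>&&</code> so the commit is only attempted if the staging succeeded (untested; the fallback <code>echo</code> still catches the nothing-to-commit case):
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml">    - name: commit rendered README
      run: |
        git add README.md man/figures/README-* && git commit -m "Re-build README.md" || echo "No changes to commit"
        git push origin master || echo "No changes to commit"</code></pre>
</figure>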
<p>
It would also be good to try to cache the installed packages so the workflow doesn't need to install everything for <strong>rmarkdown</strong> and <strong>gratia</strong> every time it is run. There's an example of caching packages in the <strong>pkgdown</strong> action <a href="https://github.com/r-lib/actions/blob/master/examples/pkgdown.yaml">r-lib/actions/examples/pkgdown.yaml</a>. However, I was running into issues related to the R 4.0.0 release and packages in the cache not getting refreshed even though they were out of date. So I removed that step from my <code>pkgdown.yaml</code> workflow, and as a result didn't try to implement it for rendering <code>README.Rmd</code> files. Yet anyway…
</p>
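<p>
For completeness, a hedged sketch of what such a caching step might look like, based on the <code>actions/cache</code> action. This assumes the workflow also sets the <code>R_LIBS_USER</code> environment variable to point at the package library (as the <em>r-lib/actions</em> examples do), and given the staleness issues I mention above, treat it as a starting point only:
</p>
<figure class="highlight">
<pre><code class="language-yaml" data-lang="yaml"># sketch only: cache the R package library between workflow runs;
# assumes R_LIBS_USER is set in the workflow's env: section
- name: Cache R packages
  uses: actions/cache@v2
  with:
    path: ${{ env.R_LIBS_USER }}
    key: macOS-r-${{ hashFiles('DESCRIPTION') }}
    restore-keys: macOS-r-</code></pre>
</figure>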
<p>
For reference, the workflow takes between two and three minutes to run on GitHub, even without package caching, which isn't too bad, but rendering the <code>README.Rmd</code> locally takes only a few seconds, so there's lots to be gained here by figuring out a reliable caching mechanism.
</p>
<p>
If you have implemented something similar for a GitHub Actions workflow, let me know in the comments below; this is all quite new to me and I'm interested in how other people might have tackled this. Now that I have this working reliably I only need to remember to <code>git pull</code> from GitHub more often to get the changes to <code>README.md</code>. The next issue I want to look at is getting the right <code>paths:</code> settings so the <code>README.Rmd</code> is rendered only when relevant files are changed in the package, not on every push to the repo.
</p>
<p>
Lastly, a big <strong>thank you</strong> to Jim Hester and everyone else who's contributed to the R-related GitHub Actions workflows. This is an amazingly useful service for the R Community, and I for one am incredibly thankful that we have such helpful and knowledgeable people among us who are doing all this great work to make developing R packages that much easier.
</p>
What evaluating Discovery Grants for the last three years has taught me
Gavin L. Simpson
2020-02-26T00:00:00-06:00
2020-02-26T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2020/02/26/what-three-years-evaluating-discovery-grants-taught-me/
<p>
For the last three years I have been a member of NSERC's Discovery Grant Evaluation Group for Ecology and Evolution (that's 1503 in NSERC-speak). In that time I've evaluated over 130 Discovery Grant submissions, read the same number of Canadian CCVs, and even chaired a few evaluations. This is what I learned, through this process, about writing a successful Discovery Grant.
</p>
<p>
Discovery Grants (hereafter DGs) are an odd fish; they're programme grants, not project grants, intended to fund the next five years of an applicant's research programme in the natural sciences or engineering. They are framed around a few short-term objectives against which streams of activity are proposed to address the long-term goals of the research programme. They describe activities that will be completed by Highly Qualified Personnel (HQP), NSERC-speak for basically anyone that receives training from the applicant and isn't leading their own research programme, and the environment in which, and philosophy by which, that training will take place. Finally, DGs have relatively high success rates, around 60% depending on what group of applicants you fall into, but are typically of low monetary amounts<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>, and as such applicants will rarely get anywhere near the amount of money they request for the work that is budgeted for in the proposal.
</p>
<p>
DGs are evaluated on three key components:
</p>
<ol type="1">
<li>
Excellence of the Researcher (EoR) – how highly is the applicant rated in terms of research excellence, accomplishments, and service?
</li>
<li>
Merit of the Proposal (MoP) – how highly is the applicant's proposed programme of work rated? and
</li>
<li>
Training of HQP (confusingly just HQP) – how highly is the past training record and proposed training plan and philosophy rated?
</li>
</ol>
<p>
Each of these components is assigned a rating (from highest to lowest)
</p>
<ul>
<li>
Exceptional
</li>
<li>
Outstanding
</li>
<li>
Very Strong
</li>
<li>
Strong
</li>
<li>
Moderate
</li>
<li>
Insufficient
</li>
</ul>
<p>
Each rating is described by the Merit Indicators in what NSERC and Evaluation Group members all call “The Grid”. <a href="https://www.nserc-crsng.gc.ca/_doc/Professors-Professeurs/DG_Merit_Indicators_eng.pdf">The Grid</a> itself is a single sheet of paper with brief descriptions of what the Evaluation Group is looking for to assign a proposal to each rating for each of the three components described above. The Grid is supported by the <a href="https://www.nserc-crsng.gc.ca/NSERC-CRSNG/Reviewers-Examinateurs/IntroPRManual-IntroManuelEP_eng.asp">Peer Review Manual</a>, which has fuller descriptions of what Evaluation Group members are looking for when they assign ratings.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/my-trusty-copy-of-the-grid.jpg" alt="My copy of The Grid from the 2020 Discovery Grant Competition week in Ottawa, February 2020" />
<figcaption>
My copy of <em>The Grid</em> from the 2020 Discovery Grant Competition week in Ottawa, February 2020
</figcaption>
</figure>
<p>
The Grid and the merit indicators are exceptionally important in evaluating DGs. They ensure that all applicants are treated fairly and objectively, in the same way. They focus Evaluation Group members' assessments on the criteria that NSERC is interested in, not each member's individual criteria for what makes a good proposal.
</p>
<h2 id="how-we-practically-assess-discovery-grants">
How we practically assess Discovery Grants
</h2>
<p>
If you are familiar with the conference panel review system NSERC uses to evaluate DGs, you might want to skip this next section and <a href="#what-makes-a-good-discovery-grant">jump to the part</a> where I explain what we, as Evaluation Group members, are looking for in a good DG.
</p>
<p>
Typically, each DG is read by five members of the Evaluation Group, known as the <em>Readers</em>, each of whom will ultimately provide ratings for the three assessed components. The final rating for each component is the <em>median</em> of the ratings from the five Readers, and this final rating determines which funding bin the DG application ends up in. Evaluation Group members don't decide how much money each DG is awarded; the dollar amounts attached to each bin are ultimately determined by the NSERC staff in the weeks after the Evaluation Group has concluded its activities, and depend on the available budget for each Evaluation Group.
</p>
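<p>
As this is an R blog, a toy illustration of that median rule; the votes below are made up, and I'm simply treating the ratings as an ordered factor so the median of the five votes can be computed:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## the six possible ratings, lowest to highest
the_grid <- c("Insufficient", "Moderate", "Strong", "Very Strong",
              "Outstanding", "Exceptional")
## five hypothetical Reader votes for one component
votes <- factor(c("Strong", "Very Strong", "Strong", "Outstanding", "Strong"),
                levels = the_grid, ordered = TRUE)
## with five Readers, the median is always one of the six ratings
the_grid[median(as.integer(votes))]
#> [1] "Strong"</code></pre>
</figure>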
<p>
The actual evaluation of each DG takes place during a single week in the middle of February<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>. For 1503, we had three rooms running almost continually from 0830 to 1700 each day, in which the five Readers for each DG would spend 15 minutes discussing the merits of each application before voting their final ratings. Each room has a Chair (an Evaluation Group member who oversees each DG evaluation and facilitates the discussion) and an NSERC Programme Officer (who oversees the process, provides input on areas of procedure and policy, and keeps the whole process on track). The Chair and the Programme Officer are there to ensure that all DGs are treated the same way; 15 minutes for discussion, evaluated in terms of The Grid, etc. From time to time, other Programme Officers, Team Leads, and other NSERC staff that oversee the different Evaluation Groups, and the overall Chair for 1503, would sit in on evaluations for periods of time to ensure fairness across the three 1503 rooms and across the various Evaluation Groups.
</p>
<p>
The actual evaluation is a pretty frenetic affair. At the start of the 15 minute evaluation the name of the applicant is announced and the Chair asks if there are any Delays (valid delays to an applicant's activities, such as parental leave, illness, or caring for a dependent, are taken into account when assessing EoR) and any nomination for a DAS (Discovery Accelerator Supplement). I won't discuss DAS nominations here, but if there is a nomination the five Readers also need to discuss and vote on the DAS nomination within the 15 minute evaluation period; knowing that there is a nomination upfront ensures that the Chair leaves enough time for these additional deliberations.
</p>
<p>
Next each Reader, in turn, gives their preliminary ratings for the three components. Then the first Reader (R1) has 4–5 minutes to justify their ratings. R1 will typically hit upon the main evidence supporting their evaluation and hence has a little longer to make their case. Then R2 has a couple of minutes to explain their scores; R2 will typically focus on areas where they might differ from R1 in terms of their rating, or provide examples of additional factors justifying their own rating if they agree with R1. Usually, the Chair will then briefly intercede, identifying areas of disagreement in the preliminary ratings so that R3, R4, and R5 can focus their brief comments (typically just a minute or 90 seconds each) on any areas of disagreement.
</p>
<p>
Once each Reader has given their comments and justification, the remaining time is given over to discussing areas where the Readers might disagree on the ratings. The aim here is not to come to consensus across the five Readers, but to ensure that sufficient consideration is given to differences of opinion among the Readers. Throughout, the room Chair will be making notes and will facilitate the discussion by referring the Readers to The Grid, trying to focus attention on the specific merit indicators based on their interpretation of the language being used by the Readers. The room Chair also ensures that each Reader has a chance to speak or comment so that everyone's voice is heard. Once we hit 13 or 14 mins, the room Chair will bring the discussion to a close and ask the Readers to vote.
</p>
<p>
Voting is done on one of about eight laptops arranged around the room, and proceeds in private and anonymously; a Reader is not required to stick to their preliminary ratings, but is free to do so if they wish and nobody, not even the Programme Officer, knows the way an individual Reader ultimately votes<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a>. Once all the ratings have been entered, the final score (Outstanding-Strong-Strong for example for EoR-MoP-HQP) is announced by the Programme Officer. If needed, a few house-keeping activities are attended to (such as Messages to Applicants for anyone in receipt of a rating of Moderate or Insufficient on one or more components). Then the whole process starts again for the next applicant, often with one or more Readers changing up and often moving between rooms. If there is a DAS nomination, the whole evaluation described above takes place in about 11 or 12 minutes, leaving a couple of minutes for the DAS discussion and voting before the evaluation concludes.
</p>
<h2 id="what-makes-a-good-discovery-grant">
What makes a good Discovery Grant?
</h2>
<p>
Fifteen minutes doesn't sound like a lot of time to evaluate a proposal, even one as short as a DG. It isn't, but each Evaluation Group member will have spent the previous two months reading each of the DGs they were assigned (I had about 45 DGs to evaluate this year (2020), 4 of which were for other EGs), preparing notes to support their assessments, as well as taking part in calibration exercises. The aim of the 15 minute evaluation is to provide time for Readers to justify their ratings and consider the input of the other Readers before giving their final ratings. So, during the preceding two months, what were Readers looking for when evaluating a DG?
</p>
<p>
The advice below is just that; <em>my advice</em>. Nothing here is official NSERC policy or guidance. Treat what I write below accordingly…
</p>
<h3 id="excellence-of-the-researcher">
Excellence of the Researcher
</h3>
<p>
Here, Evaluation Group members are looking at the applicant and their research and service activities, plus their recognitions and accomplishments.
</p>
<p>
Readers are looking at, <em>inter alia</em>,
</p>
<ul>
<li>
the publication record of the applicant; what papers the applicant has published, where, and what impact they had,
</li>
<li>
where the applicant has presented their work, to whom? Was it an invited talk or a keynote?
</li>
<li>
whether the applicant is on the editorial boards of any journals, has served on committees or for scholarly societies, organized conferences, conference sessions, or workshops, or served as an expert witness,
</li>
<li>
whether the applicant has received any recognitions, awards, etc,
</li>
<li>
the funding record of the applicant,
</li>
<li>
etc.
</li>
</ul>
<p>
We're not bean-counting here; while a lot of this information is gleaned from the Canadian Common CV (CCCV), we're trying to evaluate the <em>quality</em> of research outputs, not the <em>quantity</em>, plus the <em>quality</em> of the service to the research community. In this regard, the applicant can help their Readers by highlighting important outcomes of their work and <em>providing evidence</em> for impact in the <em>Most Significant Contributions</em> section of the proposal. Most importantly, your Readers need <em>evidence</em> of the excellence or impact of your contributions; if you only quote bibliometric data at us, we aren't going to be able to weigh that properly as evidence. Citation rates vary from (sub-)field to (sub-)field and your Readers are not all going to be familiar with the field in which you work. Help them understand how great you are by giving specific examples of impact: if your paper has influenced researchers in broader fields, tell us; if your work led to a new paradigm, explain how; if your work resulted in actionable conservation management outcomes, point out where; if a contribution led to a new collaboration, an invitation to give a talk, or to join a committee or working group, point this out.
</p>
<p>
Readers are not just looking at research activities; service to the research community is equally important, so tell your Readers about the societies you serve, the committees you joined, and the activities you organized or contributed toward.
</p>
<p>
When completing your <em>Most Significant Contributions</em> section, bear in mind that you don't have to give five contributions; that's just the maximum. If you have three themes to your contributions, present this information as three groups of papers/contributions and use the space you're given accordingly.
</p>
<p>
As Evaluation Group members, we're conscious that author order norms are not consistent across disciplines, and that many applicants will have publication records that reflect a high degree of collaboration in their research programme. This is fine and we really do want to give you credit for your contributions, but you need to explain this to us; Evaluation Group members are not allowed to give people the benefit of the doubt about researcher contributions. If you are regularly in the middle of many authors on your papers, or routinely don't take the senior/first author position, then tell us why and explain your contributions to these papers, otherwise we have no evidence you're leading research or what your contribution was.
</p>
<p>
You give this extra background information in the <em>Additional Information on Contributions</em> section of the proposal. Use this section, giving specific examples, to provide additional information on where you publish papers and why, and what your contributions were where this is not clear from typical norms (first/last author, for example). You can reference your CCCV papers by number in this section, e.g. <code>[1]</code>, or <code>[J1]</code> and <code>[C2]</code> if you have both papers and book chapters, but include a note to say what your system is so your Readers know. You don't have a lot of space in this section so use it well; assume your Readers know nothing about you and what author order means in terms of your contributions.
</p>
<h3 id="merit-of-the-proposal">
Merit of the Proposal
</h3>
<p>
In my experience this is the area where many applicants do themselves few favours.
</p>
<p>
It is important to realize that some, if not most, of your Readers are not going to be subject-matter experts in the area you are writing your proposal on. All your Readers will, however, know what constitutes good research design, clear exposition, etc. Write your proposal section with this in mind; you're writing for researchers, but not necessarily someone in your specific sub-field of ecology or evolution.
</p>
<p>
Write clearly and concisely; use your space well.
</p>
<p>
Readers are looking for four main things. First, we're evaluating whether the research you propose is <em>original</em> and <em>innovative</em>, and what we anticipate the <em>impact</em> on the applicant's (sub-)field will be. There's even a section addressing <em>impact</em> that you're asked to add (usually at the end of the <em>Proposal</em> section). Don't oversell the impact of your work; not everything is going to be paradigm changing, but you can help yourself by clearly articulating what you anticipate the impact of this work will be and why.
</p>
<p>
Second, we're looking to see if you have described the long-term goal of your research programme; this is the thing you envision working toward over two or more DG rounds. Readers will also be looking to see whether your short-term objectives are <em>given</em>, whether they are <em>feasible</em>, and <em>how well they mesh with the long-term goal</em>.
</p>
<p>
Short-term objectives are the things you will work on in this DG proposal. As such, Readers need to understand how the objectives will help you make progress in achieving the long-term goal of your programme. We need to see that these objectives are not just clearly described but are <em>planned</em> and <em>well defined</em>. This is where good grant writing can help; the more clearly you articulate what the short-term objectives are and how you intend to achieve them, the more highly you can score on MoP. What theoretical framework are you working under or plan to develop? What are the specific hypotheses you will test? Tie this back into the <em>impact</em> section so we can understand how attaining your objectives will lead to impact and advances in your field/area.
</p>
<p>
The third thing we're looking for is how well the methods you propose to use will enable you to tackle the objectives. If you are doing experiments, tell me how many samples you'll collect, how many replicates (please don't just say <code>n=3</code> and be done with it), what treatment levels you'll use and why those levels. If you're doing observational work, tell me why you want to work where you propose to work, what the pressure gradient is and how you'll measure the pressure. If you're working with species, why those species and not others? Why this system? Why are you using this method over competing methods? How will you analyse your data? (Don't just rattle off a list of stats methods you'll apply!)
</p>
<p>
Think about the appropriateness of the techniques you plan to use because you will have Readers who are familiar with the methods and who will call you out if they are inappropriate or call into question whether you can achieve your objectives.
</p>
<p>
Detail helps, but it has to be balanced with the needs of other areas of the <em>Proposal</em> section. Use detail where needed to hit the Merit Indicators; methods should be <em>clearly described</em> (or <em>clearly defined</em> for Exceptional) and <em>appropriate</em> according to The Grid. Try to think about what a non-expert might need to read in order to assess this.
</p>
<p>
The fourth main area is easy to resolve and doesn't cost you any space in the <em>Proposal</em> section; you can't get money from two or more sources for doing the same thing. The emphasis is on you, the applicant, to explain how what you're asking for in the DG is distinct from other funding sources you hold or have applied for. There is a separate section, <em>Relationship to Other Research Support</em>, where you write to <em>each</em> of the grants <em>in progress</em> on your CCCV and explain how they differ from what you propose to do in the DG. If there is overlap, explain how, and demonstrate why you're not asking for those funds; if you have funding from elsewhere to collect some data that you'll use in support of an activity in the DG, then explain this. Perhaps you have funding for 50 samples but your DG requires 200; state you are asking for an additional <em>150</em> samples in the DG (and why you need these additional samples) and only budget for 150 in the <em>Budget Justification</em> section. All of this also applies to funding you have applied for but, at the time of submitting your DG application, don't have a decision on.
</p>
<p>
If you are holding or applying for CIHR or SSHRC grants you <em>must</em> declare this (there's now a box to tick to indicate that you have or have applied for such funding) <em>and</em> include the required budgetary details and descriptions of the grants. If you tick the box, the Research Portal shouldn't let you submit your DG without attaching the relevant information to your DG application. Check the instructions!
</p>
<p>
This is an incredibly important point. This is one of the few areas of the evaluation where Readers can instantly decide that the entire MoP rates Insufficient (and effectively scupper your grant) regardless of how groundbreaking your proposed research will be. If there's uncertainty, you can be sure Readers will spot it and question it, usually ahead of time so that other NSERC people can be in the room to advise the Readers in their discussions. You <em>really</em> don't want your Readers debating funding overlap instead of the cool science you propose to do; take the time to get this right and don't just say there's no overlap, explain why there isn't!
</p>
<p>
As we're evaluating the MoP, Readers will be looking for where the HQP you propose to train will fit in to the programme. Think carefully about the feasibility and appropriateness of the activities or projects you assign to particular HQP. If you propose to do something that requires a PhD student, don't allocate it to an Honours student!
</p>
<p>
Here are a few more tips for things to do or avoid when writing your <em>Proposal</em> section:
</p>
<ul>
<li>
Don't repeat verbatim things in the <em>Recent Progress</em> section that you've already covered in the <em>Most Significant Contributions</em>. Make reference to the other section as needed.
</li>
<li>
Don't spend too much space on the literature review; Readers and external reviewers will spot if you haven't included recent research or ideas, but we don't need page after page of review. In the proposal section we're evaluating what you plan to do, not what you or someone else already did.
</li>
<li>
Clearly identify which HQP will do which activities. Try your hardest to simplify the way you refer to projects and HQP. Readers are going to have a hard time if you have <em>Project 1a ii)</em> assigned to MSc4, PhD1, and BSc4–10; what was <em>Project 1a ii)</em> again? And what are those BSc people doing, and how are MSc4's and PhD1's contributions different?
</li>
<li>
Do use a figure or table if it helps articulate aspects of the proposed research.
</li>
<li>
Use a number citation system like that used in a Science or Nature paper; it will save you a lot of space given the 5-page limit to the proposal.
</li>
<li>
You can save space on references by referring to your CCCV publications by number and only including those extra references that aren't on the CCCV in the reference list you can supply. A common technique is to state early on that refs 1–33 refer to your CCCV and 34+ are listed on the references page, for example.
</li>
<li>
Don't think you need to have loads of objectives and many projects under each objective; successful proposals can have just a couple of objectives with a couple of well-described projects assigned to each. Sometimes less really is more.
</li>
</ul>
<h3 id="training-of-highly-qualified-personnel">
Training of Highly Qualified Personnel
</h3>
<p>
NSERC, like the other Tri-Agencies, is invested in training highly qualified people, and a successful DG application will have to hit a number of criteria to do well on this rating.
</p>
<p>
There are two areas that Readers consider here:
</p>
<ol type="1">
<li>
the applicant's past track record of training HQP, and
</li>
<li>
the applicant's training philosophy and training plan.
</li>
</ol>
<p>
The past track record speaks to previous HQP that you have trained and the extent to which those HQP have moved on to successful positions that use the skills they learned. Again, this is not a numbers game and quality trumps quantity, but you do need to demonstrate a track record. If you are early in your career and don't have much of a record, be honest and include what you have, including current trainees, on the CCCV. In the <em>Past Contributions to HQP Training</em> section you can explain your training record and point out if you have some past experience, perhaps informally as a post-doc; but remember the mantra and show us the <em>evidence</em>.
</p>
<p>
In your CCCV do indicate where your listed HQP are now and what they are doing. You can also discuss this in the <em>Past Contributions to HQP Training</em> section, highlighting particular past trainees, perhaps to indicate if those trainees got awards or prestigious scholarships. If a trainee withdraws from their programme, don't leave it up to the Reader to infer why; tell us. This section is also a good place to indicate if HQP are publishing and to highlight HQP contributions to those publications on your CCCV. Also give numbers of presentations given by HQP and perhaps highlight an important talk that they gave or a best talk or poster award they may have received.
</p>
<p>
Your past contributions are also assessed in terms of the training environment you provide to HQP; exactly where in the various sections on philosophy, training plans, and past contributions to HQP you put this is up to you, but do describe the environment in which your HQP training takes place and what facilities and opportunities are afforded to HQP that you train. If there is a particularly innovative course or workshop run at your institution, tell your Readers about it.
</p>
<p>
The other half of the rating for HQP is based upon your approach to training (your <em>Training Philosophy</em>) and the training plans for individual HQP. I've already mentioned that it is important to clearly indicate which trainees are doing which aspects of the proposed research, and that you need to assign HQP to appropriate tasks given their career stage. This is where a clear <em>Proposal</em> section that ties in nicely to your <em>HQP Training Plan</em> section can really help you. Don't duplicate extensive information in more than one section, but do refer between the <em>Proposal</em> and the <em>HQP Training Plan</em> sections.
</p>
<p>
The training plans should also include information about how you actively train HQP in the various lab, field, taxonomic, soft, and transferable skills appropriate to your lab or setting. Do you teach data analysis, or science communication? Do you have lab meetings, and how often? Here's where you describe these more generic items that cut across multiple HQP trainees. You need to have information on the individual training plans for specific HQP (this includes the projects they'll do in the <em>Proposal</em>) as well as on these more general skills.
</p>
<p>
Your <em>Training Philosophy</em> refers to your approach to HQP training. Are you hands-on or do you favour a looser working relationship with your HQP? Do you prefer a small lab or a larger lab of trainees? And how do you manage that? Do you have senior HQP (PDFs) helping to train more junior members, for example?
</p>
<p>
Everyone holds lab meetings, helps their HQP publish, and sends HQP to conferences so that they may present their research. What is it that you do that is unique or different?
</p>
<p>
The final component of the HQP section is the EDI (Equity, Diversity, Inclusion) statement, which is new this year as a requirement. It forms part of the <em>training philosophy and training plan</em> half of the HQP rating.
</p>
<p>
What are Readers looking for on EDI? First, we are asked to look for some indication that the applicant understands what the barriers to entry and challenges in recruitment are for underrepresented groups in the applicant's particular field of research <em>and</em> at the applicant's institution. Again, provide evidence to support your assertions; reach out to your Faculty, Research Office, or EDI person/office at your institution to get specific information on challenges at your institution, and consult the literature or relevant scholarly societies for evidence to support your statement regarding your field of research.
</p>
<p>
Secondly, Readers will be looking for specific actions or activities that you have done, and/or will do, to support recruitment of underrepresented groups to your lab and to provide all HQP that you supervise with an inclusive environment for their training. As always, give evidence and be specific, providing detail. Have you taken unconscious bias training? Are HQP positions advertised broadly, with specific attempts to advertise via outlets that specialize in or cater to particular groups? Do you have a Code of Conduct for your lab?
</p>
<p>
In 2019 NSERC asked for the EDI statement to be included in the proposal, though they didn't require it, and many people didn't include anything on EDI in their DG application. This year it is a requirement and there are specific sections on The Grid that Readers can use to evaluate it. It's a soft requirement though; you don't need to include it, but if you don't you'll get an Insufficient rating for that element of the <em>Training Philosophy and Research Training Plan</em> component of the overall HQP rating. That usually won't be enough to pull an applicant down one entire bin (i.e. if everything else had you at a solid Strong for HQP then, all else being equal, the missing EDI statement shouldn't pull you down to Moderate) and also isn't sufficient to render an overall rating of Insufficient for HQP either. Where it can make a difference is if you are borderline for a particular rating: a low Strong rating could be pulled down to a Moderate if all or parts of the EDI criteria are missing, while a high Very Strong could get pulled up to an Outstanding rating if the applicant does a good job with the EDI statement.
</p>
<h3 id="early-career-researchers-and-hqp">
Early career researchers and HQP
</h3>
<p>
A note on ECRs: as Readers, we aren't supposed to consider any element of the DG evaluation <em>in terms of the applicant's career stage</em>. This may seem unfair to ECRs; how could they possibly have any track record of training HQP if they are just starting out in their first academic position<a href="#fn4" class="footnote-ref" id="fnref4"><sup>4</sup></a>? Well, it is unfair, and NSERC recognizes this.
</p>
<p>
It is not uncommon for an ECR to warrant a rating of Insufficient for their record of past HQP training. However, as long as an ECR provides a good training plan and training philosophy section, this will be enough to pull them up to a Moderate rating overall for HQP. For ECRs <em>only</em>, NSERC will fund down to the Strong-Strong-Moderate bin; assuming they rated well on their EoR and MoP sections, an ECR will not be unfairly treated by a non-existent or relatively poor HQP track record.
</p>
<p>
Furthermore, currently NSERC gives ECRs that receive funding an extra $5,000 per year plus a one-time amount (the value of which I can't quite recall just now) to help kick-start their DG careers, plus the option of a sixth year of funding at their level if they wish.
</p>
<p>
Other tips for preparing a good HQP section:
</p>
<ul>
<li>
Do follow the instructions and indicate HQP co-authors with a <code>*</code> on the CCCV.
</li>
<li>
The only presentations you should list on the CCCV are the ones where you were the presenting author.
</li>
<li>
Do <strong>not</strong> list presentations given by HQP as presenting authors on your CCCV, but do indicate if they are co-authors on any of the talks you presented, again with an <code>*</code>.
</li>
<li>
Don't use the <em>Academic Advisor</em> role for HQP to pad your numbers. If you do have a number of trainees where your supervision was not a strict Primary- or Co-supervision role, then you can use this <em>Academic Advisor</em> role, but you must do a good job of explaining your role in the training of those HQP and what particular skills or training you contributed yourself. Don't use this as a way to add HQP to your CCCV where you were on a supervisory committee without justifying this and giving evidence of your contributions, as we all sit on graduate committees. If you went above and beyond as a committee member, then this might be a good reason to include that HQP on your CCCV, but you will need to clearly explain why your supervision was important.
</li>
<li>
It's OK to not have identified HQP by name in the <em>Proposal</em> or <em>HQP Training Plan</em> sections, but do be clear when you refer to particular HQP so Readers can clearly identify who is doing what; use PhD1, MSc2, etc. instead.
</li>
<li>
Training is valued at all levels; it doesn't matter if you haven't trained any PhDs or MScs, as some departments and programmes do not offer graduate degrees. NSERC is fair to all institutions and rewards training activities at all levels.
</li>
</ul>
<h2 id="random-stuff">
Random stuff
</h2>
<p>
I've tried to outline above some of the key areas where DG applicants succeeded or rated poorly over the 130-odd DGs that I evaluated over the past three years. Bear in mind that I'm writing this just after the 2020 competition evaluations; NSERC may change the requirements and instructions in future years, so do confirm details with the NSERC website if you're submitting in November 2020 or later.
</p>
<p>
Here are a few general points that apply broadly when preparing your DG application:
</p>
<ul>
<li>
<p>
Read the instructions! These are currently provided in a poor format on the NSERC website. Do print out the <a href="https://www.nserc-crsng.gc.ca/ResearchPortal-PortailDeRecherche/Instructions-Instructions/DG-SD_eng.asp">Instructions for Completing an Application</a> web page for the DG programme and highlight any specific instructions, as they're often buried in the narrative text. Then be sure to revisit your highlights to ensure that you are doing what NSERC has requested of you.
</p>
</li>
<li>
<p>
Print out <a href="https://www.nserc-crsng.gc.ca/_doc/Professors-Professeurs/DG_Merit_Indicators_eng.pdf">The Grid</a> and refer to it often when preparing your DG application. Write your proposal to The Grid; the terminology might be obtuse and the differences between ratings obscure, but if it asks for things to be <em>evident</em> to get a Strong or <em>clearly evident</em> to get a Very Strong, make sure a reasonable Reader will think you provided <em>clear</em> evidence for a given indicator.
</p>
</li>
<li>
<p>
Read the <a href="https://www.nserc-crsng.gc.ca/NSERC-CRSNG/Reviewers-Examinateurs/IntroPRManual-IntroManuelEP_eng.asp">Peer Review Manual</a>; it's tedious, but it will help you prepare a DG application that is ready for Reader scrutiny if you take into account what it is that your Readers are required to do to assess your application. In particular, read the sections on the Merit Indicators as they provide more detail and nuance to the statements on The Grid.
</p>
</li>
<li>
<p>
The CCCV software is appalling and it takes a long time to prepare a good CCCV for NSERC DGs. Start early and complete it fully, taking into account specific instructions NSERC provides to you.
</p>
</li>
<li>
<p>
There are things you might have included on your generic CCCV that come through to your NSERC one but aren't needed. Don't delete these from the generic CV; instead, print off the final version, go through it, and check whether everything shown needs to be there. If it doesn't, exclude it in the NSERC version (you can un-check any individual entry of the CCCV to stop it being included on the NSERC Researcher CCCV). An example is extensive Journal Reviewer information; all DG applicants review for journals, so a detailed list of reviewing activities might obscure more senior or important contributions, such as reviewing for funding bodies or being on the editorial board of a journal.
</p>
</li>
<li>
<p>
Ask other researchers at your institution and colleagues at other institutions to read your DG application and give you feedback. Also get someone in your Research Office who is responsible for NSERC grants to read through and give you advice.
</p>
</li>
<li>
<p>
Your DG application is primarily evaluated by your five Readers. Those Readers will take into account the external reviews of your application, but your ratings will be primarily based on the Readers' evaluations. Don't be surprised if your final ratings don't mesh with the comments of an overly enthusiastic reviewer, who may not be as familiar with The Grid and the Merit Indicators as your Readers are.
</p>
</li>
</ul>
<h2 id="final-thoughts">
Final thoughts
</h2>
<p>
What struck me most, besides the general excellence of the applicants that I evaluated, is just how much care has gone into ensuring that the process is fair to everyone. As an applicant, your grant is evaluated by five careful and knowledgeable Readers plus at least one external reviewer. The NSERC Programme Officers and other staff are exceptional and take pride in running a process that is fair to everyone given the policy restrictions in play. NSERC DGs value so much more than how many Nature or Science papers you have and how many HQP you've trained. We might disagree over the extent to which the quality of other people, something beyond the applicant's ability to affect, should contribute to the rating of an individual grant, but given the policies that NSERC has pursued, everything that I witnessed during my time on EG 1503 assures me that this is a fair and inclusive process, rewarding a great many excellent researchers in Canada.
</p>
<p>
If you have questions about anything I have written above, please ask in the comments below or drop me an email; I'll do my best to answer them. Also, nothing I wrote above is official NSERC policy; these comments are mine and mine alone, but they do reflect what I have observed and learned in evaluating many DGs these past few years.
</p>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
By way of example, my current DG is $29,000 a year for five years, which includes a top-up as I was an Early Career Researcher (ECR) when I applied, and the top amount possible in 1503 is in the region of $170,000 a year if you can attain the top bin of Exceptional-Exceptional-Exceptional.<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
<li id="fn2">
<p>
For 1503 and a number of Evaluation Groups; other Evaluation Groups meet at different times in February. 1506 (Geoscience) met in the first week of February, and 1507 (Computer Science) met the second week of February, for example.<a href="#fnref2" class="footnote-back">↩</a>
</p>
</li>
<li id="fn3">
<p>
The Programme Officer knows the breakdown of the individual ratings but not the identity of who voted what.<a href="#fnref3" class="footnote-back">↩</a>
</p>
</li>
<li id="fn4">
<p>
ECRs are currently defined as being within five years of their first NSERC eligible position.<a href="#fnref4" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
Pivoting tidily
Gavin L. Simpson
2019-10-25T00:00:00-06:00
2019-10-25T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2019/10/25/pivoting-tidily/
<p>
One of the fun bits of my job is that I have actual time dedicated to helping colleagues and grad students with statistical or computational problems. Recently I've been helping one of our Lab Instructors with some data from their Plant Physiology Lab course. Whilst I was writing some R code to import the raw data for the lab from an Excel sheet, it occurred to me that this would be a good excuse to look at the new <code>pivot_longer()</code> and <code>pivot_wider()</code> functions from the <em>tidyr</em> package. In this post I show how these new functions facilitate common data processing steps; I was personally surprised how little data wrangling was actually needed in the end to read in the data from the lab.
</p>
<p>
In the lab course the students conduct an experiment to study the effect of the plant hormone <em>gibberellin</em> on plant growth. Over a number of weeks the students apply gibberellic acid (in two concentrations) or daminozide, a gibberellic acid antagonist, to the tips of the leaves of pea plants that are grown in a growth chamber with a 16-hour photoperiod. The students work in groups, with some of the groups growing the wild-type cultivar, whilst others work with a mutant dwarf cultivar. Each group has six plants per treatment level, and every seven days the students measure the height of each plant and the number of internodes that each plant has. On the last day of the experiment the plants are harvested and their fresh weight measured.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/plant-phys-plants-in-growth-room.jpg" alt="The pea plants from the 2019 Plant Physiology Lab course, toward the end of the experimental period" />
<figcaption>
The pea plants from the 2019 Plant Physiology Lab course, toward the end of the experimental period
</figcaption>
</figure>
<p>
Originally the data were recorded in a less than satisfactory way; let's just say the original data sheets would have been good candidates for one of Jenny Bryan's talks on spreadsheets. After being cleaned up a bit, we have something that looks like this in Excel
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/pivoting-tidily-raw-data.png" alt="Raw data in the Excel Workbook" />
<figcaption>
Raw data in the Excel Workbook
</figcaption>
</figure>
<p>
This isn't perfect as we have data in the column names (the numbers after the colons are the day of observation) but it is a pretty simple layout for the students to complete, and this is how we decided to ask the students to record the data during the 2019 lab course, so this is what we have to work with going forward.
</p>
<p>
Ultimately we want to be able to refer to columns named <code>height</code>, <code>internodes</code>, etc., depending on the statistical analysis the students will do, and we're going to need a column with the observation days in it.
</p>
<h2 id="pivoting">
Pivoting
</h2>
<p>
If you're not familiar with pivoting, it is important to realize that we can store the same data in a wide rectangle or a long (or tall) rectangle
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/original-dfs-tidy.png" alt="Examples of wide and long representations of the same data. Source: Garrick Aden-Buieās (@grrrck) Tidy Animated Verbs" />
<figcaption>
Examples of <em>wide</em> and <em>long</em> representations of the same data. Source: Garrick Aden-Buie's (<a href="https://twitter.com/grrrck">@grrrck</a>) <a href="https://github.com/gadenbuie/tidyexplain">Tidy Animated Verbs</a>
</figcaption>
</figure>
<p>
The same information is stored in both the long and wide representations, but the two representations differ in how useful they are for certain types of operation or how easily they can be used in a statistical analysis. It's also worth noting that there are more than just long or wide representations of the data; as we'll see shortly, the long representation of the Plant Physiology Lab data is too general and we'll need to arrange the data in a slightly wider form.
</p>
<p>
Moving between long and wide representations is known as <em>pivoting</em>. The animation below shows the general idea of how the cells in one format are rearranged into the other format, with the relevant metadata that doesn't get rearranged being extended or reduced as needed so we don't lose any information.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/tidyr-longer-wider.gif" alt="Pivoting between wide and long representations of the same data. Source: Garrick Aden-Buieās (@grrrck) Tidy Animated Verbs modified by Mara Averick (@dataandme)" />
<figcaption>
Pivoting between <em>wide</em> and <em>long</em> representations of the same data. Source: Garrick Aden-Buie's (<a href="https://twitter.com/grrrck">@grrrck</a>) <a href="https://github.com/gadenbuie/tidyexplain">Tidy Animated Verbs</a> modified by Mara Averick (<a href="https://twitter.com/dataandme">@dataandme</a>)
</figcaption>
</figure>
<p>
With the lab data I showed earlier, we're going to need to pivot from the original wide format into a longer format, just as the animation above shows. As we want to output an object that is <em>longer</em> than the input we will use the <code>pivot_longer()</code> function.
</p>
<p>
To start we will need to import the data from the <code>.xls</code> sheet, which I'll do using the <em>readxl</em> package
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'curl'</span><span class="p">)</span><span class="w"> </span><span class="c1"># download files</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'readxl'</span><span class="p">)</span><span class="w"> </span><span class="c1"># read from Excel sheets</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'tidyr'</span><span class="p">)</span><span class="w"> </span><span class="c1"># data processing</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w"> </span><span class="c1"># mo data processing</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'forcats'</span><span class="p">)</span><span class="w"> </span><span class="c1"># mo mo data processing</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'ggplot2'</span><span class="p">)</span><span class="w"> </span><span class="c1"># plotting</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="c1">## Load Data</span><span class="w">
</span><span class="n">tmp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="n">curl_download</span><span class="p">(</span><span class="s2">"https://github.com/gavinsimpson/plant-phys/raw/master/f18ph.xls"</span><span class="p">,</span><span class="w"> </span><span class="n">tmp</span><span class="p">)</span><span class="w">
</span><span class="n">plant</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_excel</span><span class="p">(</span><span class="n">tmp</span><span class="p">,</span><span class="w"> </span><span class="n">sheet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<p>
We have to download the data first, which I do using <code>curl_download()</code> from the <em>curl</em> package, because <code>read_excel()</code> doesn't currently know how to read from URLs.
</p>
<p>
Now we have our plant data within R, stored in a data frame
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plant</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 24 x 12
treatment cultivar plantid `height:0` `internodes:0` `height:7`
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 control wt 1 235 4 525
2 control wt 2 182 3 391
3 control wt 3 253 3 452
4 control wt 4 151 3 350
5 control wt 5 195 3 335
6 control wt 6 187 4 190
7 ga10 wt 1 250 4 458
8 ga10 wt 2 220 4 345
9 ga10 wt 3 180 2 300
10 ga10 wt 4 230 4 510
# … with 14 more rows, and 6 more variables: `internodes:7` <dbl>,
# `height:14` <dbl>, `internodes:14` <dbl>, `height:21` <dbl>,
# `internodes:21` <dbl>, `freshwt:21` <dbl></code></pre>
</figure>
<p>
To go to the long representation we have to tell <code>pivot_longer()</code> a couple of bits of information
</p>
<ul>
<li>
the name of the object to pivot,
</li>
<li>
which columns contain the data we want to pivot (or alternatively which columns not to pivot if that is easier),
</li>
<li>
the <em>name</em> we want to call the new column that will contain the <em>variable name</em> information from the original data, and
</li>
<li>
optionally, the name of the new column that will contain the data values. The default is to name this column <code>value</code> so you don't need to change this if you're happy with that.
</li>
</ul>
<p>
So, to get our wide plant data into a longer format we would do this
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pivot_longer</span><span class="p">(</span><span class="n">plant</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">),</span><span class="w"> </span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"variable"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 216 x 5
treatment cultivar plantid variable value
<chr> <chr> <dbl> <chr> <dbl>
1 control wt 1 height:0 235
2 control wt 1 internodes:0 4
3 control wt 1 height:7 525
4 control wt 1 internodes:7 5
5 control wt 1 height:14 810
6 control wt 1 internodes:14 10
7 control wt 1 height:21 1090
8 control wt 1 internodes:21 14
9 control wt 1 freshwt:21 7.2
10 control wt 2 height:0 182
# … with 206 more rows</code></pre>
</figure>
<p>
The <code>-(1:3)</code> is short-hand for excluding the first three columns of <code>plant</code> from the pivot. Here, we're creating a new variable called (imaginatively!) <code>variable</code>. As you can see we now have our data in a much longer representation, with a single column containing all of the observations that this group of students made.
</p>
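<p>
If you find the positional short-hand hard to read, the same call can be written with the id columns named explicitly instead; this is just an equivalent restatement of the call above using <em>tidyselect</em>:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## equivalent to pivot_longer(plant, -(1:3), names_to = "variable"),
## naming the id columns we *don't* want to pivot rather than counting them
pivot_longer(plant, cols = -c(treatment, cultivar, plantid),
             names_to = "variable")</code></pre>
</figure>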
<p>
However, we have a bit of a problem: some of the column names contain actual data that we want to use. While we have a column containing this information (it is not lost), the observation day or variable name information is not directly accessible in this format. What we could do is split the strings in this new <code>variable</code> column on <code>":"</code> and form two new columns from there.
</p>
<p>
Thankfully, this is such a common operation that <code>pivot_longer()</code> (and its predecessor, <code>gather()</code>) can do this for you; all you have to do is tell <code>pivot_longer()</code> what character to split on, and what names you want for the columns that result from splitting the strings up.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pivot_longer</span><span class="p">(</span><span class="n">plant</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">),</span><span class="w"> </span><span class="n">names_sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">":"</span><span class="p">,</span><span class="w"> </span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"variable"</span><span class="p">,</span><span class="s2">"day"</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 216 x 6
treatment cultivar plantid variable day value
<chr> <chr> <dbl> <chr> <chr> <dbl>
1 control wt 1 height 0 235
2 control wt 1 internodes 0 4
3 control wt 1 height 7 525
4 control wt 1 internodes 7 5
5 control wt 1 height 14 810
6 control wt 1 internodes 14 10
7 control wt 1 height 21 1090
8 control wt 1 internodes 21 14
9 control wt 1 freshwt 21 7.2
10 control wt 2 height 0 182
# … with 206 more rows</code></pre>
</figure>
<p>
The changes we made above were to specify <code>names_sep</code> with the correct separator, and to pass a vector of new column names to <code>names_to</code> rather than the single name we provided previously.
</p>
<p>
Those of you with good eyes may have noticed another problem that we would encounter if we stopped here. The <code>day</code> variable that was just created is stored as a character vector. It is likely that we'll want this information stored as a number if we're going to analyze the data. We can do the required conversion within the <code>pivot_longer()</code> call by specifying what the developers have started calling a <em>prototype</em> across many of the <em>tidyverse</em> packages. A prototype is an object that has the same properties that you want objects built from that prototype to take. Here we want the <code>day</code> variable as a column of integer numbers, so we set the prototype for this vector to <code>integer()</code> using the <code>names_ptypes</code> argument
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plant</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pivot_longer</span><span class="p">(</span><span class="n">plant</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">),</span><span class="w"> </span><span class="n">names_sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">":"</span><span class="p">,</span><span class="w"> </span><span class="n">names_to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"variable"</span><span class="p">,</span><span class="s2">"day"</span><span class="p">),</span><span class="w">
</span><span class="n">names_ptypes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">day</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">integer</span><span class="p">()))</span><span class="w">
</span><span class="n">plant</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 216 x 6
treatment cultivar plantid variable day value
<chr> <chr> <dbl> <chr> <int> <dbl>
1 control wt 1 height 0 235
2 control wt 1 internodes 0 4
3 control wt 1 height 7 525
4 control wt 1 internodes 7 5
5 control wt 1 height 14 810
6 control wt 1 internodes 14 10
7 control wt 1 height 21 1090
8 control wt 1 internodes 21 14
9 control wt 1 freshwt 21 7.2
10 control wt 2 height 0 182
# … with 206 more rows</code></pre>
</figure>
<p>
Notice that we pass <code>names_ptypes</code> a <em>named</em> list of prototypes, with the list name matching one or more of the variables listed in <code>names_to</code>.
</p>
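<p>
The prototype is only a convenience; a sketch of the equivalent two-step route, assuming we start again from the wide data, is to pivot with <code>day</code> left as character and convert it afterwards:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## pivot first, leaving day as a character column
plant <- pivot_longer(plant, -(1:3), names_sep = ":",
                      names_to = c("variable", "day"))
## then convert day to integer ourselves
plant <- mutate(plant, day = as.integer(day))</code></pre>
</figure>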
<p>
Now we have successfully wrangled the data into a long format and recovered the information hidden in the column names of the original data file. However, as it stands, we can't easily use the data in this format in a statistical model. We want the students on the course to analyze the data to estimate what effects the treatments have on the height of the plants over the course of the experiment. With the data in this long format we don't have a variable <code>height</code> containing just the height of the plants that we can refer to in, say, a linear model.
</p>
<p>
What we want is to create new columns for <code>height</code>, <code>internodes</code> and <code>freshwt</code> and pivot the <code>value</code> data out into those columns. As we're adding columns we're making the data wider, so we can use the <code>pivot_wider()</code> function to do what we want. Now we need to tell <code>pivot_wider()</code>
</p>
<ul>
<li>
where to take the <em>names</em> of the new variables that are going to be created <strong>from</strong>; here that's the <code>variable</code> column, and
</li>
<li>
where to take the <em>data</em> values <strong>from</strong> that are going to be put into these new columns; here, that's the <code>value</code> column
</li>
</ul>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plant</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pivot_wider</span><span class="p">(</span><span class="n">plant</span><span class="p">,</span><span class="w"> </span><span class="n">names_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">,</span><span class="w"> </span><span class="n">values_from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">)</span><span class="w">
</span><span class="n">plant</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 96 x 7
treatment cultivar plantid day height internodes freshwt
<chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
1 control wt 1 0 235 4 NA
2 control wt 1 7 525 5 NA
3 control wt 1 14 810 10 NA
4 control wt 1 21 1090 14 7.2
5 control wt 2 0 182 3 NA
6 control wt 2 7 391 5 NA
7 control wt 2 14 615 9 NA
8 control wt 2 21 810 12 3.8
9 control wt 3 0 253 3 NA
10 control wt 3 7 452 6 NA
# … with 86 more rows</code></pre>
</figure>
<p>
As with other <em>tidyverse</em> packages, we don't have to quote the names of the columns we want to pull data from.
</p>
<p>
There are a couple of other things we need to do to make the data fully useful:
</p>
<ol type="1">
<li>
it would be helpful to have a unique identifier for each individual plant; currently the <code>plantid</code> is just the values <code>1:6</code> repeated for each treatment group,
</li>
<li>
it would also be good practice to convert <code>treatment</code> into a factor, and to set the control treatment as the reference level against which the other treatment levels will be compared; if we didn't do that, the <code>b9</code> level (daminozide treatment) would be the reference level
</li>
</ol>
<p>
We can do those data processing steps quite easily now that we have the data imported and arranged nicely the way we want them
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plant</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">plant</span><span class="p">,</span><span class="w">
</span><span class="n">id</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="n">cultivar</span><span class="p">,</span><span class="w"> </span><span class="s2">"_"</span><span class="p">,</span><span class="w"> </span><span class="n">treatment</span><span class="p">,</span><span class="w"> </span><span class="s2">"_"</span><span class="p">,</span><span class="w"> </span><span class="n">plantid</span><span class="p">),</span><span class="w">
</span><span class="n">treatment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fct_relevel</span><span class="p">(</span><span class="n">treatment</span><span class="p">,</span><span class="w"> </span><span class="s1">'control'</span><span class="p">))</span><span class="w">
</span><span class="n">plant</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 96 x 8
treatment cultivar plantid day height internodes freshwt id
<fct> <chr> <dbl> <int> <dbl> <dbl> <dbl> <chr>
1 control wt 1 0 235 4 NA wt_control_1
2 control wt 1 7 525 5 NA wt_control_1
3 control wt 1 14 810 10 NA wt_control_1
4 control wt 1 21 1090 14 7.2 wt_control_1
5 control wt 2 0 182 3 NA wt_control_2
6 control wt 2 7 391 5 NA wt_control_2
7 control wt 2 14 615 9 NA wt_control_2
8 control wt 2 21 810 12 3.8 wt_control_2
9 control wt 3 0 253 3 NA wt_control_3
10 control wt 3 7 452 6 NA wt_control_3
# … with 86 more rows</code></pre>
</figure>
<p>
Here I just pasted together the <code>cultivar</code>, <code>treatment</code> and <code>plantid</code> information into a unique id for each individual plant. This won't be used directly by the students in any analysis they do, as this is a second-year course and they don't know about mixed models (yet), but it is handy to have this <code>id</code> available for plotting. The <code>treatment</code> variable is converted to a factor and the reference level set to <code>"control"</code> using the <code>fct_relevel()</code> function from the <em>forcats</em> package.
</p>
<p>
The students will do one other step before proceeding to look at the data: each sheet in the <code>.xls</code> file contains observations from a single group, and hence a single cultivar, and we want the students to compare cultivars. So they will repeat the steps above to import a second sheet of data containing observations on the cultivar they didn't work with, and then stick the two data sets together; a sketch of that step follows below.
</p>
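<p>
For completeness, here is that skipped step; I'm assuming the second group's data are in sheet 2 of the same workbook and are formatted identically, so adjust the sheet index as needed:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## import the second group's sheet and repeat the wrangling steps
plant2 <- read_excel(tmp, sheet = 2)
plant2 <- pivot_longer(plant2, -(1:3), names_sep = ":",
                       names_to = c("variable", "day"),
                       names_ptypes = list(day = integer()))
plant2 <- pivot_wider(plant2, names_from = variable, values_from = value)
plant2 <- mutate(plant2,
                 id = paste0(cultivar, "_", treatment, "_", plantid),
                 treatment = fct_relevel(treatment, 'control'))
## stick the two data sets together
plant_all <- bind_rows(plant, plant2)</code></pre>
</figure>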
<p>
If you're interested, this is what the data look like, for a single cultivar and single group
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">plant</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">day</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">height</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">treatment</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Height (mm)'</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Day'</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Treatment'</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/pivoting-tidily-plot-1.png" alt="Plot of the plant growth data" />
<figcaption>
Plot of the plant growth data
</figcaption>
</figure>
<p>
(and now you can see why I needed a unique plant identifier even though the students will essentially ignore this clustering in the data when they analyse it.)
</p>
<p>
The <code>.xls</code> file we downloaded at the start of the script contains multiple sheets all formatted the same way, so we could pull all the data into one big analysis if we wanted, but in the lab we're just giving the students one set of wild-type and mutant cultivars. I'm grateful to <a href="https://www.uregina.ca/science/biology/people/instructors/davis-maria.html">Dr. Maria Davis</a>, the lab instructor for the course, for making the data from the course available to anyone who wants to use it; if you do use it, be sure to give Maria and the 2018 cohort of BIOL266 Plant Physiology students at the University of Regina an acknowledgement.
</p>
<p>
If you're interested in the statistical analyses that we'll be getting the students to do in the lab, I have an (at the time of writing this, almost finished) <code>Rmd</code> file in the <a href="https://github.com/gavinsimpson/plant-phys">GitHub repo</a> for the lab course with all the instructions. It's pretty simple ANOVA and ANCOVA analyses, but we do get the students to do <em>post hoc</em> testing using the excellent <em>emmeans</em> package, if you're interested.
</p>
<p>
Finally, none of the data wrangling I did above is that complex, and I certainly didn't need to use <em>tidyr</em> and <em>dplyr</em> etc. to achieve the result I wanted. It is quite trivial to do this pivoting and wrangling in base R; we could just use the <code>reshape()</code> function, <code>strsplit()</code>, etc. However, if you've ever used <code>reshape()</code> you'll know that the argument names for that function make no sense to anyone except perhaps the person that wrote the function. The real advantage of doing the wrangling using <em>tidyr</em> and <em>dplyr</em> is that we end up with code that is much easier to read and understand, which is very important for students on these courses, who will have had little to no exposure to programming and related data science techniques. A rough base R version is sketched below.
</p>
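<p>
Here is roughly what that base R route looks like; a sketch only, assuming <code>plant</code> still holds the original wide data as read from the Excel sheet:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## stack the nine measurement columns into a single value column
cols <- names(plant)[-(1:3)]
long <- reshape(as.data.frame(plant), direction = "long",
                varying = list(cols), v.names = "value",
                timevar = "variable", times = cols,
                idvar = c("treatment", "cultivar", "plantid"))
## recover the variable name and observation day from the old column names
parts <- strsplit(long$variable, ":", fixed = TRUE)
long$day <- as.integer(vapply(parts, `[[`, character(1), 2L))
long$variable <- vapply(parts, `[[`, character(1), 1L)</code></pre>
</figure>
<p>
It works, but compare the argument names of <code>reshape()</code> with those of <code>pivot_longer()</code> and the readability point rather makes itself.
</p>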
<p>
Anyway, happy pivoting!
</p>
radian: a modern console for R
Gavin L. Simpson
2019-06-18T00:00:00-06:00
2019-06-18T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2019/06/18/radian-console-for-r/
<p>
Whenever I'm developing R code or writing data wrangling or analysis scripts for research projects that I work on I use <em>Emacs</em> and its add-on package <a href="https://ess.r-project.org/"><em>Emacs Speaks Statistics</em></a> (<em>ESS</em>). I've done so for nigh on a couple of decades now, ever since I switched full time to running Linux as my daily OS. For years this has served me well, though I wouldn't call myself an <em>Emacs</em> expert; not even close! With a bit of help from some R Core coding standards document I got indentation working how I like it, I learned to contort my fingers in weird and wonderful ways to execute a small set of useful shortcuts, and I even committed some of those shortcuts to memory. More recently, however, my go-to methods for configuring <em>Emacs+ESS</em> were failing; indentation was all over the shop, the smart <code>_</code> stopped working or didn't work as it had for over a decade, syntax highlighting of R-related files, like <code>.Rmd</code>, was hit and miss, and <em>polymode</em> was just a mystery to me. Configuring <em>Emacs+ESS</em> was becoming much more of a chore, and rather unhelpfully, my problems coincided with my having less and less time to devote to tinkering with my computer setups. Also, fiddling with this stuff just wasn't fun any more. So, in a fit of pique following one too many reconfiguration sessions of <em>Emacs+ESS</em>, I went in search of some greener grass. During that search I came across <a href="https://github.com/randy3k/radian">radian</a>, a neat, attractive, simple console for working with R.
</p>
<p>
Written by <a href="https://github.com/randy3k">Randy Lai</a>, <em>radian</em> is a cross-platform console for R that provides code completion, syntax highlighting, etc. in a neat little package that runs in a shell or terminal, such as Bash. I'm someone who fires up multiple terminals every day to run some bit of R code, to show a student how to do something, to quickly check on argument names or such like, or to prepare an answer to a question on <a href="https://stackoverflow.com">stackoverflow</a> or <a href="https://stats.stackexchange.com">crossvalidated</a>. Running R in a terminal after using an IDE/environment like <em>Emacs+ESS</em> or <em>RStudio</em> is an exercise in time travel; all those little helpful editing tools the IDE provides are missing and you're coding like it was the 1980s all over again. <em>radian</em> changes all that.
</p>
<p>
<em>radian</em> is a Python application, so to run it you'll need a Python stack installed. You'll also need a relatively recent version of R (≥ 3.4.0). Using <code>pip</code>, the Python package installer, installing <em>radian</em> is straightforward. Python v3 is recommended, and on Fedora this means I had to install it using
</p>
<pre><code>pip-3 install --user radian</code></pre>
<p>
The <code>--user</code> flag does a user install, which sets the installation location to be inside your home directory. Once installed, you can start <em>radian</em> by simply typing the application name and hitting enter
</p>
<pre><code>radian</code></pre>
<p>
A nice configuration tip included in the <em>radian</em> <code>README.md</code> is to alias the <code>radian</code> command to <code>r</code>, so that running <code>R</code> runs the standard <em>R</em> console, while running <code>r</code> starts <em>radian</em>. On Fedora, you configure this alias in your <code>~/.bashrc</code> file
</p>
<pre><code>alias r="radian"</code></pre>
<p>
Having started <em>radian</em> you'll see something like this
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/radian-startup.png" alt="radian at start-up running in a bash shell on Fedora" />
<figcaption>
<em>radian</em> at start-up running in a <em>bash</em> shell on Fedora
</figcaption>
</figure>
<p>
<em>radian</em> starts up with a simple statement of the R version running in <em>radian</em> and the platform (OS) it's running on; so is it just a less-verbose version of the standard R console? The <em>radian</em> prompt hints at greater capabilities, however.
</p>
<p>
Code completion is a nice addition; yes, you have some form of code completion in the standard R console, but in <em>radian</em> we have a more <em>RStudio</em>- or <em>Emacs+ESS</em>-like experience with a drop-down menu for object, function, argument, and filename completion. To activate this you start typing, hit <kbd>Tab</kbd>, and the relevant completions pop up. Hit <kbd>Tab</kbd> again or press the down cursor and you can scroll through the potential completions.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/radian-completion.png" alt="Code completion in radian" />
<figcaption>
Code completion in <em>radian</em>
</figcaption>
</figure>
<p>
We also get nice syntax highlighting of R code using the colour schemes from <a href="https://help.farbox.com/pygments.html"><em>pygments</em></a>:
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/radian-syntax-highlighting.png" alt="Syntax highlighting in radian using the monokai theme" />
<figcaption>
Syntax highlighting in <em>radian</em> using the <em>monokai</em> theme
</figcaption>
</figure>
<p>
And, if you're copying & pasting code into the terminal, or piping code in from an editor with an embedded terminal (that's running <em>radian</em>), then you also get rather handy multiline editing. Pressing the up cursor <kbd>↑</kbd> will retrieve the previous set of commands pasted or piped into <em>radian</em>, and repeatedly pressing <kbd>↑</kbd> will scroll back through the history. If you want to edit a set of R calls, instead of pressing <kbd>↑</kbd> again, press <kbd>↓</kbd> to enter the chunk of code; then you can move around among the lines using the cursor keys, editing as you see fit. Hitting enter will run the entire chunk of code for you, edits and all:
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/radian-multiline-editing.gif" alt="Multiline editing a ggplot call in radian" />
<figcaption>
Multiline editing a <em>ggplot</em> call in <em>radian</em>
</figcaption>
</figure>
<p>
You can configure aspects of the behaviour of <em>radian</em> via <code>options()</code> in your <code>.Rprofile</code>. The options I'm currently using on the computer used for the screenshots in this post are:
</p>
<pre><code>options(radian.auto_indentation = FALSE)
options(radian.color_scheme = "monokai")</code></pre>
<p>
but on my laptop I'm currently using
</p>
<pre><code># auto match brackets and quotes
options(radian.auto_match = TRUE)
# auto indentation for new line and curly braces
options(radian.auto_indentation = TRUE)
options(radian.tab_size = 4)
# timeout in seconds to cancel completion if it takes too long
# set it to 0 to disable it
options(radian.completion_timeout = 0.05)
# insert new line between prompts
options(radian.insert_new_line = FALSE)</code></pre>
<p>
The last option is something I'm not sure about yet; as you can see in the screenshots, there's a new line between the prompts, which makes it super easy to read the R code you've entered, but with the font I'm currently using (<a href="https://github.com/be5invis/Iosevka">Iosevka</a>) things look a bit too spread out. Setting <code>radian.insert_new_line = FALSE</code>, as I have it on the laptop, results in more standard behaviour but it can feel a little cramped. I'll probably play with both options and see which I like best after a few more weeks of use.
</p>
<p>
You can also define shortcuts. This is useful for entering the assignment operator <code><-</code>, which I have bound to <kbd>Alt</kbd> + <kbd>-</kbd> using
</p>
<pre><code>options(radian.escape_key_map = list(
list(key = "-", value = " <- ")
))</code></pre>
<p>
where I've added spaces around the operator to mimic how the smart underscore works in <em>Emacs+ESS</em>.
</p>
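<p>
The key map is a list, so in principle further bindings can be added in the same way. The second entry below is an assumption on my part about how far the option stretches, not something taken from the <em>radian</em> docs, but it shows the idea for the <em>magrittr</em> pipe:
</p>
<pre><code>options(radian.escape_key_map = list(
    list(key = "-", value = " <- "),  # Alt + - inserts the assignment operator
    list(key = "m", value = " %>% ")  # Alt + m inserts the magrittr pipe
))</code></pre>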
<p>
I'm really liking using <em>radian</em> for the throw-away R sessions that I typically do in a terminal. The only issue I've noticed is that it is a little slow to print tibbles, and clearly it's not going to replace my current IDE; that's not what it is designed for. That said, <em>radian</em> can be run inside any app that can run a terminal, and I've had it running inside <a href="https://code.visualstudio.com/">VS Code</a> for example, which was nice.
</p>
<p>
If you have any comments on <em>radian</em> or other R consoles, let me know what you think below; if you've used <em>radian</em> I'm especially interested in your experience with it.
</p>
Tibbles, checking examples, & character encodings
Gavin L. Simpson
2019-01-22T07:00:00-06:00
2019-01-22T07:00:00-06:00
https://www.fromthebottomoftheheap.net/2019/01/22/using-tibbles-and-example-checking/
<p>
Recently I've been preparing my <a href="https://gavinsimpson.github.io/gratia/"><strong>gratia</strong> package</a> for submission to CRAN. During my pre-flight testing I noticed an issue under Windows when checking the examples in the package against the reference output I generated on Linux. In the latest release of the <a href="https://tibble.tidyverse.org/"><strong>tibble</strong> package</a>, the way tibbles are printed has changed subtly and in a way that leads to cross-platform differences. As I write this, tibbles with more than a set number of rows are printed in a truncated form, showing only the first 10 rows of data. In such cases, a final line is printed with an ellipsis and a note as to how many more rows are in the tibble. It was this ellipsis that was causing the cross-platform issue, where differences between the output generated on Windows and the reference output were being identified during <code>R CMD check</code> on Windows. If this is causing you an issue, here's one way to solve the problem.
</p>
<p>
The problem is this:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'tibble'</span><span class="p">)</span><span class="w">
</span><span class="n">as_tibble</span><span class="p">(</span><span class="n">iris</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# … with 140 more rows</code></pre>
</figure>
<p>
Note that little ellipsis on the last line. Yes, those three little dots; that … was what was causing all the trouble. Don't get me wrong, I'm all on board when it comes to proper typography, but for something so small, that one … caused a good deal of hair-pulling as I prepared my package for a clean submission to CRAN!
</p>
<p>
On Windows you won't see that cute little …; instead you'll see this
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">as_tibble</span><span class="p">(</span><span class="n">iris</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 140 more rows</code></pre>
</figure>
<p>
Yes, a rather ugly, second-rate approximation of the …, I think you'll agree!
</p>
<p>
I have to thank Brodie Gaslam (<a href="https://twitter.com/BrodieGaslam">@BrodieGaslam</a> on Twitter) for identifying the source of the difference between output on Linux and Windows and for suggesting the solution I show below. What Brodie identified was that the <a href="https://github.com/r-lib/cli"><strong>cli</strong></a> package, which <strong>tibble</strong> uses to show this ellipsis, contains code to determine what system it is running on and to adjust its output accordingly. So, on Linux you see <code>…</code> and on Windows you see <code>...</code>, <em>because</em> (I assume) many Windows systems aren't set up to understand what <code>…</code> is. What I see on Linux is thanks to Unicode (specifically I have UTF-8 encoding in my Linux sessions), but this doesn't work (or not as easily) on Windows, which defaults to a different character set or encoding, and which has no idea what <code>…</code> is.
</p>
<p>
As it turns out, there doesn't appear to be a simple way to make Windows understand the Unicode ellipsis, certainly not on the CRAN Windows build system. But what we can do, which is what Brodie mentioned to me on Twitter, is to set a global option that the <strong>cli</strong> package looks for to control its behaviour on <strong>Linux</strong>. That's right, we're going to reduce the output generated under <code>R CMD check</code> to the lowest common denominator; the user will still get the benefit of the fancy typography that <strong>cli</strong> affords their R sessions, but we don't need that fanciness for checking the examples.
</p>
<p>
The option you need is <code>cli.unicode</code> and it needs to be set to <code>FALSE</code>. Here it is in action
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">as_tibble</span><span class="p">(</span><span class="n">iris</span><span class="p">)</span><span class="w">
</span><span class="n">op</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">options</span><span class="p">(</span><span class="n">cli.unicode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">as_tibble</span><span class="p">(</span><span class="n">iris</span><span class="p">)</span><span class="w">
</span><span class="n">options</span><span class="p">(</span><span class="n">op</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# … with 140 more rows
# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 140 more rows</code></pre>
</figure>
<p>
To make this work in an example, you will want to include it in a <code>\dontshow{}</code> block, which will not show up in the help for the function, nor be echoed when a user runs the example via <code>example()</code>; the option gets set during testing via <code>R CMD check</code> without the user ever seeing it.
</p>
<p>
In the <a href="https://github.com/klutometis/roxygen"><strong>roxygen2</strong></a> sources for <a href="https://github.com/gavinsimpson/gratia/blob/master/R/derivatives.R#L57-L69">my example</a> I now have
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">@</span><span class="n">examples</span><span class="w">
</span><span class="err">\</span><span class="n">dontshow</span><span class="p">{</span><span class="w">
</span><span class="n">op</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">options</span><span class="p">(</span><span class="n">cli.unicode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># do something here</span><span class="w">
</span><span class="err">\</span><span class="n">dontshow</span><span class="p">{</span><span class="n">options</span><span class="p">(</span><span class="n">op</span><span class="p">)}</span></code></pre>
</figure>
<p>
This idiom is required to handle more issues than just this character encoding problem. Most of my examples use simulated data, so I need a <code>set.seed()</code> call inside <code>\dontshow{}</code>. If you are showing the results of any statistical model, you'll already be reducing the number of digits shown in the output via <code>options(digits = 5)</code>, as CRAN gets annoyed if you are checking results to silly levels of precision. The user doesn't need to see any of this; a combined block is sketched below.
</p>
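<p>
Putting those pieces together, a hypothetical <code>@examples</code> block that hides all of this housekeeping from the user might look like the following; the visible example code is elided here:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">@examples
\dontshow{
## everything the user doesn't need to see goes in one hidden block
op <- options(cli.unicode = FALSE, digits = 5)
set.seed(1)
}
# ...the visible example code goes here...
\dontshow{options(op)}</code></pre>
</figure>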
<p>
I should note that I'm not using this checking of example output as a true unit test (there are loads of those in the <code>/tests</code> folder of the package, thank you very much), but I do still think that checking the output of examples against reference output is useful. At the very least it doesn't (usually) hurt to check the output when it's being generated as part of the checks anyway. I also want useful examples, so I tend to show snippets of output as part of the example. Having the comparison between expected and actual output is a handy check on what I'm presenting to the user.
</p>
<p>
Hopefully this is useful to people coming across the same or similar issues with their packages. And thanks again to Brodie for explaining what the problem was.
</p>
What's wrong with software paper preprints on EarthArXiv?
Gavin L. Simpson
2018-12-20T00:00:00-06:00
2018-12-20T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2018/12/20/what-is-wrong-with-software-paper-preprints-at-eartharxiv/
<p>
Via <a href="https://twitter.com/geschichtenpost/status/1075747221625339904">Twitter</a> I recently found out that <a href="https://eartharxiv.github.io/index.html">EarthArXiv</a>, a new preprint server for the geosciences, doesn't accept software paper submissions. Actually, EarthArXiv <a href="https://eartharxiv.github.io/moderation.html">doesn't accept quite a few types of publication</a>; some justifiably, like <em>ad hominem</em> attack pieces, others unjustifiably, like correspondence or opinion pieces. I find this general stance very odd indeed; commentary, editorial or opinion pieces and software papers are accepted in a large number of the general and specialized journals that serve the geoscience field, so why wouldn't EarthArXiv want to host these prior to publication of the version of record in one of those journals?
</p>
<p>
The commentary issue bothers me a lot; there is far too little commentary in the geoscience literature and, unless our field is unlike any other, there is a lot to critique. Yet typically this correspondence never sees the light of day, or is subject to such draconian restrictions on length that the discussants rarely have the opportunity to fully articulate their concerns or defend their positions. And that's assuming the commentary is submitted within the time window allowed by the journal, or that the editor decides to allow the commentary in the first place. Accepting commentary on published, peer-reviewed articles would be a good first step in promoting collegial academic discussion in the literature. Deity knows we need it!
</p>
<p>
Anyway, back to what really annoyed me this morning: not accepting software papers.
</p>
<p>
I was pleased to see that <a href="https://eartharxiv.github.io/moderation.html#software">"EarthArXiv supports scientific software development and citation"</a>. That's good to know, because the impression that I'm left with is that software papers aren't the right sort of thing for EarthArXiv. No reasons for this stance are given beyond the nebulous "Yet, software papers often follow citation standards that differ from research and data papers." Citation standards also differ for data, but data papers are acceptable (which is a good thing!) at EarthArXiv. So what's the problem with software papers?
</p>
<p>
I'd like to know because I'm biased; I write a lot of software that is freely available to the community under permissive open source licences. I'm far from being the only one. If researchers who use my or others' software to analyze their data or prepare their figures can submit preprints to EarthArXiv, why are we barred from submitting preprints about that software? It makes no sense to me.
</p>
<p>
EarthArXiv does give some <a href="https://eartharxiv.github.io/moderation.html#software">useful tips</a> on what you as a software author can do instead:
</p>
<ul>
<li>
Use GitHub – this really should be "Use version control"!!
</li>
<li>
Mint a DOI for the repo on Zenodo
</li>
<li>
Publish a paper in <a href="https://openresearchsoftware.metajnl.com/"><acronym title="Journal of Open Research Software">JORS</acronym></a> or <a href="https://joss.theoj.org/"><acronym title="Journal of Open Source Software">JOSS</acronym></a> – have you ever seen a paper from either of these? I have, and what they do is great, but JOSS, and to a lesser extent JORS, don't publish the kinds of detail one typically finds in a software paper at, say, Methods in Ecology and Evolution, where the reasons behind method choice or implementation details are regularly presented. They then say "You will now have a citable 'paper'" – why the scare-quotes? Do the moderators at EarthArXiv not think such papers are real papers?
</li>
</ul>
<p>
The final bit of advice is:
</p>
<blockquote>
<p>
If you really want a software paper on EarthArXiv such that Earth scientists can find it, then we recommend doing all the above plus writing up a short PDF with some Earth science examples showing off the utility. That EarthArXiv PDF would cite the Journal of Open Source Software report
</p>
</blockquote>
<p>
Isn't that the very definition of a software paper?
</p>
<p>
This leaves me with the impression that the EarthArXiv moderators have a very particular type of software paper in mind and haven't considered – or are not aware of – the broader forms of software papers. One of my software papers is <span class="citation" data-cites="Simpson2007-ya">Simpson (2007)</span>, which describes how to use my <strong>analogue</strong> R package. Another example is <span class="citation" data-cites="Goring2015-vf">Goring et al. (2015)</span>, in which we describe and illustrate how to use the <strong>neotoma</strong> R package to access the eponymous database, <a href="https://www.neotomadb.org/">Neotoma DB</a>. Those are more typical of the software papers that I am familiar with. Significant effort goes into preparing these papers, easily as much as any other type of research paper. Papers like this serve very different needs than those published by JOSS. Neither of those papers was freely available to colleagues (IIRC) during the review process. A preprint on EarthArXiv would have served the community well.
</p>
<p>
It is frustrating in the extreme that papers like the two personal examples above would not be welcome on EarthArXiv.
</p>
<p>
I do hope that the people at EarthArXiv reconsider their stance on software papers and other types of scholarly work, especially commentary pieces.
</p>
<h2 id="references" class="unnumbered">
References
</h2>
<div id="refs" class="references">
<div id="ref-Goring2015-vf">
<p>
Goring, S., Dawson, A., Simpson, G. L., Ram, K., Graham, R. W., Grimm, E. C., et al. (2015). Neotoma: A programmatic interface to the Neotoma paleoecological database. <em>Open Quaternary</em> 1, 1–17. doi:<a href="https://doi.org/10.5334/oq.ab">10.5334/oq.ab</a>.
</p>
</div>
<div id="ref-Simpson2007-ya">
<p>
Simpson, G. L. (2007). Analogue methods in palaeoecology: Using the analogue package. <em>Journal of Statistical Software</em> 22, 1–29.
</p>
</div>
</div>
Confidence intervals for GLMs
Gavin L. Simpson
2018-12-10T08:00:00-06:00
2018-12-10T08:00:00-06:00
https://www.fromthebottomoftheheap.net/2018/12/10/confidence-intervals-for-glms/
<p>
You've estimated a GLM or a related model (GLMM, GAM, etc.) for your latest paper and, like a good researcher, you want to visualise the model and show the uncertainty in it. In general this is done using confidence intervals, typically with 95% coverage. If you remember a little bit of theory from your stats classes, you may recall that such an interval can be produced by adding to and subtracting from the fitted values 2 times their standard error. Unfortunately this only really works like this for a linear model. If I had a dollar (even a Canadian one) for every time I've seen someone present graphs of estimated abundance of some species where the confidence interval includes negative abundances, I'd be rich! Here, following the rule of "if I'm asked more than once I should write a blog post about it!", I'm going to show a simple way to correctly compute a confidence interval for a GLM or a related model.
</p>
<h3 id="why-is-plusminus-two-standard-errors-wrong">
Why is plus/minus two standard errors wrong?
</h3>
<p>
Well, it's not! However, the main reason why people mess up computing confidence intervals for a GLM is that they do all the calculations on the <em>response</em> scale. This results in symmetric intervals on this scale and the very real possibility that the intervals will include values that are nonsensical, like negative abundances and concentrations, or probabilities that are outside the limits of 0 and 1.
</p>
<p>
Think about a Poisson GLM fitted to some species abundance data. In this model there is an implied mean-variance relationship; as the mean count increases so does the variance. In fact, in the Poisson GLM, the mean and variance are the same thing. The implication of this is that as the mean tends to zero, so must the variance. If we had an expected count of zero the variance would also be zero, and our uncertainty about this value would also be zero. However, our model won't ever return expected (fitted) values that are exactly equal to zero; it might yield values that are very close to zero, but never exactly zero. In that case we do have some uncertainty about this fitted value; the uncertainty on the lower end has to logically fit somewhere between the small estimated value and zero, but not exactly zero as we're not creating an interval with 100% coverage.
</p>
<p>
We might also logically expect greater uncertainty above the fitted value, for our upper limit on the confidence interval; we're saying that the true expected abundance is possibly somewhat larger than the fitted value and, due to the mean-variance relationship, a larger fitted value is a larger mean value, which implies a larger variance, and consequently a larger amount of uncertainty above the fitted value than below.
</p>
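<p>
As a quick illustrative sketch of this mean-variance relationship (nothing specific to the data we'll use below), we can simulate Poisson counts for a few means and check that the sample variance tracks the mean
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## illustrative sketch: the sample variance of Poisson draws tracks the mean
set.seed(42)
mu <- c(0.1, 1, 10, 100)
vapply(mu, function(m) var(rpois(100000, lambda = m)), numeric(1))
## each value should be approximately equal to the corresponding mean in `mu`</code></pre>
</figure>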
<p>
Similar arguments can be made for models where there are both upper and lower limits to the response, such as binomial models where the response is a probability bounded between 0 and 1. As the fitted value approaches either boundary the uncertainty about the fitted value in the direction of the boundary gets squished up and the asymmetry of the confidence interval increases.
</p>
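<p>
A small numerical sketch makes this concrete: take a symmetric interval of plus/minus two standard errors on the logit scale (the values here are arbitrary, chosen for illustration) and backtransform it with <code>plogis()</code>, the inverse logit; near the boundary the backtransformed interval is strongly asymmetric
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## illustrative values: a fitted value and standard error on the link scale
eta <- -3
se  <- 1
plogis(c(lwr = eta - 2 * se, fit = eta, upr = eta + 2 * se))
##    lwr    fit    upr
## 0.0067 0.0474 0.2689  -- squashed towards 0 below, stretched above</code></pre>
</figure>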
<p>
To illustrate, I'll use a simple data set on wasp visits to leaves of the Cobra Lily, <em>Darlingtonia californica</em>. The data are on my blog and I've created a short link using bitly.com. If you want to follow along, load the data and some packages as shown
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## packages</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'readr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'tibble'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'ggplot2'</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">wasp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'http://bit.ly/cobralily'</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">wasp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">wasp</span><span class="p">,</span><span class="w"> </span><span class="n">lvisited</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.logical</span><span class="p">(</span><span class="n">visited</span><span class="p">))</span></code></pre>
</figure>
<p>
The experiment used timed censuses of visitations by wasps to leaves of the Cobra Lily. These data come from Gotelli & Ellison's text book <a href="https://global.oup.com/academic/product/a-primer-of-ecological-statistics-9781605350646?cc=ca&lang=en&"><em>A Primer of Ecological Statistics</em></a>. Whether or not a wasp visited a leaf during the census was recorded, along with the height of the leaf from the ground. The aim is to test the hypothesis that the probability of leaf visitation increases with leaf height.
</p>
<p>
Let's jump right in and fit the GLM, a logistic regression model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">mod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="n">lvisited</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">leafHeight</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wasp</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">binomial</span><span class="p">())</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Call:
glm(formula = lvisited ~ leafHeight, family = binomial(), data = wasp)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.18274 -0.46820 -0.23897 -0.08519 1.90573
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.29295 2.16081 -3.375 0.000738 ***
leafHeight 0.11540 0.03655 3.158 0.001591 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 46.105 on 41 degrees of freedom
Residual deviance: 26.963 on 40 degrees of freedom
AIC: 30.963
Number of Fisher Scoring iterations: 6</code></pre>
</figure>
<p>
Now create a basic plot of the data and estimated model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## some data to predict at: 100 values over the range of leafHeight</span><span class="w">
</span><span class="n">ndata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">wasp</span><span class="p">,</span><span class="w"> </span><span class="n">data_frame</span><span class="p">(</span><span class="n">leafHeight</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">leafHeight</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">leafHeight</span><span class="p">),</span><span class="w">
</span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)))</span><span class="w">
</span><span class="c1">## add the fitted values by predicting from the model for the new data</span><span class="w">
</span><span class="n">ndata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">add_column</span><span class="p">(</span><span class="n">ndata</span><span class="p">,</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ndata</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'response'</span><span class="p">))</span><span class="w">
</span><span class="c1">## plot it</span><span class="w">
</span><span class="n">plt</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">ndata</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">leafHeight</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_rug</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">visited</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lvisited</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wasp</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Visited'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Leaf height (cm.)'</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Probability of visitation'</span><span class="p">)</span><span class="w">
</span><span class="n">plt</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/confidence-intervals-for-glms-darlingtonia-plot-fit-1.png" alt="Estimated probability of visitation as a function of leaf height." />
<figcaption>
Estimated probability of visitation as a function of leaf height.
</figcaption>
</figure>
<p>
Next, to illustrate the issue, I'll create the confidence interval the <em>wrong</em> way
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## add standard errors</span><span class="w">
</span><span class="n">ndata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">add_column</span><span class="p">(</span><span class="n">ndata</span><span class="p">,</span><span class="w"> </span><span class="n">wrong_se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ndata</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'response'</span><span class="p">,</span><span class="w">
</span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="o">$</span><span class="n">se.fit</span><span class="p">)</span><span class="w">
</span><span class="c1">## compute a 95% interval the wrong way</span><span class="w">
</span><span class="n">ndata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">ndata</span><span class="p">,</span><span class="w"> </span><span class="n">wrong_upr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">wrong_se</span><span class="p">),</span><span class="w"> </span><span class="n">wrong_lwr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">wrong_se</span><span class="p">))</span></code></pre>
</figure>
<p>
and plot the resulting interval
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plt</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ndata</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wrong_lwr</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wrong_upr</span><span class="p">),</span><span class="w">
</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/confidence-intervals-for-glms-add-wrong-interval-1.png" alt="Estimated probability of visitation as a function of leaf height with an incorrectly-computed 95% confidence interval superimposed. Notice the interval exceeds the probability limits, 0 and 1." />
<figcaption>
Estimated probability of visitation as a function of leaf height with an incorrectly-computed 95% confidence interval superimposed. Notice the interval exceeds the probability limits, 0 and 1.
</figcaption>
</figure>
<p>
That's problematic because for significant sections of <code>leafHeight</code> our uncertainty interval breaks the laws of probability.
</p>
<p>
So, when creating confidence intervals we should expect asymmetric confidence intervals that respect the physical limits of the values that the response variable can take. If they don't, then you've probably computed them the wrong way.
</p>
<p>
The previous paragraphs walked through a logical reason why confidence intervals are not symmetric on the response scale. There is a theoretical one too: the justification for adding/subtracting two times the standard error is derived for models where the response is conditionally Gaussian, and it doesn't really work properly when the response is not. You only need to realise that a confidence interval that includes impossible values can't possibly have the coverage properties claimed, because some part of it lies in a space of values that just won't ever be observed.
</p>
<h3 id="confidence-intervals-the-right-way">
Confidence intervals the right way
</h3>
<p>
How do we create correct confidence intervals?
</p>
<p>
A simple solution is to create the interval on the scale of the link function and not the response scale. On the link scale, we're essentially treating the model as a fancy linear one anyway; we assume that things are approximately Gaussian here, at least with very large sample sizes. Given that assumption, we can create a confidence interval as the fitted value plus or minus two times the standard error on the link scale, and then use the inverse of the link function to map the fitted values and the upper and lower limits of the interval back on to the response scale.
</p>
<p>
If you paid attention in your stats classes, you might know that the default link for the Poisson GLM is the log link. You might also know that the inverse of taking logs is exponentiation. You may even know that exponentiation is done in R using the <code>exp()</code> function. But what's the inverse of the logit function, which was the link used in our model for leaf visitation? Even if you knew what the correct mathematical function was, would you know what R function to use for this? And I defy most readers to know what the inverse of the complementary log-log link function is, which we could have used instead of the logit link in our model. This problem only gets worse when we start thinking about models that walk and quack like a GLM but aren't really GLMs in the strict sense, but which use families that are outside the usual suspects of the exponential family of distributions.
</p>
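<p>
For reference, base R does expose some of these inverses if you know where to look, though the names are far from obvious; <code>plogis()</code> is the inverse logit, and <code>make.link()</code> will report the inverse of any of the standard links
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## a couple of inverse link functions available in base R
plogis(0)                      # the inverse logit; maps 0 on the link scale to 0.5
make.link("cloglog")$linkinv   # the inverse of the complementary log-log link
## the trick described below generalises this without needing to know any names</code></pre>
</figure>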
<p>
All is not lost, however, as there is a little trick that you can use to always get the correct inverse of the link function used in a model. (Well, <em>always</em> is a bit strong; the model needs to follow standard R conventions and accept a <code>family</code> argument and return the <code>family</code> inside the fitted model object.)
</p>
<p>
Typically in R, functions that fit generalized models take a <code>family</code> argument and return a <code>family</code> object that we can extract from the model itself. That <code>family</code> object contains all the information we need to create proper confidence intervals for GLMs and related models.
</p>
<p>
For the logistic regression model we fitted earlier, the family object is the same as that returned by <code>binomial(link = 'logit')</code>, and we can extract it directly from the model using the extractor function <code>family()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">fam</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">family</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span><span class="w">
</span><span class="n">fam</span><span class="w">
</span><span class="n">str</span><span class="p">(</span><span class="n">fam</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: binomial
Link function: logit
List of 12
$ family : chr "binomial"
$ link : chr "logit"
$ linkfun :function (mu)
$ linkinv :function (eta)
$ variance :function (mu)
$ dev.resids:function (y, mu, wt)
$ aic :function (y, n, mu, wt, dev)
$ mu.eta :function (eta)
$ initialize: expression({ if (NCOL(y) == 1) { if (is.factor(y)) y <- y != levels(y)[1L] n <- rep.int(1, nobs) y[weights =| __truncated__
$ validmu :function (mu)
$ valideta :function (eta)
$ simulate :function (object, nsim)
- attr(*, "class")= chr "family"</code></pre>
</figure>
<p>
If you look closely you'll see a component named <code>linkinv</code> which is indicated to be a function. This is the <em>inverse</em> of the link function. The link function itself is in the <code>linkfun</code> component of the family. If we extract this function and look at it
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ilink</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fam</span><span class="o">$</span><span class="n">linkinv</span><span class="w">
</span><span class="n">ilink</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">function (eta)
.Call(C_logit_linkinv, eta)
<environment: namespace:stats></code></pre>
</figure>
<p>
we see something very simple involving an argument named <code>eta</code>, which stands for the linear predictor and means we need to provide values on the link scale as they would be computed directly from the linear predictor, <span class="math inline">\(\eta\)</span> (this is the Greek letter <em>eta</em>). In this instance the function calls out to compiled C code to compute the necessary values, but others are easier to understand and use simple R code, e.g. for the log link in the <code>poisson()</code> family we have
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">poisson</span><span class="p">()</span><span class="o">$</span><span class="n">linkinv</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">function (eta)
pmax(exp(eta), .Machine$double.eps)
<environment: namespace:stats></code></pre>
</figure>
<p>
This shows that we exponentiate <code>eta</code> (which we know is the correct inverse function), and this is wrapped in <code>pmax()</code> to ensure that the function doesn't return values smaller than <code>.Machine$double.eps</code>, the smallest (positive floating point) value <span class="math inline">\(x\)</span> such that <span class="math inline">\(1 + x \neq 1\)</span>.
</p>
<p>
Now that we have a (generally) reliable way of getting the inverse of the link function used when fitting a model, we can adapt the strategy we used earlier so that we get the right (approximate) confidence interval. For this we need to
</p>
<ul>
<li>
generate fitted values and standard errors on the <em>link</em> scale, using <code>predict(...., type = 'link')</code>, which happens to be the default in general, and
</li>
<li>
compute the confidence interval using these fitted values and standard errors, and then backtransform them to the response scale using the inverse of the link function we extracted from the model.
</li>
</ul>
<p>
For the wasp visitation logistic regression model then, we can do this using the following bit of code
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## grab the inverse link function
ilink <- family(mod)$linkinv
## add fit and se.fit on the link scale
ndata <- bind_cols(ndata, setNames(as_tibble(predict(mod, ndata, se.fit = TRUE)[1:2]),
                                   c('fit_link', 'se_link')))
## create the interval and backtransform
ndata <- mutate(ndata,
                fit_resp  = ilink(fit_link),
                right_upr = ilink(fit_link + (2 * se_link)),
                right_lwr = ilink(fit_link - (2 * se_link)))
## show
ndata</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 100 x 10
leafHeight fit wrong_se wrong_upr wrong_lwr fit_link se_link
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 14 0.00341 0.00567 0.0147 -0.00792 -5.68 1.67
2 14.7 0.00370 0.00605 0.0158 -0.00840 -5.60 1.64
3 15.4 0.00401 0.00646 0.0169 -0.00891 -5.51 1.62
4 16.1 0.00435 0.00690 0.0182 -0.00945 -5.43 1.59
5 16.8 0.00472 0.00737 0.0195 -0.0100 -5.35 1.57
6 17.5 0.00512 0.00786 0.0208 -0.0106 -5.27 1.54
7 18.2 0.00555 0.00839 0.0223 -0.0112 -5.19 1.52
8 18.9 0.00602 0.00895 0.0239 -0.0119 -5.11 1.49
9 19.7 0.00653 0.00954 0.0256 -0.0125 -5.02 1.47
10 20.4 0.00708 0.0102 0.0274 -0.0133 -4.94 1.45
# ... with 90 more rows, and 3 more variables: fit_resp <dbl>,
# right_upr <dbl>, right_lwr <dbl></code></pre>
</figure>
<p>
and now we can draw this interval on our plot from before
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plt</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ndata</span><span class="p">,</span><span class="w">
</span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">right_lwr</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">right_upr</span><span class="p">),</span><span class="w">
</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/confidence-intervals-for-glms-plot-right-confidence-interva-1.png" alt="Estimated probability of visitation as a function of leaf height with a correctly-computed 95% confidence interval superimposed. Notice the interval now doesn't exceed the probability limits, 0 and 1." />
<figcaption>
Estimated probability of visitation as a function of leaf height with a correctly-computed 95% confidence interval superimposed. Notice the interval now doesn't exceed the probability limits, 0 and 1.
</figcaption>
</figure>
<p>
And now we have confidence intervals that don't exceed the physical boundaries of the response scale.
</p>
<p>
If you want different coverage for the intervals, replace the <code>2</code> in the code with some other extreme quantile of the standard normal distribution, e.g.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">qnorm</span><span class="p">(</span><span class="m">0.005</span><span class="p">,</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="c1"># for a 99% interval (0.5% in each tail)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 2.575829</code></pre>
</figure>
<p>
and if we're being picky, if you have a small sample size and fitted a Gaussian GLM, then a critical value from the <em>t</em> distribution should be used
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">qt</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df.residual</span><span class="p">(</span><span class="n">mod</span><span class="p">),</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 2.021075</code></pre>
</figure>
<p>
where I'm using the <code>df.residual()</code> extractor function to get residual degrees of freedom for the <em>t</em> distribution. This makes little sense for a logistic regression, but let's just assume <code>mod</code> is a Gaussian GLM in this instance.
</p>
<p>
There we have it: a simple way to reliably compute confidence intervals for GLMs and related models fitted via well-behaved R model-fitting functions.
</p>
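<p>
If you find yourself doing this a lot, the steps are easily packaged up into a small helper. The function below is just a sketch (<code>link_ci()</code> is not from any package); it assumes only that the model's <code>predict()</code> method accepts <code>type = "link"</code> and <code>se.fit = TRUE</code>, and that <code>family()</code> works on the fitted model, as described above
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## hypothetical helper wrapping the recipe from this post; a sketch for
## models that follow the standard R family/predict conventions
link_ci <- function(model, newdata, level = 0.95) {
    ilink <- family(model)$linkinv                       # inverse link function
    crit  <- qnorm((1 - level) / 2, lower.tail = FALSE)  # e.g. ~1.96 for 95%
    p <- predict(model, newdata = newdata, type = "link", se.fit = TRUE)
    data.frame(newdata,
               fit = ilink(p$fit),
               lwr = ilink(p$fit - crit * p$se.fit),
               upr = ilink(p$fit + crit * p$se.fit))
}
## e.g. head(link_ci(mod, ndata["leafHeight"]))</code></pre>
</figure>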
Introducing gratia
Gavin L. Simpson
2018-10-23T06:00:00-06:00
2018-10-23T06:00:00-06:00
https://www.fromthebottomoftheheap.net/2018/10/23/introducing-gratia/
<p>
I use generalized additive models (GAMs) in my research work. I use them a lot! Simon Wood's <strong>mgcv</strong> package is an excellent set of software for specifying, fitting, and visualizing GAMs for very large data sets. Despite recently dabbling with <strong>brms</strong>, <strong>mgcv</strong> is still my go-to GAM package. The only down-side to <strong>mgcv</strong> is that it is not very tidy-aware and the <strong>ggplot</strong>-verse may as well not exist as far as it is concerned. This in itself is no bad thing, though as someone who uses <strong>mgcv</strong> a lot but also prefers to do my plotting with <strong>ggplot2</strong>, this lack of awareness was starting to hurt. So, I started working on something to help bridge the gap between these two separate worlds that I inhabit. The fruit of that labour is <strong>gratia</strong>, and development has progressed to the stage where I am ready to talk a bit more about it.
</p>
<p>
<strong>gratia</strong> is an R package for working with GAMs fitted with <code>gam()</code>, <code>bam()</code> or <code>gamm()</code> from <strong>mgcv</strong>, or <code>gamm4()</code> from the <strong>gamm4</strong> package, although functionality for handling the latter is not yet implemented. <strong>gratia</strong> provides functions to replace the base-graphics-based <code>plot.gam()</code> and <code>gam.check()</code> that <strong>mgcv</strong> provides with <strong>ggplot2</strong>-based versions. Recent changes have also resulted in <strong>gratia</strong> being much more <strong>tidyverse</strong> aware and it now (mostly) returns outputs as tibbles.
</p>
<p>
In this post I wanted to give a flavour of what is currently possible with <strong>gratia</strong> and outline what still needs to be implemented.
</p>
<p>
<strong>gratia</strong> currently lives on GitHub, so we need to install it from there using <code>devtools::install_github</code>:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">devtools</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s1">'gavinsimpson/gratia'</span><span class="p">)</span></code></pre>
</figure>
<p>
To do anything useful with <strong>gratia</strong> we need a GAM and for that we need <strong>mgcv</strong>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'mgcv'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'gratia'</span><span class="p">)</span></code></pre>
</figure>
<p>
and an old favourite example data set
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">20</span><span class="p">)</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamSim</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">400</span><span class="p">,</span><span class="w"> </span><span class="n">dist</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"normal"</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<p>
The simulated data in <code>dat</code> are well-studied in GAM-related research and contain a number of covariates – labelled <code>x0</code> through <code>x3</code> – which have, to varying degrees, non-linear relationships with the response. We want to try to recover these relationships by approximating the true relationships between covariate and response using splines. To fit a purely additive model, we use
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">mod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x0</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x3</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span></code></pre>
</figure>
<p>
<strong>mgcv</strong> provides a <code>summary()</code> method that is used to extract information about the fitted GAM
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
y ~ s(x0) + s(x1) + s(x2) + s(x3)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.7625 0.0959 80.95 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x0) 3.528 4.370 11.30 4.2e-09 ***
s(x1) 2.662 3.310 129.02 < 2e-16 ***
s(x2) 8.146 8.799 84.72 < 2e-16 ***
s(x3) 1.001 1.002 0.00 0.987
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.763 Deviance explained = 77.2%
-REML = 850.87 Scale est. = 3.6785 n = 400</code></pre>
</figure>
<p>
and the <code>k.check()</code> function for checking whether sufficient numbers of basis functions were used in each smooth in the model. (You may not have used <code>k.check()</code> directly – it is called by <code>gam.check()</code>, which prints out other diagnostics and also produces four model diagnostic plots, which is one thing that <strong>gratia</strong> provides a replacement for.)
</p>
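<p>
Using it is as simple as passing the fitted model; the printed output (not shown here) lists, for each smooth, the basis dimension used, the effective degrees of freedom, and a test statistic indicating whether <code>k</code> may have been set too low
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## check whether the basis dimension k was large enough for each smooth
k.check(mod)</code></pre>
</figure>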
<h3 id="plotting-smooths">
Plotting smooths
</h3>
<p>
To visualize estimated GAMs, <strong>mgcv</strong> provides the <code>plot.gam()</code> method and the <code>vis.gam()</code> function. <strong>gratia</strong> currently provides a <strong>ggplot2</strong>-based replacement for <code>plot.gam()</code>. Work is on-going to provide <code>vis.gam()</code>-like functionality within <strong>gratia</strong> – see <code>?gratia::data_slice</code> for early work in that direction. In <strong>gratia</strong>, we use the <code>draw()</code> generic to produce <strong>ggplot2</strong>-like plots from objects. To visualize the four estimated smooth functions in the GAM <code>mod</code>, we would use
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">draw</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/introducing-gratia-draw-mod-1.png" alt="The result of draw(mod) is a plot of each of the four smooth functions in the mod GAM." />
<figcaption>
The result of <code>draw(mod)</code> is a plot of each of the four smooth functions in the <code>mod</code> GAM.
</figcaption>
</figure>
<p>
Internally <code>draw()</code> uses the <code>plot_grid()</code> function from <strong>cowplot</strong> to draw multiple panels on the plot device, and to line up the individual plots.
</p>
<p>
There's not an awful lot more you can do with this right now, but at least the plot is reasonably pretty. <strong>gratia</strong> includes tools for working with the underlying smooths represented in <code>mod</code>, and if you wanted to extract most of the data used to build the plot you'd use the <code>evaluate_smooth()</code> function.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">evaluate_smooth</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="s2">"x1"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 100 x 5
smooth fs_variable x1 est se
<chr> <fct> <dbl> <dbl> <dbl>
1 s(x1) <NA> 0.000565 -2.75 0.294
2 s(x1) <NA> 0.0106 -2.72 0.277
3 s(x1) <NA> 0.0207 -2.68 0.261
4 s(x1) <NA> 0.0308 -2.64 0.245
5 s(x1) <NA> 0.0409 -2.60 0.230
6 s(x1) <NA> 0.0510 -2.56 0.217
7 s(x1) <NA> 0.0610 -2.52 0.204
8 s(x1) <NA> 0.0711 -2.48 0.193
9 s(x1) <NA> 0.0812 -2.44 0.183
10 s(x1) <NA> 0.0913 -2.40 0.173
# ... with 90 more rows</code></pre>
</figure>
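<p>
Because <code>evaluate_smooth()</code> returns a tibble with <code>est</code> and <code>se</code> columns, rolling your own plot is straightforward; a minimal sketch (assuming <strong>ggplot2</strong> is loaded) using an approximate interval of plus/minus two standard errors would be
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## evaluate the smooth of x1 and plot it by hand with ggplot2
sm <- evaluate_smooth(mod, "x1")
ggplot(sm, aes(x = x1, y = est)) +
    geom_ribbon(aes(ymin = est - 2 * se, ymax = est + 2 * se), alpha = 0.2) +
    geom_line() +
    labs(y = "Effect of x1")</code></pre>
</figure>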
<h3 id="producing-diagnostic-plots">
Producing diagnostic plots
</h3>
<p>
The diagnostic plots currently produced by <code>gam.check()</code> can also be produced using <strong>gratia</strong>, with the <code>appraise()</code> function
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">appraise</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/introducing-gratia-appraise-mod-1.png" alt="The result of appraise(mod) is an array of four diagnostics plots, including a Q-Q plot (top left) and histogram (bottom left) of model residuals, a plot of residuals vs the linear predictor (top right), and a plot of observed vs fitted values." />
<figcaption>
The result of <code>appraise(mod)</code> is an array of four diagnostics plots, including a Q-Q plot (top left) and histogram (bottom left) of model residuals, a plot of residuals vs the linear predictor (top right), and a plot of observed vs fitted values.
</figcaption>
</figure>
<p>
Each of the four plots is produced via a user-accessible function that implements a specific plot. For example, <code>qq_plot(mod)</code> produces the Q-Q plot in the upper left of the figure above, and the <code>qq_plot.gam()</code> method reproduces most of the functionality of <code>mgcv::qq.gam()</code>, including the direct randomization procedure (<code>method = 'direct'</code>, as shown above) and the data simulation procedure (<code>method = 'simulate'</code>) to generate reference quantiles, which typically have better performance for GLM-like models <span class="citation" data-cites="Augustin2012-sc">(Augustin et al., 2012)</span>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">qq_plot</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'simulate'</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/introducing-gratia-qq-plot-mod-1.png" alt="The result of qq_plot(mod, method = 'simulate', fig.width = 6, fig.height = 4) is a Q-Q plot of residuals, where the reference quantiles are derived by simulating data from the fitted model." />
<figcaption>
The result of <code>qq_plot(mod, method = 'simulate', fig.width = 6, fig.height = 4)</code> is a Q-Q plot of residuals, where the reference quantiles are derived by simulating data from the fitted model.
</figcaption>
</figure>
<p>
<code>draw()</code> can also handle many of the more specialized smoothers currently available in <strong>mgcv</strong>. For example, 2D smoothers are represented as <code>geom_raster()</code> surfaces with contours
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamSim</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4000</span><span class="p">,</span><span class="w"> </span><span class="n">dist</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"normal"</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">mod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">data</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span><span class="w">
</span><span class="n">draw</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/introducing-gratia-draw-2d-mod-1.png" alt="The default way a 2D smoother is plotted using draw()." />
<figcaption>
The default way a 2D smoother is plotted using <code>draw()</code>.
</figcaption>
</figure>
<p>
and factor-smooth-interaction terms, which are the equivalent of random slopes and intercepts for splines, are drawn on a single panel and colour is used to distinguish the different random smooths
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## simulate example... from ?mgcv::factor.smooth.interaction</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="c1">## simulate data...</span><span class="w">
</span><span class="n">f0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">sin</span><span class="p">(</span><span class="nb">pi</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="n">f1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="o">=</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="o">=</span><span class="m">-1</span><span class="p">)</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">a</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="o">+</span><span class="n">b</span><span class="w">
</span><span class="n">f2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="m">0.2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="m">11</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">10</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">x</span><span class="p">))</span><span class="o">^</span><span class="m">6</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">10</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">10</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="o">^</span><span class="m">3</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="o">^</span><span class="m">10</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">500</span><span class="w">
</span><span class="n">nf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="n">fac</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">nf</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="o">=</span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">x0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">x1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">x2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">nf</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">.2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">2</span><span class="p">;</span><span class="w">
</span><span class="n">b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">nf</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">.5</span><span class="w">
</span><span class="n">f</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">f0</span><span class="p">(</span><span class="n">x0</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">f1</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="p">[</span><span class="n">fac</span><span class="p">],</span><span class="w"> </span><span class="n">b</span><span class="p">[</span><span class="n">fac</span><span class="p">])</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">f2</span><span class="p">(</span><span class="n">x2</span><span class="p">)</span><span class="w">
</span><span class="n">fac</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">fac</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">f</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x0</span><span class="p">,</span><span class="w"> </span><span class="n">x1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">x2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x2</span><span class="p">,</span><span class="w"> </span><span class="n">fac</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fac</span><span class="p">)</span><span class="w">
</span><span class="n">mod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="o">~</span><span class="n">s</span><span class="p">(</span><span class="n">x0</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">fac</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="o">=</span><span class="s2">"fs"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="o">=</span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x2</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="o">=</span><span class="m">20</span><span class="p">),</span><span class="w">
</span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ML"</span><span class="p">)</span><span class="w">
</span><span class="n">draw</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/introducing-gratia-draw-fs-mod-1.png" alt="The result of draw(mod) for a more complex GAM containing a factor-smooth-interaction term with bs = 'fs'." />
<figcaption>
The result of <code>draw(mod)</code> for a more complex GAM containing a factor-smooth-interaction term with <code>bs = 'fs'</code>.
</figcaption>
</figure>
<h3 id="what-else-can-gratia-do">
What else can gratia do?
</h3>
<p>
Although still quite early in the planned development cycle, <strong>gratia</strong> can handle most of the smooths that <strong>mgcv</strong> can estimate, including <code>by</code> variable smooths with factor and continuous <code>by</code> variables, random effect smooths (<code>bs = 're'</code>), 2D tensor product smooths, and models with parametric terms.
</p>
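<p>
As a small, hedged sketch of the random effect smooth support ā this example is not from the post itself, and simulates grouped data with <strong>mgcv</strong>ās <code>gamSim()</code>:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library('mgcv')
library('gratia')

## simulate grouped data; gamSim(4) returns a data frame containing a factor `fac`
set.seed(1)
dat <- gamSim(4, n = 400, verbose = FALSE)

## a smooth of x2 plus a random intercept for each level of fac
m_re <- gam(y ~ s(x2) + s(fac, bs = 're'), data = dat, method = 'REML')

## draw() renders the random effect term as a QQ plot of the estimated effects
draw(m_re)</code></pre>
</figure>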
<p>
Smoothers that <strong>gratia</strong> canāt do anything with as yet are Markov random fields (MRFs; <code>bs = 'mrf'</code>), splines on the sphere (SoSs; <code>bs = 'sos'</code>), soap film smoothers (<code>bs = 'so'</code>), and linear functional models with matrix terms.
</p>
<p>
The package also includes functions for
</p>
<ul>
<li>
calculating across-the-function and simultaneous confidence intervals for smooths via <code>confint()</code> methods, and
</li>
<li>
calculating first and second derivatives of (currently only univariate) smooths using finite differences. <code>fderiv()</code> is the old home for first derivatives of GAM smooths, whilst the new <code>derivatives()</code> function can calculate first and second derivatives using forward (as <code>fderiv()</code> does), backward, or central finite differences (a short sketch follows this list).
</li>
</ul>
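<p>
As a brief sketch of both of these, reusing the model <code>mod</code> fitted above ā the argument names here are those in <strong>gratia</strong> at the time of writing and may change:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## across-the-function 95% confidence interval for one smooth
ci <- confint(mod, parm = 's(x2)', type = 'confidence')

## simultaneous interval for the same smooth
si <- confint(mod, parm = 's(x2)', type = 'simultaneous')

## first derivative of the smooth via central finite differences
fd <- derivatives(mod, term = 's(x2)', type = 'central')</code></pre>
</figure>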
<p>
There are also a lot of exported functions that make it easier to work with GAMs fitted by <strong>mgcv</strong> and to extract aspects of the fitted model and its smooths. The exact functionality is still being worked on, so be prepared for some of the functions to come and go or change name as I work through ideas and implementations and settle on the interface for the tools that <strong>gratia</strong> will provide for this.
</p>
<h3 id="what-cant-gratia-do">
What canāt gratia do?
</h3>
<p>
Iāve already covered where <strong>gratia</strong> is currently lacking with respect to the types of smoother that <strong>mgcv</strong> can fit. It is also currently lacking in tools for exploring models in more detail, such as the plots of model predictions over slices of covariate space that <code>vis.gam()</code> can produce (though see <code>gratia::data_slice()</code> for functions to create the data needed for such plots). Nor can <strong>gratia</strong> currently handle smooths of more than two dimensions. Iād like to add this capability soon, as it will make visualizing GAMs fitted to spatio-temporal data much easier than it currently is.
</p>
<h3 id="the-future">
The future?
</h3>
<p>
Longer term, I plan to fill out the types of smoother that <strong>gratia</strong> can handle to cover all the types that <strong>mgcv</strong> can fit, and to add <code>vis.gam()</code>-like functionality and the ability to handle higher-dimensional smooths (<code>plot.gam()</code> can now handle 3- or 4-dimensional smooths).
</p>
<p>
The ultimate goal of course is to just have <code>draw()</code> work for whatever GAM model you throw at it, and at least have feature parity with <code>plot.gam()</code> and <code>vis.gam()</code>.
</p>
<p>
As is to be expected for such an early release, there is a lot of stabilization to function names and arguments that needs to happen in <strong>gratia</strong>, and a lot of documentation to be written, including some vignettes. For now, the best way to understand what <strong>gratia</strong> is doing or how it works is to look at the examples on the <strong>gratia</strong> <a href="https://gavinsimpson.github.io/gratia/">website</a> (built using <strong>pkgdown</strong>) and take a look at the <a href="https://github.com/gavinsimpson/gratia/tree/master/tests/testthat">package tests</a> which contain lots of examples of GAM fits and the code to work with them.
</p>
<p>
Iām very much interested in user feedback, so please do let me know if you have any suggestions for additions or improvements to <strong>gratia</strong>, and if you do use <strong>gratia</strong> and find bugs in the package or GAMs that <strong>gratia</strong> canāt handle I would love to hear from you. You can get in touch via the comments below, or via <a href="https://github.com/gavinsimpson/gratia/issues">GitHub Issues</a>.
</p>
<p>
I would also be remiss if I did not mention Matteo Fasioloās excellent <a href="https://mfasiolo.github.io/mgcViz/"><strong>mgcViz</strong> package</a>, which already has extensive capabilities for exploring GAM fits, including some very interesting approaches to handling models fitted to millions of data points or more, which pose real data-visualization challenges.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Augustin2012-sc">
<p>
Augustin, N. H., Sauleau, E.-A., and Wood, S. N. (2012). On quantile quantile plots for generalized linear models. <em>Computational Statistics & Data Analysis</em> 56, 2404ā2409. doi:<a href="https://doi.org/10.1016/j.csda.2012.01.026">10.1016/j.csda.2012.01.026</a>.
</p>
</div>
</div>
Controls on subannual variation in pCO<sub>2</sub> in productive hardwater lakes
Gavin L. Simpson
2018-10-15T11:00:00-06:00
2018-10-15T11:00:00-06:00
https://www.fromthebottomoftheheap.net/2018/10/15/wiik-jgr-co2-paper/
<p>
This year is looking like a bumper year for papers from the lab and collaborations, past and ongoing. Over the <a href="/2018/10/15/summer-hiatus/">summer hiatus</a> three papers came out online in their version-of-record form. The first of these was a paper on work that Emma Wiik, a former postdoc in my lab and Peter Leavittās lab, conducted to further our research on the controls on CO<sub>2</sub> exchange between lakes and the atmosphere.
</p>
<p>
Lakes play an important role in processing terrestrial carbon and influence carbon fluxes at the global scale. Unpacking the detail of the respective controls on CO<sub>2</sub> exchange with the atmosphere is an active and productive area of limnological research. In 2015, we published <span class="citation" data-cites="Finlay2015-bw">(Finlay et al., 2015)</span> an analysis of time series data of CO<sub>2</sub> flux from hardwater prairie lakes, which showed that as these lakes warmed due to climate change, the efflux of CO<sub>2</sub> from the lakes actually decreased. This result was contrary to those observed in northern Boreal lakes, and reflects the need to study a range of lake types when generalizing from individual research projects to global scale assessments of the role of lakes in the carbon cycle.
</p>
<p>
Emmaās paper <span class="citation" data-cites="Wiik2018-ve">(Wiik et al., 2018)</span>, which was published in <a href="https://doi.org/10.1029/2018JG004506">Journal of Geophysical Research: Biogeosciences</a> in May, took a closer look than the 2015 paper at the controls on CO<sub>2</sub> exchange. Across the six QuāAppelle lakes in the 2015 study, weād focused on trends in pH and CO<sub>2</sub> flux and the control of annual CO<sub>2</sub> flux by ice-cover duration, yielding results that spoke to the multi-annual to decadal scale relationships between CO<sub>2</sub> exchange and the important drivers. In the new paper, we used generalized additive models (GAMs) to model the full 18-year time series of limnological data.
</p>
<p>
Two GAMs were fitted and described in the paper. The first modelled CO<sub>2</sub> flux as a smooth function of lake pH over all six lakes, allowing for lake-specific effects of pH on CO<sub>2</sub> as well as accounting for change over time. Our CO<sub>2</sub> data were not directly measured, instead being calculated from geochemical equations, including pH. Hence this first model was simply to quantify how much of the variation in CO<sub>2</sub> we could explain using pH. As the latter was used to calculate the former, the explained variation was high, but never equal to 1.
</p>
<p>
Having established that pH was the primary control on CO<sub>2</sub> exchange in the six study lakes, we wanted to try to model the lake water pH observations using a series of selected climatic and metabolic variables, chosen to reflect the major factors thought to control CO<sub>2</sub> exchange. A second GAM was fitted with pH as the response variable and lake-specific smooth functions of the metabolic and climatic variables.
</p>
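<p>
Purely as an illustrative sketch of the kind of model this describes ā the data frame and covariate names below are hypothetical, not those used in the paper ā the second GAM had the general form
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## lake-specific smooths of metabolic and climatic covariates plus a
## parametric lake effect (hypothetical data frame and variable names)
m_ph <- gam(pH ~ lake +
                s(chla, by = lake) + s(o2, by = lake) +    # metabolic
                s(temp, by = lake) + s(wind, by = lake),   # climatic
            data = quappelle, method = 'REML')</code></pre>
</figure>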
<p>
Through the second GAM, we were able to show that, in the six QuāAppelle study lakes, metabolic drivers of CO<sub>2</sub> flux were more important at the dailyāmonthly scale than climatic drivers, while the latter were more important at the interannual scale.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/wiik-et-al-2018-figure-4.png" title="Figure 4 from Wiik et al 2018." alt="Figure 4 from the paper. (aāc) GAM partial effect splines for significant metabolic variables. Dotted lines: means of y and x; Shaded area: middle 90% of all observations. Rug: data points. (a) GAM splines for chlorophyll a, with lakes with significantly different splines to the global spline indicated by color/hue and linetype. (b) GAM spline of oxygen, with standard errors indicated by shading. (c) GAM spline of dissolved organic carbon, with standard errors indicated by shading." />
<figcaption>
Figure 4 from the paper. (aāc) GAM partial effect splines for significant metabolic variables. Dotted lines: means of <span class="math inline">(y)</span> and <span class="math inline">(x)</span>; Shaded area: middle 90% of all observations. Rug: data points. (a) GAM splines for chlorophyll a, with lakes with significantly different splines to the global spline indicated by color/hue and linetype. (b) GAM spline of oxygen, with standard errors indicated by shading. (c) GAM spline of dissolved organic carbon, with standard errors indicated by shading.
</figcaption>
</figure>
<p>
The paper is available from the <a href="https://doi.org/10.1029/2018JG004506">journal website</a> or via a <a href="/assets/reprints/wiik-2018-jgr-b-co2-preprint.pdf">preprint</a> if you do not have access to Journal of Geophysical Research: Biogeosciences.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Finlay2015-bw">
<p>
Finlay, K., Vogt, R. J., Bogard, M. J., Wissel, B., Tutolo, B. M., Simpson, G. L., et al. (2015). Decrease in CO<sub>2</sub> efflux from northern hardwater lakes with increasing atmospheric warming. <em>Nature</em> 519, 215ā218. doi:<a href="https://doi.org/10.1038/nature14172">10.1038/nature14172</a>.
</p>
</div>
<div id="ref-Wiik2018-ve">
<p>
Wiik, E., Haig, H. A., Hayes, N. M., Finlay, K., Simpson, G. L., Vogt, R. J., et al. (2018). Generalized additive models of climatic and metabolic controls of subannual variation in pCO<sub>2</sub> in productive hardwater lakes. <em>Journal of Geophysical Research: Biogeosciences</em> 123, 1940ā1959. doi:<a href="https://doi.org/10.1029/2018JG004506">10.1029/2018JG004506</a>.
</p>
</div>
</div>
Summer hiatus
Gavin L. Simpson
2018-10-15T07:00:00-06:00
2018-10-15T07:00:00-06:00
https://www.fromthebottomoftheheap.net/2018/10/15/summer-hiatus/
<p>
Itās been quite some time since I last posted anything here. Mostly this was due to a very busy schedule since May that included teaching an online stats course, attending & presenting at three conferences, giving workshops at two of those conferences, and taking some well-earned vacation in Europe. Summer was also a busy time for manuscripts moving through the pipeline to being accepted and published. One thing I had hoped to do with the blog this year was publicize some of the work I do a little more. So, as normal service resumes here I hope to post some short pieces highlighting new papers that came out over the summer, and a few of these will be coming out over the next week or two.
</p>
<p>
One of the reasons for having this blog in the first place was to get me back into āwriting modeā; I find it difficult at times, especially when the to-do list is long, to force myself to carve out time to both think <em>and</em> write. And as I get more and more out of practice writing, it takes more and more time to start or pick up work on manuscripts describing new results, and the words donāt flow easily at all. I find it much easier to write when I am towards the end of a writing period because Iāve literally forced myself to write. And, whilst blog posts arenāt the same kind of writing as for manuscripts, I hope that by just doing a little writing each week, itāll be that bit easier to pick up work on a languishing manuscript or start something new.
</p>
<p>
Letās see how I get onā¦
</p>
Fitting GAMs with brms: part 1
Gavin L. Simpson
2018-04-21T04:00:00-06:00
2018-04-21T04:00:00-06:00
https://www.fromthebottomoftheheap.net/2018/04/21/fitting-gams-with-brms/
<p>
Regular readers will know that I have a somewhat unhealthy relationship with GAMs and the <strong>mgcv</strong> package. I use these models all the time in my research but recently weāve been hitting the limits of the range of models that <strong>mgcv</strong> can fit. So Iāve been looking into alternative ways to fit the GAMs I want to fit but which can handle the kinds of data or distributions that have been cropping up in our work. The <strong>brms</strong> package <span class="citation" data-cites="brms-2017">(Bürkner, 2017)</span> is an excellent resource for modellers, providing a high-level R front end to a vast array of model types, all fitted using <a href="http://mc-stan.org">Stan</a>. <strong>brms</strong> is the perfect package to go beyond the limits of <strong>mgcv</strong> because <strong>brms</strong> even uses the smooth functions provided by <strong>mgcv</strong>, making the transition easier. In this post I take a look at how to fit a simple GAM in <strong>brms</strong> and compare it with the same model fitted using <strong>mgcv</strong>.
</p>
<p>
In this post weāll use the following packages. If you donāt know <strong>schoenberg</strong>, itās a package Iām writing to provide <code>ggplot</code> versions of plots that can be produced by <strong>mgcv</strong> from fitted GAM objects. <strong>schoenberg</strong> is in early development, but it currently works well enough to plot the models we fit here. If youāve never come across this package before, you can install it from GitHub using <code>devtools::install_github('gavinsimpson/schoenberg')</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## packages</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'mgcv'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'brms'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'ggplot2'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'schoenberg'</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span></code></pre>
</figure>
<p>
To illustrate <strong>brms</strong>ās GAM-fitting chops, weāll use the <code>mcycle</code> data set that comes with the <strong>MASS</strong> package. It contains a set of measurements of the acceleration force on a riderās head during a simulated motorcycle collision and the time, in milliseconds, post collision. The data are loaded using <code>data()</code> and we take a look at the first few rows
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## load the example data mcycle</span><span class="w">
</span><span class="n">data</span><span class="p">(</span><span class="n">mcycle</span><span class="p">,</span><span class="w"> </span><span class="n">package</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'MASS'</span><span class="p">)</span><span class="w">
</span><span class="c1">## show data</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">mcycle</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> times accel
1 2.4 0.0
2 2.6 -1.3
3 3.2 -2.7
4 3.6 0.0
5 4.0 -2.7
6 6.2 -2.7</code></pre>
</figure>
<p>
The aim is to model the acceleration force (<code>accel</code>) as a function of time post collision (<code>times</code>). The plot below shows the data.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mcycle</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">times</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">accel</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Miliseconds post impact"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Acceleration (g)"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Simulated Motorcycle Accident"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Measurements of head acceleration"</span><span class="p">)</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/fitting-gams-with-brms-plot-data-1.png" />
</p>
<p>
Weāll model acceleration as a <em>smooth</em> function of time using a GAM and the default thin plate regression spline basis. This can be done using the <code>gam()</code> function in <strong>mgcv</strong> and, for comparison with the fully Bayesian model weāll fit shortly, we use <code>method = "REML"</code> to estimate the smoothness parameter for the spline in mixed model form using REML
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">accel</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">times</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mcycle</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">m1</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
accel ~ s(times)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -25.546 1.951 -13.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(times) 8.625 8.958 53.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.783 Deviance explained = 79.7%
-REML = 616.14 Scale est. = 506.35 n = 133</code></pre>
</figure>
<p>
As we can see from the model summary, the estimated smooth uses about 8.6 effective degrees of freedom and, in the test of zero effect, the null hypothesis is strongly rejected. The fitted spline explains about 80% of the variance or deviance in the data.
</p>
<p>
To plot the fitted smooth we could use the <code>plot()</code> method provided by <strong>mgcv</strong>, but this uses base graphics. Instead we can use the <code>draw()</code> method from <strong>schoenberg</strong>, which can currently handle most of the univariate smooths in <strong>mgcv</strong> plus 2-d tensor product smooths
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">draw</span><span class="p">(</span><span class="n">m1</span><span class="p">)</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/fitting-gams-with-brms-plot-mgcv-model-1.png" />
</p>
<p>
The equivalent model can be estimated using a fully Bayesian approach via the <code>brm()</code> function in the <strong>brms</strong> package. In fact, <code>brm()</code> will use the smooth specification functions from <strong>mgcv</strong>, making our lives much easier. The major difference, though, is that you canāt use <code>te()</code> or <code>ti()</code> smooths in <code>brm()</code> models; you need to use <code>t2()</code> tensor product smooths instead. This is because the smooths in the model are going to be treated as random effects and the model estimated as a GLMM, which exploits the duality of splines as random effects. In this representation, the wiggly parts of the spline basis are treated as a random effect and their associated variance parameter controls the degree of wiggliness of the fitted spline. The perfectly smooth parts of the basis are treated as a fixed effect. In this form, the GAM can be estimated using standard GLMM software; itās what allows the <code>gamm4()</code> function to fit GAMMs using the <strong>lme4</strong> package, for example. This is also the reason why we canāt use <code>te()</code> or <code>ti()</code> smooths; those smooths do not have nicely separable penalties, which means they canāt be written in the form required for fitting with typical mixed model software.
</p>
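<p>
For example, where you might have used <code>te(x, z)</code> with <code>gam()</code>, a <code>brm()</code> model needs <code>t2()</code>; a sketch with a hypothetical data frame <code>dat</code> and covariates <code>x</code> and <code>z</code>:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## t2() tensor products have separable penalties, so they can be recast in
## the random effect form that brm() requires; te(x, z) would not work here
m_tp <- brm(bf(y ~ t2(x, z)), data = dat, family = gaussian(),
            cores = 4, seed = 17)</code></pre>
</figure>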
<p>
The <code>brm()</code> version of the GAM is fitted using the code below. Note that I have changed a few things from their default values:
</p>
<ol type="1">
<li>
the model required more than the default number of MCMC samples ā <code>iter = 4000</code>,
</li>
<li>
the samples needed thinning to deal with some strong autocorrelation in the Markov chains ā <code>thin = 10</code>,
</li>
<li>
the <code>adapt_delta</code> parameter, a tuning parameter in the NUTS sampler for Hamiltonian Monte Carlo, potentially needed raising ā there was a warning about a potential divergent transition, but rather than check whether it really was one, I just increased the tuning parameter to <code>0.99</code>,
</li>
<li>
four chains are fitted by default, but I wanted them run on 4 CPU <code>cores</code>,
</li>
<li>
<code>seed</code> sets the internal random number generator seed, which allows reproducibility of models, and
</li>
<li>
for this post I didnāt want to print out the progress of the sampler ā <code>refresh = 0</code> ā but typically you <em>will</em> want to see how sampling is progressing, so you wonāt normally set this.
</li>
</ol>
<p>
The rest of the model is pretty similar to the <code>gam()</code> version we fitted earlier. The main difference is that I use the <code>bf()</code> function to create a special <strong>brms</strong> formula specifying the model. You donāt actually need to do this for such a simple model, but in a later post weāll use this to fit distributional GAMs. Note that Iām leaving all the priors in the model at the default values. Iāll look at defining priors in a later post; for now Iām just going to use the default priors that <code>brm()</code> uses
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">brm</span><span class="p">(</span><span class="n">bf</span><span class="p">(</span><span class="n">accel</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">times</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mcycle</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gaussian</span><span class="p">(),</span><span class="w"> </span><span class="n">cores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">17</span><span class="p">,</span><span class="w">
</span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4000</span><span class="p">,</span><span class="w"> </span><span class="n">warmup</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="n">thin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w">
</span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">adapt_delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.99</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Compiling the C++ model</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Start sampling</code></pre>
</figure>
<p>
Once the model has finished compiling and sampling we can output the model summary
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m2</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> Family: gaussian
Links: mu = identity; sigma = identity
Formula: accel ~ s(times)
Data: mcycle (Number of observations: 133)
Samples: 4 chains, each with iter = 4000; warmup = 1000; thin = 10;
total post-warmup samples = 1200
ICs: LOO = NA; WAIC = NA; R2 = NA
Smooth Terms:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sds(stimes_1) 722.44 198.12 450.17 1150.27 1180 1.00
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept -25.54 2.02 -29.66 -21.50 1200 1.00
stimes_1 16.10 38.20 -61.46 90.91 1171 1.00
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma 22.78 1.47 19.94 25.68 1200 1.00
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
is a crude measure of effective sample size, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).</code></pre>
</figure>
<p>
This outputs details of the fitted model plus parameter estimates (as posterior means), estimation errors (posterior standard deviations), (by default) 95% credible intervals, and two other diagnostics:
</p>
<ol type="1">
<li>
<code>Eff.Sample</code> is the effective sample size of the posterior samples in the model, and
</li>
<li>
<code>Rhat</code> is the <em>potential scale reduction factor</em> or Gelman-Rubin diagnostic and is a measure of how well the chains have converged and ideally should be equal to <code>1</code>.
</li>
</ol>
<p>
The summary includes two entries for the smooth of <code>times</code>:
</p>
<ol type="1">
<li>
<code>sds(stimes_1)</code> is the standard deviation parameter, which has the effect of controlling the wiggliness of the smooth ā the larger this value, the more wiggly the smooth. We can see that the credible interval doesnāt include 0, so there is evidence that a smooth is required over and above a linear parametric effect of <code>times</code>, details of which are given next,
</li>
<li>
<code>stimes_1</code> is the fixed effect part of the spline, which is the linear function that is perfectly smooth.
</li>
</ol>
<p>
The final parameter table gives the estimate of <code>sigma</code>, the standard deviation of the data about the conditional mean of the response.
</p>
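<p>
If you want to work with the posterior for that smoothness-related standard deviation directly, something like the following should work; <code>posterior_samples()</code> was the extractor at the time of writing (newer versions of <strong>brms</strong> prefer <code>as_draws_df()</code>):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## posterior draws for the standard deviation of the wiggly part of s(times)
sds_post <- posterior_samples(m2, pars = 'sds_stimes_1')
summary(sds_post)</code></pre>
</figure>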
<p>
How does this model compare with the one fitted using <code>gam()</code>? We can use the <code>gam.vcomp()</code> function to compute the variance component representation of the smooth estimated via <code>gam()</code>. To make it comparable with the value shown for the <strong>brms</strong> model, we donāt undo the rescaling of the penalty matrix that <code>gam()</code> performs to help with numeric stability during model fitting.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">gam.vcomp</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">rescale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Standard deviations and 0.95 confidence intervals:
std.dev lower upper
s(times) 807.88726 480.66162 1357.88215
scale 22.50229 19.85734 25.49954
Rank: 2/2</code></pre>
</figure>
<p>
This gives an estimated standard deviation of 807.89 with a 95% confidence interval of 480.66ā1357.88, which compares well with the posterior mean and credible interval from the <code>brm()</code> version: 722.44 (450.17ā1150.27).
</p>
<p>
The <code>marginal_smooths()</code> function is used to extract the marginal effect of the spline.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">msms</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">marginal_smooths</span><span class="p">(</span><span class="n">m2</span><span class="p">)</span></code></pre>
</figure>
<p>
This function extracts enough information about the estimated spline to plot it using the <code>plot()</code> method
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">msms</span><span class="p">)</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/fitting-gams-with-brms-plot-marginal-smooths-1.png" />
</p>
<p>
Given the similarity in the variance components of the two models, it is not surprising that the two estimated smooths also look similar. The <code>marginal_smooths()</code> function is effectively the equivalent of the <code>plot()</code> method for <strong>mgcv</strong>-based GAMs.
</p>
<p>
Thereās a lot that we can and should do to check the model fit. For now, weāll look at two posterior predictive check plots that <strong>brms</strong>, via the <strong>bayesplot</strong> package <span class="citation" data-cites="bayesplot-2018">(Gabry and Mahr, 2018)</span>, makes very easy to produce using the <code>pp_check()</code> function.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pp_check</span><span class="p">(</span><span class="n">m2</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Using 10 posterior samples for ppc type 'dens_overlay' by default.</code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/fitting-gams-with-brms-pp-check-density-1.png" />
</p>
<p>
The default produces a density plot overlay of the original response values (the thick black line) with 10 draws from the posterior predictive distribution of the model. If the model is a good fit to the data, data simulated from it at the observed values of the covariate(s) should look similar to the observed data.
</p>
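<p>
The number of posterior draws used in the overlay is adjustable; at the time of writing the argument was <code>nsamples</code> (more recent <strong>brms</strong> releases call it <code>ndraws</code>):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## use 50 posterior draws for a denser picture of the predictive distribution
pp_check(m2, nsamples = 50)</code></pre>
</figure>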
<p>
Another type of posterior predictive check plot is the empirical cumulative distribution function of the observations and random draws from the model posterior, which we can produce with <code>type = "ecdf_overlay"</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pp_check</span><span class="p">(</span><span class="n">m2</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ecdf_overlay"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Using 10 posterior samples for ppc type 'ecdf_overlay' by default.</code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/fitting-gams-with-brms-pp-check-ecdf-1.png" />
</p>
<p>
Both plots show significant deviations between the posterior simulations and the observed data. The poor posterior predictive check results are in large part due to the non-constant variance of the acceleration data conditional upon the covariate. Both models assumed that the observations are distributed Gaussian, with means equal to the fitted values (the estimated expectation of the response) and a common variance <span class="math inline">(\sigma^2)</span>. The observations appear to have different variances, which we could model with a distributional model, in which all parameters of the distribution of the response can be modelled with their own linear predictors. Weāll take a look at these models in a future post.
</p>
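<p>
Although the details are for that future post, a minimal sketch of such a distributional model in <strong>brms</strong> just adds a linear predictor for <code>sigma</code> to the <code>bf()</code> formula:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## model both the mean and the standard deviation of accel as smooth
## functions of time; sampler settings as before
m3 <- brm(bf(accel ~ s(times), sigma ~ s(times)),
          data = mcycle, family = gaussian(), cores = 4, seed = 17,
          iter = 4000, warmup = 1000, thin = 10, refresh = 0,
          control = list(adapt_delta = 0.99))</code></pre>
</figure>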
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-brms-2017">
<p>
Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. <em>Journal of Statistical Software</em> 80, 1ā28. doi:<a href="https://doi.org/10.18637/jss.v080.i01">10.18637/jss.v080.i01</a>.
</p>
</div>
<div id="ref-bayesplot-2018">
<p>
Gabry, J., and Mahr, T. (2018). <em>Bayesplot: Plotting for bayesian models</em>. Available at: <a href="https://CRAN.R-project.org/package=bayesplot">https://CRAN.R-project.org/package=bayesplot</a>.
</p>
</div>
</div>
Comparing smooths in factor-smooth interactions II
Gavin L. Simpson
2017-12-14T10:00:00-06:00
2017-12-14T10:00:00-06:00
https://www.fromthebottomoftheheap.net/2017/12/14/difference-splines-ii/
<p>
In a <a href="https://www.fromthebottomoftheheap.net/2017/10/10/difference-splines-i/">previous post</a> I looked at an approach for computing the differences between smooths estimated as part of a factor-smooth interaction using <code>s()</code>ās <code>by</code> argument. When a common-or-garden factor variable is passed to <code>by</code>, <code>gam()</code> estimates a separate smooth for each <em>level</em> of the <code>by</code> factor. Using the <span class="math inline">(X_p)</span> matrix approach, we previously saw that we can post-process the model to generate estimates for pairwise differences of smooths. However, the <code>by</code> variable approach of estimating a separate smooth for each level of the factor may be quite inefficient in terms of degrees of freedom used by the model. This is especially so in situations where the estimated curves are quite similar but wiggly; why estimate many separate wiggly smooths when one, plus some simple difference smooths, will do the job just as well? In this post I look at an alternative to estimating separate smooths, using an <em>ordered</em> factor for the <code>by</code> variable.
</p>
<p>
When an <em>ordered</em> factor is passed to <code>by</code>, <strong>mgcv</strong> does something quite different to the model I described previously, although the end results should be similar. What <strong>mgcv</strong> does in the <em>ordered</em> factor case is to fit <span class="math inline">(L-1)</span> <em>difference smooths</em>, where <span class="math inline">(l = 1, \dots, L)</span> are the levels of the factor and <span class="math inline">(L)</span> the number of levels. These smooths model the difference between the smooth estimated for the reference level and the <span class="math inline">(l)</span>th level of the factor. Additionally, the <code>by</code> variable smooth doesnāt itself estimate the smoother for the reference level, so we are required to add a second smooth to the model that estimates that particular smooth.
</p>
<p>
In pseudo code our model would be something like, for ordered factor <code>of</code>,
</p>
<div id="cb1" class="sourceCode">
<pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" title="1">model <-<span class="st"> </span><span class="kw">gam</span>(y <span class="op">~</span><span class="st"> </span>of <span class="op">+</span><span class="st"> </span><span class="kw">s</span>(x) <span class="op">+</span><span class="st"> </span><span class="kw">s</span>(x, <span class="dt">by =</span> of), <span class="dt">data =</span> df)</a></code></pre>
</div>
<p>
As with any <code>by</code> factor smooth, we are required to include a parametric term for the factor because the individual smooths are centered for identifiability reasons. The first <code>s(x)</code> in the model is the smooth effect of <code>x</code> on the <em>reference</em> level of the ordered factor <code>of</code>. The second smoother, <code>s(x, by = of)</code>, is the set of <span class="math inline">(L-1)</span> <em>difference</em> smooths, which model the smooth differences between the reference level smoother and those of the individual levels (excluding the reference one).
</p>
<p>
Note that this model still estimates a separate smoother for each level of the ordered factor; it just does it in a different way. The smoother for the reference level is estimated via the contribution from <code>s(x)</code> <em>only</em>, whilst the smoothers for the other levels are formed from the additive combination of <code>s(x)</code> and the relevant difference smoother from the set created by <code>s(x, by = of)</code>. This is analogous to the situation we have when estimating an ANOVA using the default contrasts and <code>lm()</code>; the intercept is then an estimate of the mean response for the reference level of the factor, and the remaining model coefficients estimate the <em>differences</em> between the mean response of the reference level and that of the other factor levels.
</p>
<p>
This <em>ordered-factor-smooth interaction</em> is most directly applicable to situations where you have a reference category and you are interested in differences between that category and the other levels. If you are interested in pair-wise comparisons of smooths you could use the ordered factor approach ā it may be more parsimonious than estimating separate smoothers for each level ā but you will still need to post-process the results in a manner similar to that described in the previous post<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>.
</p>
<p>
To illustrate the ordered factor difference smooths, Iāll reuse the example from the <em>Geochimica</em> <a href="http://doi.org/10.1016/j.gca.2010.12.026">paper</a> I wrote with my colleagues at UCL, Neil Rose, Handong Yang, and Simon Turner <span class="citation" data-cites="Rose2012-pl">(Rose et al., 2012)</span>, and which formed the basis for the previous post.
</p>
<p>
Neil, Handong, and Simon had collected sediment cores from several Scottish lochs and measured metal concentrations, especially of lead (Pb) and mercury (Hg), in sediment slices covering the last 200 years. The aim of the study was to investigate sediment profiles of these metals in three regions of Scotland; north east, north west, and south west. A pair of lochs in each region was selected, one in a catchment with visibly eroding peat/soil, and the other in a catchment without erosion. The different regions represented variations in historical deposition levels, whilst the hypothesis was that cores from eroded and non-eroded catchments would show differential responses to reductions in emissions of Pb and Hg to the atmosphere. The difference, it was hypothesized, was that the eroding soil acts as a secondary source of pollutants to the lake. You can read more about it in the <a href="http://doi.org/10.1016/j.gca.2010.12.026">paper</a> ā if youāre interested but donāt have access to the journal, send me an email and Iāll pass on a pdf.
</p>
<p>
Below I make use of the following packages
</p>
<ul>
<li>
<strong>readr</strong>
</li>
<li>
<strong>dplyr</strong>
</li>
<li>
<strong>ggplot2</strong>, and
</li>
<li>
<strong>mgcv</strong>
</li>
</ul>
<p>
Youāll more than likely have these installed, but if you get errors about missing packages when you run the code chunk below, install any missing packages and run the chunk again
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'readr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'ggplot2'</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'mgcv'</span><span class="p">)</span></code></pre>
</figure>
<p>
Next, load the data set and convert the <code>SiteCode</code> variable to a factor
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">uri</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'https://gist.githubusercontent.com/gavinsimpson/eb4ff24fa9924a588e6ee60dfae8746f/raw/geochimica-metals.csv'</span><span class="w">
</span><span class="n">metals</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_csv</span><span class="p">(</span><span class="n">uri</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">col_types</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'ciccd'</span><span class="p">))</span><span class="w">
</span><span class="n">metals</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">metals</span><span class="p">,</span><span class="w"> </span><span class="n">SiteCode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">SiteCode</span><span class="p">))</span></code></pre>
</figure>
<p>
This is a subset of the data used in <span class="citation" data-cites="Rose2012-pl">Rose et al. (2012)</span> ā the Hg concentrations in the sediments for just three of the lochs are included here in the interests of simplicity. The data set contains 5 variables
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">metals</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 44 x 5
SiteCode Date SoilType Region Hg
<fctr> <int> <chr> <chr> <dbl>
1 CHNA 2000 thin NW 3.843399
2 CHNA 1990 thin NW 5.424618
3 CHNA 1980 thin NW 8.819730
4 CHNA 1970 thin NW 11.417457
5 CHNA 1960 thin NW 16.513540
6 CHNA 1950 thin NW 16.512047
7 CHNA 1940 thin NW 11.188840
8 CHNA 1930 thin NW 11.622222
9 CHNA 1920 thin NW 13.645853
10 CHNA 1910 thin NW 11.181711
# ... with 34 more rows</code></pre>
</figure>
<ul>
<li>
<code>SiteCode</code> is a factor indexing the three lochs, with levels <code>CHNA</code>, <code>FION</code>, and <code>NODH</code>,
</li>
<li>
<code>Date</code> is a numeric variable of sediment age per sample,
</li>
<li>
<code>SoilType</code> and <code>Region</code> are additional factors for the (natural) experimental design, and
</li>
<li>
<code>Hg</code> is the response variable of interest, and contains the Hg concentration of each sediment sample.
</li>
</ul>
<p>
Neil gave me permission to make these data available openly should you want to try this approach out for yourself. If you make use of the data for other purposes, please cite the source publication <span class="citation" data-cites="Rose2012-pl">(Rose et al., 2012)</span> and recognize the contribution of the data creators; Handong Yang, Simon Turner, and Neil Rose.
</p>
<p>
To proceed, we need to create an ordered factor. Here Iām going to use the <code>SoilType</code> variable, as it is easier to relate to the condition of the soil than the site codes I used in the previous post. I set the <code>non-eroded</code> level to be the reference, and as such the GAM will estimate a full smooth for that level plus smooth differences between the <code>non-eroded</code> level and each of the <code>eroded</code> and <code>thin</code> levels.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">metals</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">metals</span><span class="p">,</span><span class="w">
</span><span class="n">oSoilType</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ordered</span><span class="p">(</span><span class="n">SoilType</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'non-eroded'</span><span class="p">,</span><span class="s1">'eroded'</span><span class="p">,</span><span class="s1">'thin'</span><span class="p">)))</span></code></pre>
</figure>
<p>
The ordered-factor GAM is fitted to the three lochs using the following
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">Hg</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">oSoilType</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Date</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">oSoilType</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">metals</span><span class="p">,</span><span class="w">
</span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'REML'</span><span class="p">)</span></code></pre>
</figure>
<p>
and the resulting smooths can be drawn using the <code>plot()</code> method
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">shade</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">pages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">seWithMean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/difference-smooths-ii-plot-smooths-1.png" alt="Estimated smooth trend for the non-eroded site (top, left), and difference smooths reflecting estimated differences between the non-eroded site and the eroded site (top, right) and thin soil site (bottom, left), respectively." />
<figcaption>
Estimated smooth trend for the non-eroded site (top, left), and difference smooths reflecting estimated differences between the non-eroded site and the eroded site (top, right) and thin soil site (bottom, left), respectively.
</figcaption>
</figure>
<p>
The smooth in the top left is the reference smooth trend for the <code>non-eroded</code> site. The other two smooths are the difference smooths between the <code>non-eroded</code> and <code>eroded</code> sites (top right) and between the <code>non-eroded</code> and <code>thin</code> soil sites (bottom left).
</p>
<p>
It is immediately clear that the difference between the non-eroded and eroded sites is not significant under this model. The estimated difference is linear, which suggests the trend in the eroded site is stronger than the one estimated for the non-eroded site. However, this difference is not so large as to be an identifiably different trend.
</p>
<p>
The difference smooth for the thin soil site is considerably different to that estimated for the non-eroded site; the principal difference being the much reduced trend in the thin soil site, as indicated by the difference smooth acting in opposition to the estimated trend for the non-eroded site.
</p>
<p>
A nice feature of the ordered factor approach is that inference on these differences can be performed formally and directly using the <code>summary()</code> output of the estimated GAM
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
Hg ~ oSoilType + s(Date) + s(Date, by = oSoilType)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.2231 0.6789 19.478 < 2e-16 ***
oSoilType.L -1.6948 1.1608 -1.460 0.15399
oSoilType.Q -4.2847 1.1990 -3.573 0.00114 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Date) 4.843 5.914 10.862 2.67e-07 ***
s(Date):oSoilTypeeroded 1.000 1.000 0.471 0.498
s(Date):oSoilTypethin 3.047 3.779 10.091 1.84e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.76 Deviance explained = 82.1%
-REML = 126.5 Scale est. = 20.144 n = 44</code></pre>
</figure>
<p>
The impressions we formed about the differences in trends are reinforced with actual test statistics; this is a clear advantage of the ordered-factor approach <em>if</em> your problem suits this <em>different from reference</em> situation.
</p>
<p>
One feature to note: because we used an ordered factor, the parametric term for <code>oSoilType</code> uses polynomial contrasts; the <code>.L</code> and <code>.Q</code> coefficients refer to the linear and quadratic terms used to represent the factor, which makes it harder to read off differences in mean Hg concentration between sites. The small example below shows the contrasts in question.
</p>
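<p>
As a quick check (my addition, not part of the original post), we can print the contrast matrix R generates for the ordered factor; with three levels, the polynomial contrasts have exactly the <code>.L</code> and <code>.Q</code> columns seen in the summary.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## the polynomial contrasts underlying the oSoilType parametric term
contrasts(metals$oSoilType)   # columns .L and .Q</code></pre>
</figure>
<p>
If you want to retain a readily interpreted parameterisation, use the <code>SoilType</code> factor for the parametric part:
</p>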
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">Hg</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">SoilType</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Date</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">oSoilType</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">metals</span><span class="p">,</span><span class="w">
</span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'REML'</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
Hg ~ SoilType + s(Date) + s(Date, by = oSoilType)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.722 1.213 13.788 4.88e-15 ***
SoilTypenon-eroded -4.049 1.684 -2.405 0.022115 *
SoilTypethin -6.446 1.681 -3.835 0.000553 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Date) 4.843 5.914 10.862 2.67e-07 ***
s(Date):oSoilTypeeroded 1.000 1.000 0.471 0.498
s(Date):oSoilTypethin 3.047 3.779 10.091 1.84e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.76 Deviance explained = 82.1%
-REML = 125.95 Scale est. = 20.144 n = 44</code></pre>
</figure>
<p>
Now the output in the parametric terms section is easier to interpret, yet we retain the reference-smooth-plus-difference-smooths behaviour of the fitted GAM.
</p>
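<p>
To see what the model implies on the response scale, we can also generate predicted trends for each soil type. The following is a minimal sketch (my addition, not part of the original post) using <code>predict()</code>; the range of <code>Date</code> values is an assumption based on the data shown above, and the intervals are approximate 95% pointwise intervals.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## predicted Hg trends per soil type from the fitted GAM `m`
new_df <- expand.grid(Date = seq(1900, 2000, by = 2),
                      SoilType = c('non-eroded', 'eroded', 'thin'))
## the model formula uses both the plain and the ordered factor
new_df <- transform(new_df,
                    oSoilType = ordered(SoilType,
                                        levels = c('non-eroded', 'eroded', 'thin')))
pred <- predict(m, newdata = new_df, se.fit = TRUE)
new_df <- transform(new_df,
                    fitted = pred$fit,
                    lower  = pred$fit - (2 * pred$se.fit),
                    upper  = pred$fit + (2 * pred$se.fit))
head(new_df)</code></pre>
</figure>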
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Rose2012-pl">
<p>
Rose, N. L., Yang, H., Turner, S. D., and Simpson, G. L. (2012). An assessment of the mechanisms for the transfer of lead and mercury from atmospherically contaminated organic soils to lake sediments with particular reference to Scotland, UK. <em>Geochimica et Cosmochimica Acta</em> 82, 113–135. doi:<a href="https://doi.org/10.1016/j.gca.2010.12.026">10.1016/j.gca.2010.12.026</a>.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
Except now you need to be sure to include the right set of basis functions that correspond to the pair of levels you want to compare. <em>You can't do that with the function I included in that post; it requires something a bit more sophisticated, but the principles are the same</em>.<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
First steps with MRF smooths
Gavin L. Simpson
2017-10-19T12:00:00-06:00
2017-10-19T12:00:00-06:00
https://www.fromthebottomoftheheap.net/2017/10/19/first-steps-with-mrf-smooths/
<p>
One of the specialist smoother types in the <strong>mgcv</strong> package is the Markov Random Field (MRF) smooth. This smoother essentially allows you to model spatial data with an intrinsic Gaussian Markov random field (GMRF). GMRFs are often used for spatial data measured over discrete spatial regions. MRFs are quite flexible as you can think about them as representing an undirected graph whose nodes are your samples and the connections between the nodes are specified via a neighbourhood structure. I've become interested in using these MRF smooths to include information about relationships between species. However, these smooths are not widely documented in the smoothing literature, so working out how best to use them to do what we want has been a little tricky once you move beyond the typical spatial examples. As a result I've been fiddling with these smooths, fitting them to some spatial data I came across in a tutorial <a href="https://pudding.cool/process/regional_smoothing/">Regional Smoothing in R</a> from The Pudding. In this post I take a quick look at how to use the MRF smooth in <strong>mgcv</strong> to model a discrete spatial data set from the US Census Bureau.
</p>
<p>
In that tutorial, the example data are taken from the US Census Bureau via a shapefile prepared by the author. After a little munging (quite a few steps are missing from the tutorial) I managed to get data from the shapefile that matched what was used in the tutorial. The data are county-level percentages of US adults whose highest level of educational attainment is a high school diploma. The raw data are shown in the figure below
</p>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/first-steps-with-mrf-smooths-hsd-data-plot-1.png" />
</p>
<p>
To follow along, you'll need to download the example <a href="https://github.com/polygraph-cool/smoothing_tutorial/blob/master/us_county_hs_only.zip">shapefile</a> provided by the author of the post on The Pudding. The shapefile(s) are in a ZIP, which I extracted into the working directory; the code below assumes this.
</p>
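<p>
If you prefer to script that step, the following sketch downloads and extracts the ZIP from R; note that the raw-file URL is my assumption, derived from the repository linked above, so check that it still resolves.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## fetch and extract the tutorial shapefile into the working directory
zipfile <- 'us_county_hs_only.zip'
download.file('https://github.com/polygraph-cool/smoothing_tutorial/raw/master/us_county_hs_only.zip',
              destfile = zipfile, mode = 'wb')
unzip(zipfile)</code></pre>
</figure>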
<p>
This post will make use of the following set of packages; load them now, as shown below, and install any that you may be missing
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'rgdal'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'proj4'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'spdep'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'mgcv'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'ggplot2'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'viridis'</span><span class="p">)</span></code></pre>
</figure>
<p>
Assuming you have extracted the shapefile, we load it into R using <code>readOGR()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">shp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readOGR</span><span class="p">(</span><span class="s1">'.'</span><span class="p">,</span><span class="w"> </span><span class="s1">'us_county_hs_only'</span><span class="p">)</span></code></pre>
</figure>
<p>
and do some data munging
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## select only mainland US counties</span><span class="w">
</span><span class="n">states</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">4</span><span class="o">:</span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="m">8</span><span class="o">:</span><span class="m">13</span><span class="p">,</span><span class="w"> </span><span class="m">16</span><span class="o">:</span><span class="m">42</span><span class="p">,</span><span class="w"> </span><span class="m">44</span><span class="o">:</span><span class="m">51</span><span class="p">,</span><span class="w"> </span><span class="m">53</span><span class="o">:</span><span class="m">56</span><span class="p">)</span><span class="w">
</span><span class="n">shp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">shp</span><span class="p">[</span><span class="n">shp</span><span class="o">$</span><span class="n">STATEFP</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">sprintf</span><span class="p">(</span><span class="s1">'%02i'</span><span class="p">,</span><span class="w"> </span><span class="n">states</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">droplevels</span><span class="p">(</span><span class="n">as</span><span class="p">(</span><span class="n">shp</span><span class="p">,</span><span class="w"> </span><span class="s1">'data.frame'</span><span class="p">))</span><span class="w">
</span><span class="c1">## project data</span><span class="w">
</span><span class="n">aea.proj</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96"</span><span class="w">
</span><span class="n">shp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">spTransform</span><span class="p">(</span><span class="n">shp</span><span class="p">,</span><span class="w"> </span><span class="n">CRS</span><span class="p">(</span><span class="n">aea.proj</span><span class="p">))</span><span class="w"> </span><span class="c1"># project to Albers</span><span class="w">
</span><span class="n">shpf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fortify</span><span class="p">(</span><span class="n">shp</span><span class="p">,</span><span class="w"> </span><span class="n">region</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'GEOID'</span><span class="p">)</span><span class="w">
</span><span class="c1">## Need a proportion for fitting</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">hsd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hs_pct</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">100</span><span class="p">)</span></code></pre>
</figure>
<p>
The shapefile contains US Census Bureau data for all US counties, including many that are far from the continental USA. The tutorial from The Pudding doesn't go into how these counties were removed, or how the map was drawn without them. For our purposes they may cause complications when we try to model them using the MRF smooth. I'm sure the modelling approach can handle data like this, but as I wanted to achieve something that followed the tutorial I've removed everything not linked to the continental US landmass, including (I'm sorry!) Alaska and Hawaii; my <strong>ggplot</strong> and mapping skills aren't yet good enough to move Alaska and Hawaii to the bottom left of such maps.
</p>
<p>
The data were projected using the Albers equal area projection and subsequently passed to the <code>fortify()</code> method from <strong>ggplot2</strong> to get a version of the county polygons suitable for plotting with that package.
</p>
<p>
Finally, I created a new variable <code>hsd</code>, which is just the variable <code>hs_pct</code> divided by 100. This creates a proportion that we'll need for model fitting, as you'll see shortly.
</p>
<p>
Before we can model these data with <code>gam()</code>, we need to create the supporting information that <code>gam()</code> will use to create the MRF smooth penalty. The penalty matrix in an MRF smooth is based on the neighbourhood structure of the observations. There are three ways to pass this information to <code>gam()</code>
</p>
<ol type="1">
<li>
as a list of polygons (not <code>SpatialPolygons</code>, I believe)
</li>
<li>
as a list containing the neighbourhood structure, or
</li>
<li>
the raw penalty matrix itself.
</li>
</ol>
<p>
Options 1 and 3 aren't easily doable as far as I can see; <code>gam()</code> isn't expecting the sort of object we created when we imported the shapefile, and nobody wants to build a penalty matrix by hand! Thankfully option 2, the neighbourhood structure, is relatively easy to create. For that I use the <code>poly2nb()</code> function from the <strong>spdep</strong> package. This function takes a shapefile and works out which regions are neighbours of any other region by virtue of them sharing a border. To make sure everything matches up nicely in the way <code>gam()</code> wants this list, we specify that the region IDs should be the <code>GEOID</code>s from the original data set (the <code>GEOID</code> uniquely identifies each county) and we have to set the <code>names</code> attribute on the neighbourhood list to match these unique IDs
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">nb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">poly2nb</span><span class="p">(</span><span class="n">shp</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="o">$</span><span class="n">GEOID</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">nb</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">attr</span><span class="p">(</span><span class="n">nb</span><span class="p">,</span><span class="w"> </span><span class="s2">"region.id"</span><span class="p">)</span></code></pre>
</figure>
<p>
The result of the previous chunk is a list whose names map on to the levels of the <code>GEOID</code> factor. The values in each element of <code>nb</code> index the elements of <code>nb</code> that are neighbours of the current element
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">str</span><span class="p">(</span><span class="n">nb</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">6</span><span class="p">])</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">List of 6
$ 19107: int [1:6] 1417 1464 1632 2277 2278 2851
$ 19189: int [1:6] 551 1414 2151 2452 2846 2849
$ 20093: int [1:7] 5 557 1064 1142 1437 1441 2978
$ 20123: int [1:5] 1469 1565 2648 2966 2977
$ 20187: int [1:7] 3 554 1142 1441 1620 2142 2238
$ 21005: int [1:7] 582 583 953 954 1770 1861 2169</code></pre>
</figure>
<p>
With that done we can now fit the GAM. Fitting this is going to take a wee while (over 3 hours for the full rank MRF, using 6 threads, on a reasonably powerful 3-year-old workstation with dual 4-core Xeon processors). To specify an MRF smooth we use the <code>bs</code> argument to the <code>s()</code> function, setting it to <code>bs = 'mrf'</code>. The neighbourhood list is passed via the <code>xt</code> argument, which takes a list as a value; here we specify a component <code>nb</code> which takes our neighbourhood list <code>nb</code>. The final set-up choice is whether to fit a full rank MRF, where a coefficient for each county will be estimated, or a reduced rank MRF, wherein the MRF is represented using fewer coefficients and counties are mapped to the smaller set of coefficients. The rank of the MRF smooth is set using the <code>k</code> argument. The default is to fit a full rank MRF, whilst setting <code>k < NROW(data)</code> will result in a reduced-rank MRF being estimated.
</p>
<p>
The full rank MRF model is estimated using
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ctrl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam.control</span><span class="p">(</span><span class="n">nthreads</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">)</span><span class="w"> </span><span class="c1"># use 6 parallel threads, reduce if fewer physical CPU cores</span><span class="w">
</span><span class="n">m1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">hsd</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">GEOID</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'mrf'</span><span class="p">,</span><span class="w"> </span><span class="n">xt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">nb</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nb</span><span class="p">)),</span><span class="w"> </span><span class="c1"># define MRF smooth</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w">
</span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'REML'</span><span class="p">,</span><span class="w"> </span><span class="c1"># REML smoothness selection</span><span class="w">
</span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">betar</span><span class="p">())</span><span class="w"> </span><span class="c1"># fit a beta regression</span></code></pre>
</figure>
<p>
As the response is a proportion, the fitted GAM uses the beta distribution as the conditional distribution of the response. The default link is the logit, just as it is for the binomial distribution, and it ensures that fitted values on the scale of the linear predictor are mapped onto the allowed range for proportions, 0–1.
</p>
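<p>
As a small aside (my illustration, not the original post's), the logit link and its inverse are available in base R as <code>qlogis()</code> and <code>plogis()</code>, which makes it easy to move between the two scales by hand.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## map a proportion to the linear predictor scale and back again
qlogis(0.35)        # log(0.35 / 0.65) = -0.6190392
plogis(-0.6190392)  # inverse logit recovers 0.35</code></pre>
</figure>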
<p>
The final model uses in the region of 1700 effective degrees of freedom. This is the smoothness penalty at work; rather than spending all 3108 individual coefficients, the penalty, which tries to arrange for neighbouring counties to have similar coefficients, has shrunk away almost half of the complexity implied by the full rank MRF.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m1</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: Beta regression(179.532)
Link function: logit
Formula:
hsd ~ s(GEOID, bs = "mrf", xt = list(nb = nb))
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.63806 0.00283 -225.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(GEOID) 1732 3107 9382 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.769 Deviance explained = 89.7%
-REML = -4544 Scale est. = 1 n = 3108</code></pre>
</figure>
<p>
Whilst the penalty enforces smoothness, further smoothing can be imposed by fitting a reduced rank MRF. In the next code block I fit models with <code>k = 300</code> and <code>k = 30</code> respectively, which imply considerable smoothing relative to the full rank model.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## rank 300 MRF</span><span class="w">
</span><span class="n">m2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">hsd</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">GEOID</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'mrf'</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">300</span><span class="p">,</span><span class="w"> </span><span class="n">xt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">nb</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nb</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'REML'</span><span class="p">,</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">betar</span><span class="p">())</span><span class="w">
</span><span class="c1">## rank 30 MRF</span><span class="w">
</span><span class="n">m3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">hsd</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">GEOID</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'mrf'</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="n">xt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">nb</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nb</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'REML'</span><span class="p">,</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">betar</span><span class="p">())</span></code></pre>
</figure>
<p>
To visualise the different fits we need to generate predicted values on the response scale for each county and add these to the county data <code>df</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w">
</span><span class="n">mrfFull</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'response'</span><span class="p">),</span><span class="w">
</span><span class="n">mrfRrank300</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m2</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'response'</span><span class="p">),</span><span class="w">
</span><span class="n">mrfRrank30</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m3</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'response'</span><span class="p">))</span></code></pre>
</figure>
<p>
Before we can plot these fitted values we need to merge <code>df</code> with the fortified shapefile
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## merge data with fortified shapefile</span><span class="w">
</span><span class="n">mdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">left_join</span><span class="p">(</span><span class="n">shpf</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'id'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'GEOID'</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning: Column `id`/`GEOID` joining character vector and factor, coercing
into character vector</code></pre>
</figure>
<p>
To facilitate plotting with <strong>ggplot2</strong> I begin by creating some fixed plot components, like the theme, scale, and labels
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">theme_map</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">theme_minimal</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">...</span><span class="p">,</span><span class="w">
</span><span class="n">axis.line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.text.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.title.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.title.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">panel.border</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">())</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">myTheme</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">theme_map</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'bottom'</span><span class="p">)</span><span class="w">
</span><span class="n">myScale</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">scale_fill_viridis</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'%'</span><span class="p">,</span><span class="w"> </span><span class="n">option</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'plasma'</span><span class="p">,</span><span class="w">
</span><span class="n">limits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.1</span><span class="p">,</span><span class="w"> </span><span class="m">0.55</span><span class="p">),</span><span class="w">
</span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w">
</span><span class="n">guide</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">guide_colorbar</span><span class="p">(</span><span class="n">direction</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"horizontal"</span><span class="p">,</span><span class="w">
</span><span class="n">barheight</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unit</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"mm"</span><span class="p">),</span><span class="w">
</span><span class="n">barwidth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unit</span><span class="p">(</span><span class="m">75</span><span class="p">,</span><span class="w"> </span><span class="n">units</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"mm"</span><span class="p">),</span><span class="w">
</span><span class="n">title.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'left'</span><span class="p">,</span><span class="w">
</span><span class="n">title.hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w">
</span><span class="n">label.hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">))</span><span class="w">
</span><span class="n">myLabs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'US Adult Education'</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'% of adults where high school diploma is highest level education'</span><span class="p">,</span><span class="w">
</span><span class="n">caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Source: US Census Bureau'</span><span class="p">)</span></code></pre>
</figure>
<p>
I took many of these settings from Timo Grossenbacher's excellent <a href="https://timogrossenbacher.ch/2016/12/beautiful-thematic-maps-with-ggplot2-only/">post on mapping regional demographic data in Switzerland</a>.
</p>
<p>
Now we can plot the fitted proportions. Note that whilst we plot proportions, the colour bar labels are in percentages in keeping with the original data (see the definition of <code>myScale</code> to see how this was achieved).
</p>
<p>
Fitted values from the full rank MRF are shown below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mdata</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lat</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">group</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_polygon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mrfFull</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">(</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'black'</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_equal</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">myTheme</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">myScale</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">myLabs</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/first-steps-with-mrf-smooths-plot-full-rank-mrf-1.png" />
</p>
<p>
This model explains about 90% of the deviance in the original data. Whilst some smoothing is evident, the fitted values show a considerable amount of non-spatial variation. This is most likely due to not including important covariates, such as county average income, which might explain some of the finer-scale structure, such as neighbouring counties with quite different proportions. A more considered analysis would include these and other relevant predictors alongside the MRF.
</p>
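<p>
As a sketch of what that more considered analysis might look like (hypothetical: <code>income</code> is not a variable in this data set), a covariate smooth simply sits alongside the MRF term in the formula.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## hypothetical: a county-level covariate smooth alongside the MRF
m_cov <- gam(hsd ~ s(income) + s(GEOID, bs = 'mrf', xt = list(nb = nb)),
             data = df, method = 'REML', control = ctrl,
             family = betar())</code></pre>
</figure>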
<p>
Smoother surfaces can be achieved via the reduced rank MRFs. First the rank 300 MRF
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mdata</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lat</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">group</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_polygon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mrfRrank300</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">(</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'black'</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_equal</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">myTheme</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">myScale</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">myLabs</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/first-steps-with-mrf-smooths-plot-rank-300-mrf-1.png" />
</p>
<p>
and next the rank 30 MRF
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">mdata</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lat</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">group</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_polygon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mrfRrank30</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">(</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'black'</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_equal</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">myTheme</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">myScale</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">myLabs</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/first-steps-with-mrf-smooths-plot-rank-30-mrf-1.png" />
</p>
<p>
As can be clearly seen from the plots, the degree of smoothness can be controlled effectively via the <code>k</code> argument.
</p>
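<p>
If you want a rough numerical comparison of the three fits (my addition, not in the original post), the <code>AIC()</code> method for fitted GAMs provides one, with lower values indicating better support once effective model complexity is accounted for.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## compare the full rank and reduced rank MRF fits
AIC(m1, m2, m3)</code></pre>
</figure>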
<p>
In a future post I'll take a closer look at using MRFs alongside other covariates as part of a more complex spatial modelling exercise.
</p>
Comparing smooths in factor-smooth interactions I
Gavin L. Simpson
2017-10-10T16:00:00-06:00
2017-10-10T16:00:00-06:00
https://www.fromthebottomoftheheap.net/2017/10/10/difference-splines-i/
<p>
One of the really appealing features of the <strong>mgcv</strong> package for fitting GAMs is the functionality it exposes for fitting quite complex models, models that lie well beyond what many of us may have learned about what GAMs can do. One of those features that I use a lot is the ability to model the smooth effects of some covariate <span class="math inline">\(x\)</span> in the different levels of a factor. Having estimated a separate smoother for each level of the factor, the obvious question is, which smooths are different? In this post I'll take a look at one way to do this using <code>by</code>-variable smooths.
</p>
<p>
With <strong>mgcv</strong>, smooths are included in model formulae using the <code>s()</code> function. If you want to have the smooth equivalent of a continuous-factor interaction, one way to achieve this is via the <code>by</code> argument to <code>s()</code>. If you pass a factor to <code>by</code>, <strong>mgcv</strong> sets up the model matrix in such a way that you get a separate smoother for each level of the <code>by</code> factor. Each of these smoothers gets its own smoothness parameter, so you can fit a wiggly function in level <em>foo</em> and a smooth function in level <em>bar</em>, with each level's function being learned from the data associated with that level.
</p>
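<p>
In formula terms, the pattern looks like the following minimal sketch, with hypothetical names: response <code>y</code>, covariate <code>x</code>, factor <code>f</code>, and data frame <code>dat</code>. The parametric <code>f</code> term is there because, as discussed below, each <code>by</code> smooth is centred and so cannot absorb differences in group means.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library('mgcv')
## a separate smooth of x for each level of f, each with its own
## smoothness parameter, plus f itself to model the group means
m <- gam(y ~ f + s(x, by = f), data = dat, method = 'REML')</code></pre>
</figure>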
<p>
I used this technique in a <a href="http://doi.org/10.1016/j.gca.2010.12.026">paper</a> I wrote with my colleagues at UCL, Neil Rose, Handong Yang, and Simon Turner <span class="citation" data-cites="Rose2012-pl">(Rose et al., 2012)</span>. Neil, Handong, and Simon had collected sediment cores from several Scottish lochs and measured metal concentrations, especially of lead (Pb) and mercury (Hg), in sediment slices covering the last 200 years. The aim of the study was to investigate sediment profiles of these metals in three regions of Scotland; north east, north west, and south west. A pair of lochs in each region was selected, one in a catchment with visibly eroding peat/soil, and the other in a catchment without erosion. The different regions represented variations in historical deposition levels, whilst the hypothesis was that cores from eroded and non-eroded catchments would show differential responses to reductions in emissions of Pb and Hg to the atmosphere. The difference, it was hypothesised, was that the eroding soil acts as a secondary source of pollutants to the lake. You can read more about it in the <a href="http://doi.org/10.1016/j.gca.2010.12.026">paper</a>; if you're interested but don't have access to the journal, send me an email and I'll pass on a pdf.
</p>
<p>
It was relatively simple to fit splines to each sediment profile, but once I'd done this, how were we going to estimate the difference between the fitted trends? Thankfully, I already had the answer, as Simon Wood had supplied code to do it to an OP on the R-Help listserver some years previously. That answer involved <code>by</code>-variable smoothers, which I was already using, and the use of the <span class="math inline">\(X_p\)</span> matrix of the fitted GAM.
</p>
<p>
Readers of this blog will have heard about the <span class="math inline">\(X_p\)</span> matrix before; it's used a lot when we want to simulate from the posterior of the estimated model. Importantly, for our purposes, it allows for the creation of derived quantities from the fitted model, and the assignment of uncertainty to those quantities.
</p>
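<p>
For concreteness, here is a minimal sketch of how the <span class="math inline">\(X_p\)</span> matrix is obtained in <strong>mgcv</strong>; <code>mod</code> and <code>pdat</code> are placeholders for a fitted model and a data frame of covariate values at which to evaluate it.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## the prediction (Xp) matrix maps coefficients to fitted values at
## the covariate values in `pdat`
Xp  <- predict(mod, newdata = pdat, type = 'lpmatrix')
eta <- Xp %*% coef(mod)   # fitted values on the linear predictor scale</code></pre>
</figure>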
<p>
In this post Iāll illustrate how to do the required comparison using some of the data from that study on Scottish lochs.
</p>
<p>
In this post I'll use the following packages
</p>
<ul>
<li>
<strong>readr</strong>
</li>
<li>
<strong>dplyr</strong>
</li>
<li>
<strong>ggplot2</strong>, and
</li>
<li>
<strong>mgcv</strong>
</li>
</ul>
<p>
You'll more than likely have these installed, but if you get errors about missing packages when you run the code chunk below, install any missing packages and run the chunk again
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'readr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'ggplot2'</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'mgcv'</span><span class="p">)</span></code></pre>
</figure>
<p>
Next, load the data set and convert the <code>SiteCode</code> variable to a factor for use in fitting the GAM with <code>gam()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">uri</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'https://gist.githubusercontent.com/gavinsimpson/eb4ff24fa9924a588e6ee60dfae8746f/raw/geochimica-metals.csv'</span><span class="w">
</span><span class="n">metals</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read_csv</span><span class="p">(</span><span class="n">uri</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">col_types</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'ciccd'</span><span class="p">))</span><span class="w">
</span><span class="n">metals</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">metals</span><span class="p">,</span><span class="w"> </span><span class="n">SiteCode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">SiteCode</span><span class="p">))</span></code></pre>
</figure>
<p>
This is a subset of the data used in <span class="citation" data-cites="Rose2012-pl">Rose et al. (2012)</span>; the Hg concentrations in the sediments for just three of the lochs are included here in the interests of simplicity. The data set contains 5 variables
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">metals</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 44 x 5
SiteCode Date SoilType Region Hg
<fctr> <int> <chr> <chr> <dbl>
1 CHNA 2000 thin NW 3.843399
2 CHNA 1990 thin NW 5.424618
3 CHNA 1980 thin NW 8.819730
4 CHNA 1970 thin NW 11.417457
5 CHNA 1960 thin NW 16.513540
6 CHNA 1950 thin NW 16.512047
7 CHNA 1940 thin NW 11.188840
8 CHNA 1930 thin NW 11.622222
9 CHNA 1920 thin NW 13.645853
10 CHNA 1910 thin NW 11.181711
# ... with 34 more rows</code></pre>
</figure>
<ul>
<li>
<code>SiteCode</code> is a factor indexing the three lochs, with levels <code>CHNA</code>, <code>FION</code>, and <code>NODH</code>,
</li>
<li>
<code>Date</code> is a numeric variable of sediment age per sample,
</li>
<li>
<code>SoilType</code> and <code>Region</code> are additional factors for the (natural) experimental design, and
</li>
<li>
<code>Hg</code> is the response variable of interest, and contains the Hg concentration of each sediment sample.
</li>
</ul>
<p>
Neil gave me permission to make these data available openly should you want to try this approach out for yourself. If you make use of the data for other purposes, please cite the source publication <span class="citation" data-cites="Rose2012-pl">(Rose et al., 2012)</span> and recognize the contribution of the data creators; Handong Yang, Simon Turner, and Neil Rose.
</p>
<p>
The data, with LOESS smoothers superimposed, are shown below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">metals</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Hg</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SiteCode</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_smooth</span><span class="p">(</span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'loess'</span><span class="p">,</span><span class="w"> </span><span class="n">se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_brewer</span><span class="p">(</span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'qual'</span><span class="p">,</span><span class="w"> </span><span class="n">palette</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Dark2'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'top'</span><span class="p">)</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/difference-smooths-i-plot-data-1.png" />
</p>
<p>
Smooth-factor interactions can be estimated using <code>gam()</code> in a number of different ways. Here we use <code>by</code>-variable smooths. Each of the separate smooths is subject to identifiability constraints, which effectively centres each smooth around zero effect. As such, differences in the mean Hg concentrations of the lochs are not accounted for by the smooths. To rectify this we'll need to add <code>SiteCode</code> as a parametric term to the model, along with the smooths.
</p>
<p>
The GAM is fitted to the three sites, and the fit summarized, using the following code
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">Hg</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">SiteCode</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SiteCode</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">metals</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
Hg ~ SiteCode + s(Date, by = SiteCode)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.2970 0.7889 13.052 2.19e-12 ***
SiteCodeFION 2.3260 1.1163 2.084 0.048026 *
SiteCodeNODH 5.5587 1.3288 4.183 0.000332 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Date):SiteCodeCHNA 2.744 3.412 3.786 0.0187 *
s(Date):SiteCodeFION 5.711 6.861 18.745 7.66e-12 ***
s(Date):SiteCodeNODH 8.574 8.922 19.086 7.62e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.889 Deviance explained = 93.8%
GCV = 17.076 Scale est. = 9.3029 n = 44</code></pre>
</figure>
<p>
and the resulting smooths can be drawn using the <code>plot()</code> method
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">shade</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">pages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/difference-smooths-i-plot-smooths-1.png" alt="Estimated smooths for each level of factor SiteCode" />
<figcaption>
Estimated smooths for each level of factor <code>SiteCode</code>
</figcaption>
</figure>
<h2 id="differences-of-smooths">
Differences of smooths
</h2>
<p>
To calculate the differences between pairs of the three smooths estimated in the model we need to be able to evaluate the smooths at a set of values of <code>Date</code>. Below we specify a fine grid of points over the time-scale of each core. This set of prediction data is passed to the <code>predict()</code> method and the <span class="math inline">\(X_p\)</span> matrix is requested with the option <code>type = 'lpmatrix'</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">Date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1860</span><span class="p">,</span><span class="w"> </span><span class="m">2000</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">400</span><span class="p">),</span><span class="w">
</span><span class="n">SiteCode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'FION'</span><span class="p">,</span><span class="w"> </span><span class="s1">'CHNA'</span><span class="p">,</span><span class="w"> </span><span class="s1">'NODH'</span><span class="p">))</span><span class="w">
</span><span class="n">xp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'lpmatrix'</span><span class="p">)</span></code></pre>
</figure>
<p>
The result, stored in <code>xp</code>, is a matrix where the basis functions of the model have been evaluated at the values of the covariates supplied to <code>newdata</code>. To turn this matrix into one containing fitted or predicted values, it needs to be multiplied by the model coefficients and the rows summed. However, while the model is in this <span class="math inline">\(X_p\)</span> form we can compute differences between the evaluated smooths before computing fitted values.
</p>
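<p>
As a quick sanity check (not needed for the difference calculation itself), multiplying <span class="math inline">\(X_p\)</span> by the estimated coefficients should reproduce what <code>predict()</code> returns for the same data:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Xp %*% beta-hat equals the usual predictions (on the link scale)
fit_xp <- drop(xp %*% coef(m))
fit_pr <- predict(m, newdata = pdat)
all.equal(unname(fit_xp), unname(fit_pr))</code></pre>
</figure>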
<p>
This process needs to be repeated for each pair of smooths we want to compare; it is a bit like doing all pair-wise post hoc comparisons. A number of steps are involved, which I break down below for the comparison of the smooths for <code>SiteCode == 'CHNA'</code> and <code>SiteCode == 'FION'</code>. After I've gone through the steps, we'll wrap them all into a function which we can use to automate the process.
</p>
<p>
The first step is to identify which columns of <span class="math inline">\(X_p\)</span> relate to the smooths for the pair of levels of <code>SiteCode</code> we are comparing. The rows of <span class="math inline">\(X_p\)</span> that contain the data for this pair of lochs also need to be identified.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## which cols of xp relate to splines of interest?</span><span class="w">
</span><span class="n">c1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">grepl</span><span class="p">(</span><span class="s1">'CHNA'</span><span class="p">,</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">xp</span><span class="p">))</span><span class="w">
</span><span class="n">c2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">grepl</span><span class="p">(</span><span class="s1">'FION'</span><span class="p">,</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">xp</span><span class="p">))</span><span class="w">
</span><span class="c1">## which rows of xp relate to sites of interest?</span><span class="w">
</span><span class="n">r1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">SiteCode</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'CHNA'</span><span class="p">)</span><span class="w">
</span><span class="n">r2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">SiteCode</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'FION'</span><span class="p">)</span></code></pre>
</figure>
<p>
Next, we subtract the elements of <span class="math inline">\(X_p\)</span> for the first loch from the elements of <span class="math inline">\(X_p\)</span> for the second loch. To focus on the difference between the pair of smooths, the columns of the differenced <span class="math inline">\(X_p\)</span> matrix (in <code>X</code>) that aren't involved in the comparison are then set to zero
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## difference rows of xp for data from comparison</span><span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">xp</span><span class="p">[</span><span class="n">r1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">xp</span><span class="p">[</span><span class="n">r2</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="c1">## zero out cols of X related to splines for other lochs</span><span class="w">
</span><span class="n">X</span><span class="p">[,</span><span class="w"> </span><span class="o">!</span><span class="w"> </span><span class="p">(</span><span class="n">c1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">c2</span><span class="p">)]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="c1">## zero out the parametric cols</span><span class="w">
</span><span class="n">X</span><span class="p">[,</span><span class="w"> </span><span class="o">!</span><span class="n">grepl</span><span class="p">(</span><span class="s1">'^s\\('</span><span class="p">,</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">xp</span><span class="p">))]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span></code></pre>
</figure>
<p>
The first zeroing uses the logical indices for columns containing either <code>'CHNA'</code> or <code>'FION'</code>; if you had a model with additional smooths involving the <code>SiteCode</code> variable, you'd need a more sophisticated way of identifying the columns of <span class="math inline">\(X_p\)</span> that relate to the smooths of interest. The second zeroing affects all the columns related to the parametric terms in the model. For this model these relate to the intercept and the two dummy contrasts associated with <code>SiteCode</code> in the model.
</p>
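<p>
For example (a sketch; the exact labels depend on your model), you could match the full by-smooth label rather than the site code alone:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## stricter matching on the full smooth label, e.g. "s(Date):SiteCodeCHNA.1"
c1 <- grepl('^s\\(Date\\):SiteCodeCHNA', colnames(xp))
c2 <- grepl('^s\\(Date\\):SiteCodeFION', colnames(xp))</code></pre>
</figure>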
<p>
Having obtained a suitably modified <span class="math inline">\(X_p\)</span> matrix, predicted values can be obtained by multiplying it by the estimated model coefficients and summing the result row-wise. This is achieved in a single step using a matrix multiplication of <code>X</code> with the vector of model coefficients.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">dif</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre>
</figure>
<p>
Because we zeroed out all the columns not involved directly in the pair of smooths we are comparing, this effectively sets their contributions to the fitted/predicted values to zero as well. The result, stored in <code>dif</code>, is a vector of fitted <em>differences</em> between the pair of smooths we are interested in.
</p>
<p>
Having computed the difference, we want to know how uncertain the estimated difference is. Handily, we can compute the standard errors of the differences using the variance-covariance matrix of the estimated model coefficients. The standard errors are computed using
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">se</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">rowSums</span><span class="p">((</span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">vcov</span><span class="p">(</span><span class="n">m</span><span class="p">))</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">X</span><span class="p">))</span></code></pre>
</figure>
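<p>
This one-liner is computing <span class="math inline">\(\sqrt{\mathrm{diag}(X V X^{T})}\)</span>, where <span class="math inline">\(V\)</span> is the covariance matrix of the model coefficients; <code>rowSums((X %*% vcov(m)) * X)</code> extracts just the diagonal of <span class="math inline">\(X V X^{T}\)</span> without ever forming the whole matrix.
</p>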
<p>
Note that the above assumes that smoothness parameters (which control how wiggly the individual smooths are) are known and fixed. In reality these smoothness parameters were estimated and hence the standard errors just computed are likely biased low. This could be corrected by passing <code>unconditional = TRUE</code> to <code>vcov()</code>.
</p>
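<p>
Applying that correction is a one-line change to the computation above:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## account for smoothness parameter uncertainty in the standard errors
se <- sqrt(rowSums((X %*% vcov(m, unconditional = TRUE)) * X))</code></pre>
</figure>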
<p>
Now that we have standard errors, a point-wise <span class="math inline">\(1 - \alpha\)</span> confidence interval can be created using the critical value of the <em>t</em> distribution with appropriate degrees of freedom (in the case of a Gaussian model; quantiles of the Gaussian distribution would be needed for other conditional distributions). For a 95% interval, we use the following code
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">crit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">qt</span><span class="p">(</span><span class="m">.975</span><span class="p">,</span><span class="w"> </span><span class="n">df.residual</span><span class="p">(</span><span class="n">m</span><span class="p">))</span><span class="w">
</span><span class="n">upr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dif</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">)</span><span class="w">
</span><span class="n">lwr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dif</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">)</span></code></pre>
</figure>
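<p>
For a model where the scale parameter is known, such as a Poisson or binomial GAM, you would swap in quantiles of the standard normal (a sketch of the same step):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## for non-Gaussian models with known scale, use Gaussian quantiles
crit <- qnorm(0.975)</code></pre>
</figure>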
<p>
To allow for these steps to be repeated for all pairwise combinations, the process outlined above is best encapsulated as a function. One such function is shown below, where arguments <code>f1</code>, <code>f2</code>, and <code>var</code> refer to length 1 character vectors specifying the first and second levels of the factor and the name of the <code>by</code>-variable factor respectively.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">smooth_diff</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">f1</span><span class="p">,</span><span class="w"> </span><span class="n">f2</span><span class="p">,</span><span class="w"> </span><span class="n">var</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.05</span><span class="p">,</span><span class="w">
</span><span class="n">unconditional</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">xp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'lpmatrix'</span><span class="p">)</span><span class="w">
</span><span class="n">c1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">grepl</span><span class="p">(</span><span class="n">f1</span><span class="p">,</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">xp</span><span class="p">))</span><span class="w">
</span><span class="n">c2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">grepl</span><span class="p">(</span><span class="n">f2</span><span class="p">,</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">xp</span><span class="p">))</span><span class="w">
</span><span class="n">r1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">newdata</span><span class="p">[[</span><span class="n">var</span><span class="p">]]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">f1</span><span class="w">
</span><span class="n">r2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">newdata</span><span class="p">[[</span><span class="n">var</span><span class="p">]]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">f2</span><span class="w">
</span><span class="c1">## difference rows of xp for data from comparison</span><span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">xp</span><span class="p">[</span><span class="n">r1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">xp</span><span class="p">[</span><span class="n">r2</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="c1">## zero out cols of X related to splines for other lochs</span><span class="w">
</span><span class="n">X</span><span class="p">[,</span><span class="w"> </span><span class="o">!</span><span class="w"> </span><span class="p">(</span><span class="n">c1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">c2</span><span class="p">)]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="c1">## zero out the parametric cols</span><span class="w">
</span><span class="n">X</span><span class="p">[,</span><span class="w"> </span><span class="o">!</span><span class="n">grepl</span><span class="p">(</span><span class="s1">'^s\\('</span><span class="p">,</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">xp</span><span class="p">))]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">dif</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">model</span><span class="p">)</span><span class="w">
</span><span class="n">se</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">rowSums</span><span class="p">((</span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">vcov</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">unconditional</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unconditional</span><span class="p">))</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">X</span><span class="p">))</span><span class="w">
</span><span class="n">crit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">qt</span><span class="p">(</span><span class="n">alpha</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">df.residual</span><span class="p">(</span><span class="n">model</span><span class="p">),</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">upr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dif</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">)</span><span class="w">
</span><span class="n">lwr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dif</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">)</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">pair</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">paste</span><span class="p">(</span><span class="n">f1</span><span class="p">,</span><span class="w"> </span><span class="n">f2</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'-'</span><span class="p">),</span><span class="w">
</span><span class="n">diff</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dif</span><span class="p">,</span><span class="w">
</span><span class="n">se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se</span><span class="p">,</span><span class="w">
</span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upr</span><span class="p">,</span><span class="w">
</span><span class="n">lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwr</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
To complete the pairwise comparison of the estimated smooths, we use the function on the three combinations of pairs of smooths and gather the results into a tidy object <code>comp</code> suitable for plotting with <strong>ggplot2</strong>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">comp1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">smooth_diff</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="s1">'FION'</span><span class="p">,</span><span class="w"> </span><span class="s1">'CHNA'</span><span class="p">,</span><span class="w"> </span><span class="s1">'SiteCode'</span><span class="p">)</span><span class="w">
</span><span class="n">comp2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">smooth_diff</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="s1">'FION'</span><span class="p">,</span><span class="w"> </span><span class="s1">'NODH'</span><span class="p">,</span><span class="w"> </span><span class="s1">'SiteCode'</span><span class="p">)</span><span class="w">
</span><span class="n">comp3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">smooth_diff</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="s1">'CHNA'</span><span class="p">,</span><span class="w"> </span><span class="s1">'NODH'</span><span class="p">,</span><span class="w"> </span><span class="s1">'SiteCode'</span><span class="p">)</span><span class="w">
</span><span class="n">comp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1860</span><span class="p">,</span><span class="w"> </span><span class="m">2000</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">400</span><span class="p">),</span><span class="w">
</span><span class="n">rbind</span><span class="p">(</span><span class="n">comp1</span><span class="p">,</span><span class="w"> </span><span class="n">comp2</span><span class="p">,</span><span class="w"> </span><span class="n">comp3</span><span class="p">))</span></code></pre>
</figure>
<p>
The pairwise differences of smooths and associated confidence intervals can be plotted using
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">comp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">date</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">diff</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pair</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lower</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upper</span><span class="p">),</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">pair</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_cartesian</span><span class="p">(</span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-30</span><span class="p">,</span><span class="m">30</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Difference in Hg trend'</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/difference-smooths-i-plot-difference-smmoths-1.png" alt="Estimated differences of trends in sediment Hg concentration for pairs of Scottish lochs" />
<figcaption>
Estimated differences of trends in sediment Hg concentration for pairs of Scottish lochs
</figcaption>
</figure>
<p>
Where the confidence interval excludes zero, we might infer significant differences between a pair of estimated smooths.
</p>
<h2 id="conclusions">
Conclusions
</h2>
<p>
Regular readers will be familiar with the <span class="math inline">\(X_p\)</span> matrix; I've used this for simulating from the posterior distribution of an estimated GAM, and for computing simultaneous intervals for smoothers, among other things. Here, it is used to compute differences between smooths. The <span class="math inline">\(X_p\)</span> matrix is quite versatile; learning how to use it effectively will allow you to compute all manner of derived quantities related to an estimated GAM.
</p>
<p>
The <code>by</code>-variable type of factor-smooth interaction is just one of the ways of estimating different smooth effects for each level of a factor. One potential disadvantage of this type of smoother is that it is quite wasteful to estimate three separate smooths, each with its own smoothness parameter. More parsimonious ways of fitting factor-smooth interactions are possible with <strong>mgcv</strong>, and I'll look at an alternative option in the next post.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Rose2012-pl">
<p>
Rose, N. L., Yang, H., Turner, S. D., and Simpson, G. L. (2012). An assessment of the mechanisms for the transfer of lead and mercury from atmospherically contaminated organic soils to lake sediments with particular reference to Scotland, UK. <em>Geochimica et Cosmochimica Acta</em> 82, 113–135. doi:<a href="https://doi.org/10.1016/j.gca.2010.12.026">10.1016/j.gca.2010.12.026</a>.
</p>
</div>
</div>
Fitting count and zero-inflated count GLMMs with mgcv
Gavin L. Simpson
2017-05-04T13:45:00-06:00
2017-05-04T13:45:00-06:00
https://www.fromthebottomoftheheap.net/2017/05/04/compare-mgcv-with-glmmTMB/
<p>
A couple of days ago, Mollie Brooks and coauthors posted a <a href="http://doi.org/10.1101/132753">preprint</a> on <a href="http://biorxiv.org/">bioRxiv</a> illustrating the use of the <strong>glmmTMB</strong> R package for fitting zero-inflated GLMMs <span class="citation" data-cites="Brooks2017-so">(Brooks et al., 2017)</span>. In the paper, <strong>glmmTMB</strong> is compared with several other GLMM-fitting packages. <strong>mgcv</strong> has recently gained the ability to fit a wider range of families beyond the exponential family of distributions, including zero-inflated Poisson models. <strong>mgcv</strong> can also fit simple GLMMs through a spline equivalent of a Gaussian random effect. So, whilst I was waiting on some Bayesian GAMs to finish sampling, I decided to see how <strong>mgcv</strong> compared against <strong>glmmTMB</strong> on the two examples used in the paper.
</p>
<div id="refs" class="references">
<div id="ref-Brooks2017-so">
<p>
Brooks, M. E., Kristensen, K., Benthem, K. J. van, Magnusson, A., Berg, C. W., Nielsen, A., et al. (2017). Modeling zero-inflated count data with glmmTMB. <em>bioRxiv</em>, 132753. doi:<a href="https://doi.org/10.1101/132753">10.1101/132753</a>.
</p>
</div>
</div>
<p>
For this post I'll be using a couple of packages beyond <strong>glmmTMB</strong> and <strong>mgcv</strong>; make sure you have <strong>ggplot2</strong> and <strong>ggstance</strong> installed if you wish to run through the code below.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"glmmTMB"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggstance"</span><span class="p">)</span></code></pre>
</figure>
<p>
There are several ways in which <strong>mgcv</strong> allows GLMMs to be fitted, but the way that interests me here is via <code>gam()</code> and the <em>random effect</em> spline basis. Penalised splines of the type provided in <strong>mgcv</strong> can also be represented in mixed model form, such that GAMs can also be fitted using mixed effect modelling software. The general idea is that the spline is decomposed into two parts:
</p>
<ol type="1">
<li>
the perfectly smooth parts of the basis, namely those functions, including constant and linear functions, in the penalty null space of the spline. These are added to the fixed effects model matrix, whilst,
</li>
<li>
the remaining wiggly parts of the basis are treated as random effects.
</li>
</ol>
<p>
Given this duality between splines and random effects, you can reverse the idea and create a spline basis that is the equivalent of a simple Gaussian i.i.d. random effect, such that you can fit a GLMM or GAMM using GAM software like <strong>mgcv</strong>. <strong>mgcv</strong> has the <code>re</code> basis for this, and I'll exploit that to fit the zero-inflated GLMMs to the two examples.
</p>
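<p>
As a minimal sketch of that duality (using simulated data, not either of the example data sets), an <code>'re'</code> smooth of a factor recovers essentially the same variance component as a conventional random intercept:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## simulated data: an 're' smooth vs a plain Gaussian random intercept
set.seed(42)
d <- data.frame(g = factor(rep(1:10, each = 20)))
d$y <- rnorm(10)[d$g] + rnorm(200)
m_re <- gam(y ~ s(g, bs = "re"), data = d, method = "REML")
gam.vcomp(m_re) # compare with lme4::lmer(y ~ 1 + (1 | g), data = d)</code></pre>
</figure>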
<p>
In <span class="citation" data-cites="Brooks2017-so">Brooks et al. (2017)</span>, two example data sets are used:
</p>
<ol type="1">
<li>
<code>Salamanders</code> – Seven combinations of different salamander species and life-stages were repeatedly sampled four times at 23 sites in Appalachian streams <span class="citation" data-cites="Price2016-no">(Price et al., 2016)</span>. Some of the streams were impacted by mountaintop removal and valley filling from coal mining. The data are available from <span class="citation" data-cites="Price2015-se">Price et al. (2015)</span>, as well as the <strong>glmmTMB</strong> package.
</li>
<li>
<code>Owls</code> – the second example is a well-studied one in mixed modelling papers and textbooks <span class="citation" data-cites="Zuur2009-vg Bolker2013">(Zuur et al., 2009; Bolker, 2013)</span>, and relates to the begging behaviour of owl nestlings. The data were originally reported in <span class="citation" data-cites="Roulin2007-rq">Roulin and Bersier (2007)</span>.
</li>
</ol>
<h3 id="salamanders">
Salamanders
</h3>
<p>
<span class="citation" data-cites="Brooks2017-so">Brooks et al. (2017)</span> fit several count models to the <code>Salamander</code> data set, including standard Poisson GLMMs, negative binomial GLMMs, with <span class="math inline">\(\theta\)</span> estimated and modelled via a linear predictor, as well as zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models. Of these, <code>gam()</code> can currently fit all but the negative binomial with <span class="math inline">\(\theta\)</span> modelled via a linear predictor and the ZINB models.
</p>
<p>
The best-fitting model of those presented was a negative binomial model, whilst <span class="citation" data-cites="Brooks2017-so">Brooks et al. (2017)</span> also illustrate how to generate fitted values from the ZIP model. Rather than go through fitting all of the <span class="citation" data-cites="Brooks2017-so">Brooks et al. (2017)</span> models, I restrict fitting here to these two models. A <a href="https://gist.github.com/gavinsimpson/8a0f0e072b095295cf5f7af2762e05a7">gist</a> with code to fit all the models that <code>gam()</code> is capable of is available on GitHub. I have named the models similarly to <span class="citation" data-cites="Brooks2017-so">Brooks et al. (2017)</span> to facilitate comparison.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">nbgam2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">count</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">spp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">mined</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">site</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"re"</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Salamanders</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nb</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ML"</span><span class="p">)</span><span class="w">
</span><span class="n">nbm2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glmmTMB</span><span class="p">(</span><span class="n">count</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">spp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">mined</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">site</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Salamanders</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nbinom2</span><span class="p">)</span></code></pre>
</figure>
<p>
As <code>glmmTMB()</code> is currently only capable of fitting models using maximum likelihood, not REML, I use the Laplace approximate maximum likelihood estimation method for <code>gam()</code>. The new <code>nb</code> family in <strong>mgcv</strong> is for the negative binomial distribution with the (fixed) dispersion parameter <span class="math inline">\(\theta\)</span> estimated as a model parameter, in the same way that <code>MASS::glm.nb()</code> and <code>lme4::glmer.nb()</code> do.
</p>
<p>
In the <code>gam()</code> model, the random effect is specified using the standard <code>s()</code> smooth function with the <code>'re'</code> basis selected. The named variable, here <code>site</code>, should be stored as a factor in the data object to avoid problems.
</p>
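<p>
A cheap defensive check before fitting (hypothetical, but it avoids a hard-to-diagnose failure mode):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## bs = "re" expects the grouping variable to be a factor
stopifnot(is.factor(Salamanders$site))</code></pre>
</figure>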
<p>
The figure below compares the coefficient estimates returned by <code>glmmTMB()</code> and <code>gam()</code>; they are very similar, which is encouraging.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">nb2.coefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">estimate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">coef</span><span class="p">(</span><span class="n">summary</span><span class="p">(</span><span class="n">nbm2</span><span class="p">))</span><span class="o">$</span><span class="n">cond</span><span class="p">[,</span><span class="w"> </span><span class="s2">"Estimate"</span><span class="p">],</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">nbgam2</span><span class="p">)[</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">14</span><span class="p">)]),</span><span class="w">
</span><span class="n">model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"glmmTMB"</span><span class="p">,</span><span class="w"> </span><span class="s2">"mgcv::gam"</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">14</span><span class="p">),</span><span class="w">
</span><span class="n">term</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">coef</span><span class="p">(</span><span class="n">nbgam2</span><span class="p">)[</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">14</span><span class="p">)]),</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">nb2.coefs</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">estimate</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">term</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">position_dodgev</span><span class="p">(</span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Regression estimate"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Comparing mgcv with glmmTMB"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Salamander: Negative Binomial"</span><span class="p">)</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/compare-mgcv-with-glmmtmb-salamander-nb2-coefs-1.png" />
</p>
<p>
The values (posterior modes, or means) for the <code>site</code> random effect can also be compared
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">nbgam2.r</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">nbgam2</span><span class="p">)[</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">14</span><span class="p">)]</span><span class="w">
</span><span class="n">nbm2.r</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ranef</span><span class="p">(</span><span class="n">nbm2</span><span class="p">)</span><span class="o">$</span><span class="n">cond</span><span class="o">$</span><span class="n">site</span><span class="p">[,</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">nms</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sub</span><span class="p">(</span><span class="s2">"s\\(site\\)\\."</span><span class="p">,</span><span class="w"> </span><span class="s2">"Site "</span><span class="p">,</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">nbgam2.r</span><span class="p">))</span><span class="w">
</span><span class="n">ranefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">ranef</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">unname</span><span class="p">(</span><span class="n">nbgam2.r</span><span class="p">),</span><span class="w"> </span><span class="n">nbm2.r</span><span class="p">),</span><span class="w">
</span><span class="n">model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"glmmTMB"</span><span class="p">,</span><span class="w"> </span><span class="s2">"mgcv::gam"</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">nbgam2.r</span><span class="p">)),</span><span class="w">
</span><span class="n">site</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">nms</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">ranefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">ranefs</span><span class="p">,</span><span class="w"> </span><span class="n">site</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">site</span><span class="p">,</span><span class="w"> </span><span class="n">nms</span><span class="p">[</span><span class="n">order</span><span class="p">(</span><span class="n">nbgam2.r</span><span class="p">)]))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">ranefs</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ranef</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">site</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">position_dodgev</span><span class="p">(</span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Random effect"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Comparing mgcv with glmmTMB"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Salamanders: Negative Binomial"</span><span class="p">)</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/compare-mgcv-with-glmmtmb-slamander-nb2-ranefs-1.png" />
</p>
<p>
As the figure above shows, these too are essentially equivalent for the two fits.
</p>
<p>
The <code>summary()</code> output for the <code>glmmTMB()</code> model conveniently provides some additional useful information, most notably in the context of GLMMs the estimated variances (or standard deviations) of the random effect terms. As <code>gam()</code> wasn't designed with GLMMs specifically in mind, the same information is not provided in the <code>summary()</code> method for <code>gam()</code> model fits. However, Simon Wood has provided the <code>gam.vcomp()</code> function, which can be used to return the variance components of the model in a way that allows comparison with other mixed-model software.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">nbm2</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> Family: nbinom2 ( log )
Formula: count ~ spp * mined + (1 | site)
Data: Salamanders
AIC BIC logLik deviance df.resid
1663.4 1734.8 -815.7 1631.4 628
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
site (Intercept) 0.2842 0.5331
Number of obs: 644, groups: site, 23
Overdispersion parameter for nbinom2 family (): 1
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.3750 0.7576 -4.455 8.40e-06 ***
sppPR 0.9306 0.8773 1.061 0.288829
sppDM 2.2485 0.7878 2.854 0.004314 **
sppEC-A 0.7143 0.9052 0.789 0.430029
sppEC-L 1.8130 0.8130 2.230 0.025741 *
sppDES-L 2.5111 0.7795 3.221 0.001275 **
sppDF 2.5765 0.7801 3.303 0.000957 ***
minedno 4.1619 0.7932 5.247 1.55e-07 ***
sppPR:minedno -2.5831 0.9328 -2.769 0.005617 **
sppDM:minedno -2.1495 0.8258 -2.603 0.009245 **
sppEC-A:minedno -1.5828 0.9461 -1.673 0.094339 .
sppEC-L:minedno -1.3383 0.8493 -1.576 0.115100
sppDES-L:minedno -1.9358 0.8164 -2.371 0.017729 *
sppDF:minedno -2.7426 0.8217 -3.338 0.000844 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
</figure>
<p>
Now the <code>gam()</code> version, conveniently with a confidence interval
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">gam.vcomp</span><span class="p">(</span><span class="n">nbgam2</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Standard deviations and 0.95 confidence intervals:
std.dev lower upper
s(site) 0.5325309 0.327768 0.8652132
Rank: 1/1</code></pre>
</figure>
<p>
One further analysis that <span class="citation" data-cites="Brooks2017-so">Brooks et al. (2017)</span> do with the <code>Salamanders</code> data (in their Appendix B) is to demonstrate how to generate and plot fitted values from the model. To do this, the analyst needs to consider whether, and how, to marginalise over or condition on the random effects. The Appendix has some details on this more generally (via a linked reference) and more specific pointers on how to go about doing this with <code>glmmTMB()</code> models. In the next few code chunks I will show how to achieve the result from their section <em>Alternative prediction method</em>, where the aim is to predict at the population mode by setting the random effect component to 0. To illustrate this, <span class="citation" data-cites="Brooks2017-so">Brooks et al. (2017)</span> use the more complex ZIP model with linear predictors for both the mean and the zero-inflation components of the model. I fit those models first
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## glmmTMB()</span><span class="w">
</span><span class="n">zipm3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glmmTMB</span><span class="p">(</span><span class="n">count</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">spp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">mined</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">site</span><span class="p">),</span><span class="w"> </span><span class="n">zi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">spp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">mined</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Salamanders</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">poisson</span><span class="p">)</span><span class="w">
</span><span class="c1">## gam()</span><span class="w">
</span><span class="n">zipgam3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">count</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">spp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">mined</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">site</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"re"</span><span class="p">),</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="n">spp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">mined</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Salamanders</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ziplss</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span></code></pre>
</figure>
<p>
The <code>glmmTMB()</code> model has the zero-inflation linear predictor specified via the <code>ziformula</code> argument (abbreviated to <code>zi</code> above). With <code>gam()</code>, however, multiple linear predictors are specified via a list of formula objects, only the first of which has a response (left-hand side). The first formula, with the response, is for the Poisson mean, whilst the second is for the zero-inflation component. Note also that we use the special <code>ziplss()</code> family and that the model is now estimated using REML, because that is the only option available for these models, which Simon Wood calls <strong>general smooth models</strong> <span class="citation" data-cites="Wood2016-fx">(Wood et al., 2016)</span>. Do note that there is (as of writing) no <code>link</code> argument for the <code>ziplss()</code> family; this is due to the way the model is parameterised internally in the software, and it means we will have to pay particular attention to the link functions when back-transforming predictions shortly.
</p>
<p>
To recreate part of Figure B.3 in Appendix B <span class="citation" data-cites="Brooks2017-so">(Brooks et al., 2017)</span>, the code below predicts from the fitted <code>gam()</code> model for all combinations of the factors <code>mined</code> and <code>spp</code>. Notice how we have to specify a <code>site</code> in the prediction data, otherwise <code>predict()</code> will throw a tantrum. To set the random effect for <code>site</code> to zero, use the <code>exclude</code> argument. To exclude (i.e. set to zero) any model term, you supply a character vector or list of terms to <code>exclude</code>. For smooth terms, these must be named as they appear in <code>summary(model)</code>, hence the use of <code>"s(site)"</code>. The final step is to call <code>predict()</code> with <code>type = "link"</code>. This will return a two-column matrix (or a list of two-column matrices if <code>se.fit = TRUE</code> is also used).
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## Newdata</span><span class="w">
</span><span class="n">newd0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">unique</span><span class="p">(</span><span class="n">Salamanders</span><span class="p">[,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"mined"</span><span class="p">,</span><span class="s2">"spp"</span><span class="p">)]),</span><span class="w"> </span><span class="n">site</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"R -1"</span><span class="p">))</span><span class="w">
</span><span class="n">rownames</span><span class="p">(</span><span class="n">newd0</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rownames</span><span class="p">(</span><span class="n">newd</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="kc">NULL</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">zipgam3</span><span class="p">,</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">exclude</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"s(site)"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"link"</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">pred</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> [,1] [,2]
1 0.36061171 -3.7727169
2 0.94857203 0.3875087
3 0.07601834 -2.6504763
4 0.35637762 -1.3459854
5 0.55867674 -1.4747253
6 1.14836206 0.3266343</code></pre>
</figure>
<p>
The first column is the predicted value of the response from the Poisson part of the model <em>on the scale of the linear predictor</em> (the log scale). The second column is the predicted value from the zero-inflation component and is on the complementary log-log scale. Both of these need to be back-transformed to their respective response scales and then multiplied together. To get the inverse link for the zero-inflation part, I grab the inverse link function from the base R <code>binomial()</code> family with the appropriate link specified. The second line of code below back-transforms each component with the appropriate inverse link, multiplies them together, and adds the resulting predicted values for each combination of <code>mined</code> and <code>spp</code> to the prediction data object.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ilink</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">binomial</span><span class="p">(</span><span class="n">link</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cloglog"</span><span class="p">)</span><span class="o">$</span><span class="n">linkinv</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">fitted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">pred</span><span class="p">[,</span><span class="m">1</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">pred</span><span class="p">[,</span><span class="m">2</span><span class="p">]))</span></code></pre>
</figure>
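<p>
As an aside (mine, not part of the original analysis), I believe recent versions of <strong>mgcv</strong> can do this back-transformation for you, as the <code>ziplss()</code> family supports <code>type = "response"</code> predictions that return the expected count directly. A quick check, under that assumption:
</p>
<figure class="highlight">
<pre><code class="language-r">## my aside: if ziplss() supports response-scale prediction (I believe
## recent mgcv versions do), this should match the manual calculation;
## if not, the manual route above is the safe one
fitted2 &lt;- predict(zipgam3, newd, exclude = "s(site)", type = "response")
all.equal(as.numeric(fitted2), newd$fitted)</code></pre>
</figure>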
<p>
A plot of the predicted values is then easily produced
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">spp</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mined</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span></code></pre>
</figure>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/compare-mgcv-with-glmmtmb-salamander-population-mode-plot-3-1.png" />
</p>
<p>
Because of the way the <code>gam()</code> model is implemented, I could also have computed the Bayesian credible intervals using the Bayesian covariance matrix of the model parameters via the <code>se.fit</code> argument to <code>predict()</code>. I'll perhaps save that for another day…
</p>
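<p>
In rough outline, though, it might look like the following sketch (my code, not the post's; note that forming an interval for each linear predictor separately and multiplying the limits ignores the covariance between the two components, so simulating from the approximate posterior of the coefficients would be more defensible):
</p>
<figure class="highlight">
<pre><code class="language-r">## a rough sketch, not the post's code: crude pointwise intervals for the
## population-mode predictions, ignoring the covariance between the two
## linear predictors
pred_se &lt;- predict(zipgam3, newd, exclude = "s(site)", type = "link",
                   se.fit = TRUE)
crit &lt;- qnorm((1 + 0.95) / 2)
## column 1: Poisson linear predictor (log link)
## column 2: zero-inflation linear predictor (complementary log-log link)
lwr &lt;- exp(pred_se$fit[, 1] - crit * pred_se$se.fit[, 1]) *
    ilink(pred_se$fit[, 2] - crit * pred_se$se.fit[, 2])
upr &lt;- exp(pred_se$fit[, 1] + crit * pred_se$se.fit[, 1]) *
    ilink(pred_se$fit[, 2] + crit * pred_se$se.fit[, 2])
newd &lt;- transform(newd, lower = lwr, upper = upr)</code></pre>
</figure>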
<h3 id="owls">
Owls
</h3>
<p>
The <code>Owls</code> data are also available in the <strong>glmmTMB</strong> package, which I load and then do a little processing of the data to simplify the name of the response variable and to mean centre the <code>ArrivalTime</code> covariate.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">data</span><span class="p">(</span><span class="n">Owls</span><span class="p">,</span><span class="w"> </span><span class="n">package</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"glmmTMB"</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">Owls</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sub</span><span class="p">(</span><span class="s2">"SiblingNegotiation"</span><span class="p">,</span><span class="w"> </span><span class="s2">"NCalls"</span><span class="p">,</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">Owls</span><span class="p">))</span><span class="w">
</span><span class="n">Owls</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">Owls</span><span class="p">,</span><span class="w"> </span><span class="n">cArrivalTime</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ArrivalTime</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">ArrivalTime</span><span class="p">))</span></code></pre>
</figure>
<p>
Two ZIP models are considered
</p>
<ol type="1">
<li>
a ZIP with constant zero-inflation (an intercept-only model for the zero-inflation), and
</li>
<li>
a ZIP with complex zero-inflation, where one covariate and a random effect for <code>Nest</code> are included in the linear predictor of the zero-inflation part of the model.
</li>
</ol>
<p>
The constant zero-inflation models are fitted using the <code>ziformula</code> argument for <code>glmmTMB()</code> with <code>family = poisson</code>, whilst for <code>gam()</code> we use a list of two formula objects, the second for the ZI linear predictor, and the <code>ziplss()</code> family. Note that this model could also be fitted using the <code>ziP()</code> family in <strong>mgcv</strong>, but that employs a different, simpler fitting algorithm, so to facilitate comparison with the more complex model I use <code>ziplss()</code> instead; a sketch of a <code>ziP()</code> fit follows for reference.
</p>
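<p>
For reference, a <code>ziP()</code> version of the conditional model might look something like this (a sketch of mine only; it isn't fitted or compared in what follows):
</p>
<figure class="highlight">
<pre><code class="language-r">## sketch only: the simpler zero-inflated Poisson family in mgcv, in which
## the zero-inflation probability is a function of the Poisson linear
## predictor rather than being given its own formula
m1.zip &lt;- gam(NCalls ~ (FoodTreatment + cArrivalTime) * SexParent +
                  offset(logBroodSize) + s(Nest, bs = "re"),
              data = Owls, family = ziP(), method = "REML")</code></pre>
</figure>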
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1.tmb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glmmTMB</span><span class="p">(</span><span class="n">NCalls</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="p">(</span><span class="n">FoodTreatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">cArrivalTime</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">SexParent</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">offset</span><span class="p">(</span><span class="n">logBroodSize</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Nest</span><span class="p">),</span><span class="w">
</span><span class="n">ziformula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Owls</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">poisson</span><span class="p">)</span><span class="w">
</span><span class="n">m1.gam</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">NCalls</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="p">(</span><span class="n">FoodTreatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">cArrivalTime</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">SexParent</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">offset</span><span class="p">(</span><span class="n">logBroodSize</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Nest</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"re"</span><span class="p">),</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Owls</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ziplss</span><span class="p">())</span></code></pre>
</figure>
<p>
Again note that these models are not estimated in the same way; <code>glmmTMB()</code> estimates the model parameters using maximum likelihood, whilst only REML estimation is available for the <code>ziplss()</code> family with <code>gam()</code>. In <code>gam()</code>, the intercept-only ZI linear predictor is specified with the formula <code>~ 1</code>.
</p>
<p>
To compare the estimates of the model coefficients I wrote a little function to extract the estimated values and their standard errors from the two model objects
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">createCoeftab</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">TMB</span><span class="p">,</span><span class="w"> </span><span class="n">GAM</span><span class="p">,</span><span class="w"> </span><span class="n">GAMrange</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">bTMB</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fixef</span><span class="p">(</span><span class="n">TMB</span><span class="p">)</span><span class="o">$</span><span class="n">cond</span><span class="p">[</span><span class="m">-1</span><span class="p">]</span><span class="w">
</span><span class="n">bGAM</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">GAM</span><span class="p">)[</span><span class="n">GAMrange</span><span class="p">]</span><span class="w">
</span><span class="n">seTMB</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="n">vcov</span><span class="p">(</span><span class="n">TMB</span><span class="p">)</span><span class="o">$</span><span class="n">cond</span><span class="p">)[</span><span class="m">-1</span><span class="p">]</span><span class="w">
</span><span class="n">seGAM</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="n">vcov</span><span class="p">(</span><span class="n">GAM</span><span class="p">))[</span><span class="n">GAMrange</span><span class="p">]</span><span class="w">
</span><span class="n">nms</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">bTMB</span><span class="p">)</span><span class="w">
</span><span class="n">nms</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sub</span><span class="p">(</span><span class="s2">"FoodTreatment"</span><span class="p">,</span><span class="w"> </span><span class="s2">"FT"</span><span class="p">,</span><span class="w"> </span><span class="n">nms</span><span class="p">)</span><span class="w">
</span><span class="n">nms</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sub</span><span class="p">(</span><span class="s2">"cArrivalTime"</span><span class="p">,</span><span class="w"> </span><span class="s2">"ArrivalTime"</span><span class="p">,</span><span class="w"> </span><span class="n">nms</span><span class="p">)</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"glmmTMB"</span><span class="p">,</span><span class="w"> </span><span class="s2">"mgcv::gam"</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">bGAM</span><span class="p">)),</span><span class="w">
</span><span class="n">term</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">nms</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
</span><span class="n">estimate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unname</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">bTMB</span><span class="p">,</span><span class="w"> </span><span class="n">bGAM</span><span class="p">)))</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w">
</span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">estimate</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">seTMB</span><span class="p">,</span><span class="w"> </span><span class="n">seGAM</span><span class="p">)),</span><span class="w">
</span><span class="n">lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">estimate</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">seTMB</span><span class="p">,</span><span class="w"> </span><span class="n">seGAM</span><span class="p">)))</span><span class="w">
</span><span class="n">df</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
Passing each of the models to <code>createCoeftab()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1.coefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">createCoeftab</span><span class="p">(</span><span class="n">m1.tmb</span><span class="p">,</span><span class="w"> </span><span class="n">m1.gam</span><span class="p">,</span><span class="w"> </span><span class="n">GAMrange</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="m">6</span><span class="p">)</span></code></pre>
</figure>
<p>
results in a tidy data frame suitable for plotting with <code>ggplot()</code>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">m1.coefs</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">estimate</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">term</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">xmax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upper</span><span class="p">,</span><span class="w"> </span><span class="n">xmin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lower</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_pointrangeh</span><span class="p">(</span><span class="n">position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">position_dodgev</span><span class="p">(</span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Regression estimate"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Comparing mgcv with glmmTMB"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Owls: ZIP with constant zero-inflation"</span><span class="p">,</span><span class="w">
</span><span class="n">caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Bars are Ā±1 SE"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/compare-mgcv-with-glmmtmb-plot-m1-coefs-1.png" alt="Comparison of estimated model fixed effect parameters for the constant zer-inflation model fitted to the owl nestling behaviour data." />
<figcaption>
Comparison of estimated model fixed effect parameters for the constant zero-inflation model fitted to the owl nestling behaviour data.
</figcaption>
</figure>
<p>
As can be seen in the figure, the estimates from the two functions are quite similar.
</p>
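<p>
The coefficient plot doesn't cover the zero-inflation intercepts. We can pull those out too (my quick check, assuming the usual <strong>mgcv</strong> convention of a <code>.1</code> suffix on coefficients of the second linear predictor), but note the important caveat in the comments:
</p>
<figure class="highlight">
<pre><code class="language-r">## my quick check, not in the original post
fixef(m1.tmb)$zi               # glmmTMB: logit scale, P(structural zero)
coef(m1.gam)["(Intercept).1"]  # ziplss: cloglog scale, P(potential presence)
## these are NOT directly comparable: the link functions differ, and ziplss
## models the probability of a potential non-zero count rather than the
## probability of an excess zero</code></pre>
</figure>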
<p>
The more-complex models with covariates in the ZI linear predictor are fitted next
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m2.tmb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glmmTMB</span><span class="p">(</span><span class="n">NCalls</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="p">(</span><span class="n">FoodTreatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">cArrivalTime</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">SexParent</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">offset</span><span class="p">(</span><span class="n">logBroodSize</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Nest</span><span class="p">),</span><span class="w">
</span><span class="n">ziformula</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">FoodTreatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Nest</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Owls</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">poisson</span><span class="p">)</span><span class="w">
</span><span class="n">m2.gam</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">NCalls</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="p">(</span><span class="n">FoodTreatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">cArrivalTime</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">SexParent</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">offset</span><span class="p">(</span><span class="n">logBroodSize</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Nest</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"re"</span><span class="p">),</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="n">FoodTreatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Nest</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"re"</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Owls</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ziplss</span><span class="p">())</span></code></pre>
</figure>
<p>
As before, we gather the model coefficients
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m2.coefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">createCoeftab</span><span class="p">(</span><span class="n">m2.tmb</span><span class="p">,</span><span class="w"> </span><span class="n">m2.gam</span><span class="p">,</span><span class="w"> </span><span class="n">GAMrange</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="o">:</span><span class="m">6</span><span class="p">)</span></code></pre>
</figure>
<p>
and plot them
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">m2.coefs</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">estimate</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">term</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">shape</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">model</span><span class="p">,</span><span class="w">
</span><span class="n">xmax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upper</span><span class="p">,</span><span class="w"> </span><span class="n">xmin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lower</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_pointrangeh</span><span class="p">(</span><span class="n">position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">position_dodgev</span><span class="p">(</span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Regression estimate"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Comparing mgcv with glmmTMB"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Owls: ZIP with complex zero-inflation"</span><span class="p">,</span><span class="w">
</span><span class="n">caption</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Bars are Ā±1 SE"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/compare-mgcv-with-glmmtmb-plot-m2-coefs-1.png" alt="Comparison of estimated model fixed effect parameters for the complex zer-inflation model fitted to the owl nestling behaviour data." />
<figcaption>
Comparison of estimated model fixed effect parameters for the complex zero-inflation model fitted to the owl nestling behaviour data.
</figcaption>
</figure>
<p>
and likewise as before, the estimates of the fixed effect terms are very similar indeed.
</p>
<h3 id="conclusions">
Conclusions
</h3>
<p>
The comparisons above show that <code>mgcv::gam()</code> and <code>glmmTMB()</code> produce very similar estimates for the two models. And some crude timings showed that <code>gam()</code> was 20–40% faster than <code>glmmTMB()</code> at fitting the examples discussed in the paper. So all is roses, right!? Who needs <code>glmmTMB()</code>?
</p>
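<p>
(Those timings were of the crude <code>system.time()</code> variety; the sketch below is my reconstruction of the sort of thing involved, not the code actually used.)
</p>
<figure class="highlight">
<pre><code class="language-r">## my reconstruction of a crude timing comparison, using the negative
## binomial Salamanders models fitted earlier
system.time(glmmTMB(count ~ spp * mined + (1 | site),
                    data = Salamanders, family = nbinom2))
system.time(gam(count ~ spp * mined + s(site, bs = "re"),
                data = Salamanders, family = nb(), method = "REML"))</code></pre>
</figure>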
<p>
That would, however, be totally the wrong message to take from this comparison. Most notably, and something that isn't surfaced in these simple examples, <code>gam()</code> is limited in the complexity of the random effects it can efficiently represent in models:
</p>
<ul>
<li>
it can't do correlated random effects for random slopes and intercepts models (as far as I can tell anyway), and, probably the deal breaker,
</li>
<li>
model fitting with <code>gam()</code> gets bogged down quickly if the number of levels in a random effect gets large. <a href="https://twitter.com/jaimedash">Jaime Ashander</a> did some quick tests with a larger version of the Salamanders data with hundreds of <code>site</code>s, and <code>glmmTMB()</code> totally dominated <code>gam()</code>; a sketch of that kind of test follows this list.
</li>
</ul>
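<p>
A sketch of that kind of scaling test (my own construction, not Jaime Ashander's actual code) might look like this:
</p>
<figure class="highlight">
<pre><code class="language-r">## simulate a Salamanders-like data set with many more random effect levels
set.seed(42)
n_site &lt;- 500   # cf. the 23 sites in the real data
n_obs  &lt;- 10    # observations per site
big &lt;- data.frame(site = factor(rep(seq_len(n_site), each = n_obs)),
                  x    = runif(n_site * n_obs))
b &lt;- rnorm(n_site, sd = 0.5)  # site-level random effects
big$y &lt;- rpois(nrow(big), lambda = exp(0.5 * big$x + b[big$site]))
## time the two fits; gam() slows markedly as n_site grows
system.time(gam(y ~ x + s(site, bs = "re"), data = big,
                family = poisson, method = "REML"))
system.time(glmmTMB(y ~ x + (1 | site), data = big, family = poisson))</code></pre>
</figure>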
<p>
And that's fine; <code>gam()</code> was not designed to fit GLMMs. There are no fewer than <strong>three</strong> implementations <em>by Simon Wood alone</em> of functions to fit GAMs with complex random effects in mixed model software (<code>gamm()</code> to fit with <code>lme()</code>, <code>gamm4()</code> to fit using <code>lmer()</code> or <code>glmer()</code>, and <code>jagam()</code> in <strong>mgcv</strong> to fit via JAGS). Furthermore, <code>glmmTMB()</code> is currently more flexible in the range of models it can fit than any of these implementations except the JAGS route, because the <code>nb()</code>, <code>ziP()</code>, and <code>ziplss()</code> families only work with <code>gam()</code>.
</p>
<p>
What the above comparison illustrates, however, is that if you either don't have complex or numerous random effects, or you don't mind running models overnight, <code>gam()</code> is a good option for fitting GLMMs. Plus you have the advantage of estimating smooth functions of covariates, which is one area where <code>glmmTMB()</code> is currently lacking compared to <code>gam()</code>.
</p>
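<p>
For example (a sketch of mine, not a model considered in the post), letting arrival time enter the Owls model as a smooth is a small change to the first formula:
</p>
<figure class="highlight">
<pre><code class="language-r">## sketch: replace the linear cArrivalTime effect with a smooth; note this
## drops the arrival-time-by-sex interaction (s(cArrivalTime, by = SexParent)
## would be one way to keep something like it)
m2.smooth &lt;- gam(list(NCalls ~ FoodTreatment * SexParent + s(cArrivalTime) +
                          offset(logBroodSize) + s(Nest, bs = "re"),
                      ~ FoodTreatment + s(Nest, bs = "re")),
                 data = Owls, family = ziplss())</code></pre>
</figure>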
<p>
That said, it should be possible to emulate what Paul-Christian Bürkner has done in his <a href="https://cran.r-project.org/package=brms"><strong>brms</strong> package</a> (and similar implementations by Simon Wood in <code>gamm4()</code>) and use <strong>mgcv</strong> to set up the correct model matrices for the random effect representation of splines, which could then be fitted using <code>glmmTMB()</code>.
</p>
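<p>
The key pieces are already exported by <strong>mgcv</strong>. Very roughly, and assuming I have the details of the API right, the decomposition step looks like the sketch below; the glue that maps the result into a <code>glmmTMB()</code> fit is left entirely unwritten:
</p>
<figure class="highlight">
<pre><code class="language-r">## rough sketch: decompose a spline into fixed and random effect parts,
## which is what gamm4() and brms do internally
df &lt;- data.frame(x = runif(100))
sm &lt;- smoothCon(s(x, k = 10), data = df, absorb.cons = TRUE)[[1]]
re &lt;- smooth2random(sm, vnames = "", type = 2)  # type = 2: lme4-style
str(re, max.level = 1)  # re$Xf: fixed part; re$rand: random effect matrices</code></pre>
</figure>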
<p>
Finally, this was a fun exercise to replicate the analyses in <span class="citation" data-cites="Brooks2017-so">Brooks et al. (2017)</span>, motivated by a desire to understand what <strong>mgcv</strong> and <code>gam()</code> are doing with these random effect splines. It wasn't intended as a prize fight between two title contenders; hopefully this write-up didn't come across that way. I also learned a lot more about <strong>glmmTMB</strong>, which is shaping up nicely and looks like it'll have a place in my modelling toolbox.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Bolker2013-vl">
<p>
Bolker, B. M., Gardner, B., Maunder, M., Berg, C. W., Brooks, M., Comita, L., et al. (2013). Strategies for fitting nonlinear ecological models in R, AD Model Builder, and BUGS. <em>Methods in Ecology and Evolution</em> 4, 501–512. doi:<a href="https://doi.org/10.1111/2041-210X.12044">10.1111/2041-210X.12044</a>.
</p>
</div>
<div id="ref-Brooks2017-so">
<p>
Brooks, M. E., Kristensen, K., Benthem, K. J. van, Magnusson, A., Berg, C. W., Nielsen, A., et al. (2017). Modeling zero-inflated count data with glmmTMB. <em>bioRxiv</em>, 132753. doi:<a href="https://doi.org/10.1101/132753">10.1101/132753</a>.
</p>
</div>
<div id="ref-Price2015-se">
<p>
Price, S. J., Muncy, B. L., Bonner, S. J., Drayer, A. N., and Barton, C. D. (2015). Data from: Effects of mountaintop removal mining and valley filling on the occupancy and abundance of stream salamanders. doi:<a href="https://doi.org/10.5061/dryad.5m8f6">10.5061/dryad.5m8f6</a>.
</p>
</div>
<div id="ref-Price2016-no">
<p>
Price, S. J., Muncy, B. L., Bonner, S. J., Drayer, A. N., and Barton, C. D. (2016). Effects of mountaintop removal mining and valley filling on the occupancy and abundance of stream salamanders. <em>Journal of Applied Ecology</em> 53, 459–468. doi:<a href="https://doi.org/10.1111/1365-2664.12585">10.1111/1365-2664.12585</a>.
</p>
</div>
<div id="ref-Roulin2007-rq">
<p>
Roulin, A., and Bersier, L.-F. (2007). Nestling barn owls beg more intensely in the presence of their mother than in the presence of their father. <em>Animal Behaviour</em> 74, 1099–1106.
</p>
</div>
<div id="ref-Wood2016-fx">
<p>
Wood, S. N., Pya, N., and Säfken, B. (2016). Smoothing parameter and model selection for general smooth models. <em>Journal of the American Statistical Association</em> 111, 1548–1563. doi:<a href="https://doi.org/10.1080/01621459.2016.1180986">10.1080/01621459.2016.1180986</a>.
</p>
</div>
<div id="ref-Zuur2009-vg">
<p>
Zuur, A. F., Ieno, E. N., Walker, N., Saveliev, A. A., and Smith, G. M. (2009). <em>Mixed effects models and extensions in ecology with R</em>. Springer New York. doi:<a href="https://doi.org/10.1007/978-0-387-87458-6">10.1007/978-0-387-87458-6</a>.
</p>
</div>
</div>
Prediction intervals for GLMs part II
Gavin L. Simpson
2017-05-01T09:00:00-06:00
2017-05-01T09:00:00-06:00
https://www.fromthebottomoftheheap.net/2017/05/01/glm-prediction-intervals-ii/
<p>
One of my more popular <a href="http://stackoverflow.com/a/14424417/429846">answers</a> on StackOverflow concerns the issue of prediction intervals for a generalized linear model (GLM). Comments, even on StackOverflow, aren't a good place for a discussion, so I thought I'd post something here on my blog that went into a bit more detail as to why, for some common types of GLMs, prediction intervals aren't that useful and require a lot more thinking about what they mean and how they should be calculated. I've broken this into two parts; in this, the second part, I look at Poisson models.
</p>
<p>
The second example (purely because I happen to have it handy from teaching this semester) is from <span class="citation" data-cites="Korner-Nievergelt2015-tk">Korner-Nievergelt et al. (2015)</span>, and concerns the number of breeding pairs of the common whitethroat (<em>Sylvia communis</em>). This species likes to inhabit field margins and fallow land and has been adversely affected by intensive agricultural activities reducing these types of habitat in the landscape. As a mitigation effort, wildflower fields are sown and left largely unmanaged for several years. The data come from a study looking at how the number of breeding pairs of common whitethroat changes as the composition and structure of the plant community changes over time. The data are in the <strong>blmeco</strong> package, available on CRAN.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## install.packages("blmeco") # first, if not already installed</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"blmeco"</span><span class="p">)</span><span class="w">
</span><span class="n">data</span><span class="p">(</span><span class="n">wildflowerfields</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span></code></pre>
</figure>
<p>
The example in <span class="citation" data-cites="Korner-Nievergelt2015-tk">Korner-Nievergelt et al. (2015)</span> uses a Poisson GLM with a quadratic effect of the variable <code>age</code>. Instead I'll use a Poisson GAM, but in all other respects the analysis follows that from the textbook (only the year 2007 data are used, and field size is converted to hectares).
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span><span class="w">
</span><span class="n">wf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">subset</span><span class="p">(</span><span class="n">wildflowerfields</span><span class="p">,</span><span class="w"> </span><span class="n">year</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">2007</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="n">wf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">wf</span><span class="p">,</span><span class="w"> </span><span class="n">size.z</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">size</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">sd</span><span class="p">(</span><span class="n">size</span><span class="p">))</span><span class="w">
</span><span class="n">mod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">bp</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">size.z</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">offset</span><span class="p">(</span><span class="nf">log</span><span class="p">(</span><span class="n">size</span><span class="p">)),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">wf</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">poisson</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: poisson
Link function: log
Formula:
bp ~ s(age, k = 6) + size.z + offset(log(size))
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1791 0.3333 -3.538 0.000404 ***
size.z -0.5283 0.2893 -1.826 0.067861 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(age) 2.608 3.223 7.323 0.074 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.207 Deviance explained = 46.3%
-REML = 33.589 Scale est. = 1 n = 41</code></pre>
</figure>
<p>
The primary variable of interest shows a moderate amount of non-linearity, similar to the quadratic effect of <code>age</code> in the version from the textbook, though the effect of field age is weak at best. The fitted model is illustrated graphically below, holding <code>size</code> constant at the mean field size
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ilink</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">family</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span><span class="o">$</span><span class="n">linkinv</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">wf</span><span class="p">,</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">age</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">age</span><span class="p">),</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">300</span><span class="p">),</span><span class="w">
</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">size.z</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"link"</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)))</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">fitted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="p">),</span><span class="w"> </span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)),</span><span class="w">
</span><span class="n">lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">wf</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">bp</span><span class="o">/</span><span class="n">size</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lower</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upper</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">),</span><span class="w">
</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.1</span><span class="p">,</span><span class="w"> </span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_jitter</span><span class="p">(</span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.1</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Age [years]"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">expression</span><span class="p">(</span><span class="s2">"Number of Breeding Pairs ["</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">pairs</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">ha</span><span class="o">^</span><span class="p">{</span><span class="m">-1</span><span class="p">}</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="s2">"]"</span><span class="p">),</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Common Whitethroat densities in Wildflower Fields"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Estimated densities for average field of ~1.8 ha"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/glm-prediction-intervals-ii-plot-wildflowers-1.png" alt="The fitted GAM for the common whitethroat data, showing the estimated number of breeding pairs per hectare with a 95% pointwise confidence interval. The points are the observed densities of breeding pairs." />
<figcaption>
The fitted GAM for the common whitethroat data, showing the estimated number of breeding pairs per hectare with a 95% pointwise confidence interval. The points are the observed densities of breeding pairs.
</figcaption>
</figure>
<p>
So far so good, but how do we interpret this model? For simplicity, let's assume that fields only come in integer ages. What the model implies is that for each integer age the observations are best fitted by (or described by; or generated from) a Poisson model with parameter <span class="math inline">\(\lambda\)</span> equal to the value of the solid line at each particular age. This value of <span class="math inline">\(\lambda\)</span> is just an estimate of the true value, and so we might envisage the observations for each age as having come from Poisson distributions with values of <span class="math inline">\(\lambda\)</span> given by the values of the upper and lower confidence band also shown in the figure above. For fields of two and five years of age these distributions look like this
</p>
<p>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/glm-prediction-intervals-ii-implied-poisson-distributions-1.png" />
</p>
<p>
The fitted Poisson distributions for the two field ages are shown by the green points and lines in the figure above. The effect of field age is to shift the estimated Poisson distribution to the right, towards on average higher numbers of breeding pairs. The uncertainty in the estimated model is shown by the orange and blue points and lines; these are based on the lower and upper 95% pointwise confidence interval on the estimated mean number of breeding pairs for fields of two and five years of age. The orange points illustrate the Poisson distribution from which the observations might have been derived if the true value of <span class="math inline">\(\lambda\)</span> were at the lower end of the confidence interval. The blue points show the Poisson distribution if the true value of <span class="math inline">\(\lambda\)</span> were at the upper end of the confidence interval. Each of these distributions implies, potentially at least, different predicted numbers of breeding pairs.
</p>
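<p>
(The figure above was produced with code not shown in the post. Purely as a sketch of the idea, the implied distributions can be computed with <code>dpois()</code>; the two values of <span class="math inline">\(\lambda\)</span> below are the fitted means for two- and five-year-old fields, which we compute properly a little further down.)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## sketch only: Poisson densities implied by the fitted model for
## average-sized fields of two and five years of age; the lambda
## values are the fitted means computed later in the post
pairs <- 0:7
imp <- data.frame(pairs = rep(pairs, 2),
                  age = factor(rep(c(2, 5), each = length(pairs))),
                  density = c(dpois(pairs, lambda = 0.1793),
                              dpois(pairs, lambda = 0.8785)))
ggplot(imp, aes(x = pairs, y = density)) +
    geom_point() +
    facet_wrap(~ age, labeller = label_both)</code></pre>
</figure>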
<p>
We have estimated the expected number of breeding pairs given the age of the field and its size. We also have a (pointwise) 95% confidence interval on that expectation. As before, this isn't a prediction interval, so what would one of those look like in this case? Somewhat similar to those we created for the binomial GLM earlier, except now we have posterior densities (the probability density implied by the Poisson distribution with <span class="math inline">\(\lambda\)</span> given as a function of field age) for all the integers 0–∞, although once we get above 10 breeding pairs the density is going to be effectively 0 even if not technically so.
</p>
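<p>
That claim about the density above 10 pairs is easy to check with <code>ppois()</code>, the Poisson distribution function. Even at the largest value of <span class="math inline">\(\lambda\)</span> we'll meet below (about 1.69, the upper confidence limit for a five-year-old field), essentially all of the probability mass lies at 10 or fewer pairs
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## cumulative probability of 10 or fewer breeding pairs; effectively 1
ppois(10, lambda = 1.69)</code></pre>
</figure>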
<p>
Note that I said integers above; we can't have 2.5 breeding pairs <em>as a prediction</em>. Hence any prediction interval is really talking about points of probability for each integer <span class="math inline">\(\{0, 1, 2, \ldots\}\)</span> (even if we might consider a much smaller upper limit than that), not a continuous interval. Having said that, perhaps I'm being too pedantic? In some instances, the upper and lower 2.5<sup>th</sup> and 97.5<sup>th</sup> probability quantiles of the implied Poisson distribution do begin to look more like a prediction interval.
</p>
<p>
To illustrate, I'll work my way through some code showing some ways of thinking about what the fitted model says in terms of predicting the numbers of breeding pairs of common whitethroats. First a little bit of prep; I'll illustrate various intervals for two hypothetical fields of average size, one created two years ago and a second five years ago.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">size.z</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"fit"</span><span class="p">,</span><span class="w"> </span><span class="s2">"se"</span><span class="p">))</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">pred</span><span class="p">)</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">)),</span><span class="w">
</span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="p">),</span><span class="w"> </span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">)))</span><span class="w">
</span><span class="n">p</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> age size size.z fit se lower lambda upper
1 2 1 0 -1.7184424 0.5805902 0.05615594 0.1793453 0.5727752
2 5 1 0 -0.1295676 0.3260103 0.45767855 0.8784752 1.6861589</code></pre>
</figure>
<p>
<code>p</code> contains the estimated value of <span class="math inline">\(\lambda\)</span> (the expected number of breeding pairs), and the upper and lower 95% pointwise interval about this expected count, for the two fields. First, the 95% interval for the model-estimated <span class="math inline">\(\lambda\)</span> for the younger of the two fields, based on <code>qpois()</code>, the quantile function of the conditional distribution of the number of breeding pairs given field age
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">qpois</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">),</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="s2">"lambda"</span><span class="p">])</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0 1</code></pre>
</figure>
<p>
Hence we might expect either 0 or 1 breeding pairs. But, we haven't accounted for the uncertainty in the estimated <span class="math inline">\(\lambda\)</span>. At the lower end of the 95% interval on the estimated <span class="math inline">\(\lambda\)</span> the prediction interval would be
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">qpois</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">),</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="s2">"lower"</span><span class="p">])</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0 1</code></pre>
</figure>
<p>
and
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">qpois</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">),</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="s2">"upper"</span><span class="p">])</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0 2</code></pre>
</figure>
<p>
for the upper end, leading to a prediction interval of 0–2 breeding pairs. The same prediction interval for the five-year-old field would be
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">qpois</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">),</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">p</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="s2">"lower"</span><span class="p">],</span><span class="w"> </span><span class="n">p</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="s2">"upper"</span><span class="p">]))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0 5</code></pre>
</figure>
<p>
We can also look at the probability densities of the Poisson distribution for the estimated value of <span class="math inline">\(\lambda\)</span> and its 95% confidence interval. The table below shows these probability densities for a two-year-old field
</p>
<table>
<caption>
Posterior densities for selected numbers of breeding pairs for a two-year-old field. Columns show the densities for a Poisson distribution with <span class="math inline">\(\lambda\)</span> equal to the estimated value and the lower and upper limits on the estimated value for this field.
</caption>
<thead>
<tr class="header">
<th style="text-align: right;">
# of pairs
</th>
<th style="text-align: right;">
lower
</th>
<th style="text-align: right;">
estimate
</th>
<th style="text-align: right;">
upper
</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: right;">
0
</td>
<td style="text-align: right;">
0.9454
</td>
<td style="text-align: right;">
0.8358
</td>
<td style="text-align: right;">
0.5640
</td>
</tr>
<tr class="even">
<td style="text-align: right;">
1
</td>
<td style="text-align: right;">
0.0531
</td>
<td style="text-align: right;">
0.1499
</td>
<td style="text-align: right;">
0.3230
</td>
</tr>
<tr class="odd">
<td style="text-align: right;">
2
</td>
<td style="text-align: right;">
0.0015
</td>
<td style="text-align: right;">
0.0134
</td>
<td style="text-align: right;">
0.0925
</td>
</tr>
<tr class="even">
<td style="text-align: right;">
3
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0008
</td>
<td style="text-align: right;">
0.0177
</td>
</tr>
<tr class="odd">
<td style="text-align: right;">
4
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0025
</td>
</tr>
<tr class="even">
<td style="text-align: right;">
5
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0003
</td>
</tr>
<tr class="odd">
<td style="text-align: right;">
6
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0000
</td>
</tr>
<tr class="even">
<td style="text-align: right;">
7
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0000
</td>
</tr>
</tbody>
</table>
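<p>
The densities in this table (and the next) come straight from <code>dpois()</code>; a minimal sketch for the two-year-old field, reusing the data frame <code>p</code> created above
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## densities of 0-7 breeding pairs under the estimated lambda and the
## lower and upper limits of its 95% interval, for the two-year-old field
pairs <- 0:7
round(data.frame(pairs    = pairs,
                 lower    = dpois(pairs, p[1, "lower"]),
                 estimate = dpois(pairs, p[1, "lambda"]),
                 upper    = dpois(pairs, p[1, "upper"])), 4)</code></pre>
</figure>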
<p>
For example, we'd expect to observe no breeding pairs in somewhere between 56% and 95% of average-sized, two-year-old fields, depending on where within the confidence interval the true <span class="math inline">\(\lambda\)</span> lies. The same values are shown for a five-year-old field in the table below
</p>
<table>
<caption>
Posterior densities for selected numbers of breeding pairs for a five-year-old field. Columns show the densities for a Poisson distribution with <span class="math inline">\(\lambda\)</span> equal to the estimated value and the lower and upper limits on the estimated value for this field.
</caption>
<thead>
<tr class="header">
<th style="text-align: right;">
# of pairs
</th>
<th style="text-align: right;">
lower
</th>
<th style="text-align: right;">
estimate
</th>
<th style="text-align: right;">
upper
</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: right;">
0
</td>
<td style="text-align: right;">
0.6328
</td>
<td style="text-align: right;">
0.4154
</td>
<td style="text-align: right;">
0.1852
</td>
</tr>
<tr class="even">
<td style="text-align: right;">
1
</td>
<td style="text-align: right;">
0.2896
</td>
<td style="text-align: right;">
0.3649
</td>
<td style="text-align: right;">
0.3123
</td>
</tr>
<tr class="odd">
<td style="text-align: right;">
2
</td>
<td style="text-align: right;">
0.0663
</td>
<td style="text-align: right;">
0.1603
</td>
<td style="text-align: right;">
0.2633
</td>
</tr>
<tr class="even">
<td style="text-align: right;">
3
</td>
<td style="text-align: right;">
0.0101
</td>
<td style="text-align: right;">
0.0469
</td>
<td style="text-align: right;">
0.1480
</td>
</tr>
<tr class="odd">
<td style="text-align: right;">
4
</td>
<td style="text-align: right;">
0.0012
</td>
<td style="text-align: right;">
0.0103
</td>
<td style="text-align: right;">
0.0624
</td>
</tr>
<tr class="even">
<td style="text-align: right;">
5
</td>
<td style="text-align: right;">
0.0001
</td>
<td style="text-align: right;">
0.0018
</td>
<td style="text-align: right;">
0.0210
</td>
</tr>
<tr class="odd">
<td style="text-align: right;">
6
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0003
</td>
<td style="text-align: right;">
0.0059
</td>
</tr>
<tr class="even">
<td style="text-align: right;">
7
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0000
</td>
<td style="text-align: right;">
0.0014
</td>
</tr>
</tbody>
</table>
<p>
I could repeat the process of simulating breeding pairs from the Poisson distributions with estimated values of <span class="math inline">\(\lambda\)</span>, but the code to illustrate this gets tedious and this post is long enough already.
</p>
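<p>
For the curious, though, here is a minimal sketch of what such a simulation might look like for the two-year-old field, again reusing <code>p</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## sketch: simulate counts of breeding pairs at the estimated lambda
## and at the limits of its 95% interval, for the two-year-old field
set.seed(1)
nsim <- 10000
sims <- sapply(p[1, c("lower", "lambda", "upper")],
               function(l) rpois(nsim, lambda = l))
apply(sims, 2, quantile, probs = c(0.025, 0.975))</code></pre>
</figure>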
<p>
The prediction intervals for the Poisson model are starting to look more like intervals than the ones for the binomial model we looked at earlier. They're still not something we can easily convey on a plot like we can with linear models and <code>predict.lm()</code>, however.
</p>
<p>
For continuous conditional distributions, prediction “intervals” act like their linear model counterparts, as long as we take the extra step of computing the prediction interval using the probability quantile function (the <code>qfoo()</code> functions in R, where <code>foo</code> is the abbreviation for the distribution) and potentially include the uncertainty in the estimated expectations (fitted values on the response scale), as we did in both examples above.
</p>
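<p>
As a quick, purely hypothetical illustration of that recipe, suppose we had a Gaussian GLM <code>m.gaus</code> and new data <code>nd</code> (both stand-ins, not objects from this post); a 95% prediction interval at the first new observation could be sketched as
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## sketch only: m.gaus and nd are hypothetical stand-ins
fv <- predict(m.gaus, nd)                  # expected values
sigma <- sqrt(summary(m.gaus)$dispersion)  # residual standard deviation
## this version is conditional on the estimated expectation; a fuller
## version would also propagate the uncertainty in fv, as above
qnorm(c(0.025, 0.975), mean = fv[1], sd = sigma)</code></pre>
</figure>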
<p>
OK, I think that's enough modelling pedantry for one [Ed: er-um, <em>two</em>] post.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Korner-Nievergelt2015-tk">
<p>
Korner-Nievergelt, F., von Felten, S., Roth, T., Guélat, J., Almasi, B., and Korner-Nievergelt, P. (2015). <em>Bayesian data analysis in ecology using linear models with R, BUGS, and Stan</em>. Elsevier Science & Technology Books.
</p>
</div>
</div>
Prediction intervals for GLMs part I
Gavin L. Simpson
2017-05-01T08:45:00-06:00
2017-05-01T08:45:00-06:00
https://www.fromthebottomoftheheap.net/2017/05/01/glm-prediction-intervals-i/
<p>
One of my more popular <a href="http://stackoverflow.com/a/14424417/429846">answers</a> on StackOverflow concerns the issue of prediction intervals for a generalized linear model (GLM). My answer really only addresses how to compute confidence intervals for parameters but in the comments I discuss the more substantive points raised by the OP in their question. Lately there's been a bit of back and forth between Jarrett Byrnes and myself about what a prediction “interval” for a GLM might mean. Comments, even on StackOverflow, aren't a good place for a discussion so I thought I'd post something here that went into a bit more detail as to why, for some common types of GLMs, prediction intervals aren't that useful and require a lot more thinking about what they mean and how they should be calculated. For illustration, I thought I'd use some small teaching example data sets, but whilst writing the post it started to get a little on the long side. So, I've broken it into two and in this part I look at logistic regression.
</p>
<p>
The first example concerns a small experiment on the rare insectivorous pitcher plant <em>Darlingtonia californica</em> (the cobra lily) used as an example in <span class="citation" data-cites="Gotelli2013-wm">Gotelli and Ellison (2013)</span> and originally reported in <span class="citation" data-cites="Dixon2005-bb">Dixon et al. (2005)</span>. <em>Darlingtonia</em> grows leaves that are modified to form a pitcher trap, which is filled with nectar that attracts insects, in particular vespulid wasps (<em>Vespula atropilosa</em>). The observations in the data set are on the height of pitcher traps (<code>leafHeight</code>) and whether or not the leaf was visited by a wasp (<code>visited</code>). The code chunk below downloads the data from the book's website and loads it into R ready for use.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">darlurl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"http://harvardforest.fas.harvard.edu/sites/harvardforest.fas.harvard.edu/files/ellison-pubs/2004/DarlingtoniaData3.txt"</span><span class="w">
</span><span class="n">darl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">read.fwf</span><span class="p">(</span><span class="n">darlurl</span><span class="p">,</span><span class="w"> </span><span class="n">widths</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">8</span><span class="p">,</span><span class="m">9</span><span class="p">),</span><span class="w"> </span><span class="n">header</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1L</span><span class="p">),</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="s2">"leafHeight"</span><span class="p">,</span><span class="w"> </span><span class="s2">"visited"</span><span class="p">))</span><span class="w">
</span><span class="n">darl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">darl</span><span class="p">,</span><span class="w"> </span><span class="n">visited</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.logical</span><span class="p">(</span><span class="n">visited</span><span class="p">))</span></code></pre>
</figure>
<p>
Kernel density estimates of the distributions of the leaf heights for visited and unvisited leaves are one way to visualise these data. Here we use <strong>ggplot2</strong>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"Leaf height [cm]"</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">darl</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">leafHeight</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">visited</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"density"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xlab</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Density"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/glm-prediction-intervals-i-load-packages-plot-darlingtonia-1.png" alt="Kernel density estimates of the distribution of heights of leaves visited or not by wasps." />
<figcaption>
Kernel density estimates of the distribution of heights of leaves visited or not by wasps.
</figcaption>
</figure>
<p>
We're interested in modelling the probability of leaf visitation as a function of leaf height. For this a binomial GLM is a logical choice, with the canonical link function, the logit or logistic function. Such a model is fitted using <code>glm()</code> as follows
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="n">visited</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">leafHeight</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">darl</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">binomial</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Call:
glm(formula = visited ~ leafHeight, family = binomial, data = darl)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.18274 -0.46820 -0.23897 -0.08519 1.90573
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.29295 2.16081 -3.375 0.000738 ***
leafHeight 0.11540 0.03655 3.158 0.001591 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 46.105 on 41 degrees of freedom
Residual deviance: 26.963 on 40 degrees of freedom
AIC: 30.963
Number of Fisher Scoring iterations: 6</code></pre>
</figure>
<p>
The model summary suggests an effect of leaf height; an estimate this large would be unlikely to be observed if there were no true effect. For a unit increase in leaf height, the odds of visitation increase by a factor of 1.12 (given by <code>exp(coef(m)[2])</code>).
</p>
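<p>
That factor is just the exponentiated <code>leafHeight</code> coefficient from the summary above
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">exp(coef(m)[2])  # exp(0.11540), approximately 1.12</code></pre>
</figure>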
<p>
How the probability of visitation varies as a function of leaf height, as estimated by the binomial GLM, can be visualised by predicting for a grid of values over the observed range of leaf heights. An approximate 95% point-wise confidence interval can also be created for the fitted function. In this case, we should create the confidence interval on the scale of the linear predictor, where we assume things behave in a more Gaussian-like manner, and then back-transform the calculated interval onto the probability scale using the inverse of the link function. The code below shows a general solution for this, where the inverse link function is obtained from the <code>family()</code> object contained within the fitted GLM object
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ilink</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">family</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="o">$</span><span class="n">linkinv</span><span class="w">
</span><span class="n">pd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">darl</span><span class="p">,</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">leafHeight</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">leafHeight</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">leafHeight</span><span class="p">),</span><span class="w">
</span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)))</span><span class="w">
</span><span class="n">pd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"link"</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)[</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="n">pd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">Fitted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="p">),</span><span class="w"> </span><span class="n">Upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)),</span><span class="w">
</span><span class="n">Lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ilink</span><span class="p">(</span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">)))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">darl</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">leafHeight</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">visited</span><span class="p">)))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Lower</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Upper</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">leafHeight</span><span class="p">),</span><span class="w">
</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"steelblue2"</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Fitted</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">leafHeight</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Probability of visitation"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xlab</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/glm-prediction-intervals-i-fitted-function-and-ci-1.png" alt="Estimated probability of visitation plus pointwise 95% confidence interval." />
<figcaption>
Estimated probability of visitation plus pointwise 95% confidence interval.
</figcaption>
</figure>
<p>
So far, so standard; the confidence interval is just that, a Wald confidence interval on the fitted function based on the standard errors of the estimates of the model coefficients. It is not a prediction interval, however.
</p>
<p>
The fitted model can be interpreted as describing the binomial distribution for any given value of <code>leafHeight</code>. The binomial distribution is specified by two parameters: <em>n</em> the number of trials (specified via argument <code>size</code> in R's <code>dbinom()</code> and related functions), and <em>p</em> the probability of success. In the <em>Darlingtonia</em> example, <em>n</em> is 1 because each leaf was the result of 1 trial; was the leaf visited or not during the experiment? <em>p</em> is given by <span class="math inline">\(g(\eta)^{-1} = g(\beta_0 + \beta_1 \text{leafHeight})^{-1}\)</span>, where <span class="math inline">\(g\)</span> is the logit link function and <span class="math inline">\(g^{-1}\)</span> is its inverse. In other words, the probability parameter of the binomial distribution is a function of <code>leafHeight</code>.
</p>
<p>
To create a prediction interval for a value of <code>leafHeight</code>, we could look at the probability quantiles of the binomial distribution with <code>size = 1</code> and <code>prob = Fitted[leafHeight]</code>. For example, for the minimum and maximum observed leaf heights the extreme 2.5% and 97.5% probability quantiles are
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">qbinom</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">Fitted</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">)))</span><span class="w">
</span><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">qbinom</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tail</span><span class="p">(</span><span class="n">Fitted</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">)))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0 0
[1] 0 1</code></pre>
</figure>
<p>
In the first instance, for the minimum observed leaf height, the prediction interval is 0. Yes, just 0. For the maximum observed leaf height the 95% prediction interval is 0–1. Neither of these is very useful; one isn't even an interval in the usual sense of the word, and the other is so wide as to encompass both 0 and 1, which is no more information than we had before we started the whole exercise – a leaf can only be visited or not.
</p>
<p>
But this isn't quite what we want; we've only explored the quantiles of the distributions conditional upon the estimated probability. A real prediction interval would account for the uncertainty in this estimate. For that, we need the upper and lower confidence limits for the estimated probability.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">qbinom</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">head</span><span class="p">(</span><span class="n">Lower</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">),</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">Upper</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">))))</span><span class="w">
</span><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">qbinom</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">tail</span><span class="p">(</span><span class="n">Lower</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">),</span><span class="w"> </span><span class="n">tail</span><span class="p">(</span><span class="n">Upper</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">))))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0 1
[1] 0 1</code></pre>
</figure>
<p>
I think we can all agree that these intervals aren't really that useful…
</p>
<p>
Another way to use the fitted model is via what it says about the posterior density of the two possible predicted values, visited or unvisited. This can be computed with <code>dbinom()</code> using the code below, again for the minimum and maximum observed leaf heights
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">db</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">dbinom</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Fitted</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">)]),</span><span class="w">
</span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">db</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"NotVisited"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Visited"</span><span class="p">)</span><span class="w">
</span><span class="n">rownames</span><span class="p">(</span><span class="n">db</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">paste</span><span class="p">(</span><span class="s2">"leafHeight ="</span><span class="p">,</span><span class="w"> </span><span class="nf">range</span><span class="p">(</span><span class="n">leafHeight</span><span class="p">)))</span><span class="w">
</span><span class="nf">round</span><span class="p">(</span><span class="n">db</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> NotVisited Visited
leafHeight = 14 0.9966 0.0034
leafHeight = 84 0.0831 0.9169</code></pre>
</figure>
<p>
We see almost all the probability density on the unvisited outcome for leaves 14cm in height (which is also why the 95% interval we calculated earlier was all on unvisited (0); we'd need to go beyond a 99.7% interval to get the visited alternative (1) included in the interval). For leaves of 84cm, most of the density is on the visited outcome, but with approximately 8% on the unvisited outcome.
</p>
<p>
However, these values are exactly what we get if we just take the fitted probabilities for these leaf heights, which are given by the solid line in the plot we made earlier
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">Fitted</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">)])</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0.003410958 0.916879065</code></pre>
</figure>
<p>
These values are for the visited outcome, but subtract them from 1 and you have the values for the unvisited outcome
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">Fitted</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">)])</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0.99658904 0.08312093</code></pre>
</figure>
<p>
As before, this ignores the uncertainty in the estimated probability of visitation. The densities incorporating this uncertainty are shown in the table below
</p>
<table>
<caption>
Estimated probability of the visited and not-visited outcomes based on the upper (upr) and lower (lwr) 95% interval of the model-estimated probability of visitation for two leaf heights.
</caption>
<thead>
<tr class="header">
<th style="text-align: left;">
</th>
<th style="text-align: right;">
Not Visited (lwr)
</th>
<th style="text-align: right;">
Not Visited (upr)
</th>
<th style="text-align: right;">
Visited (lwr)
</th>
<th style="text-align: right;">
Visited (upr)
</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">
leafHeight = 14
</td>
<td style="text-align: right;">
0.9999
</td>
<td style="text-align: right;">
0.9125
</td>
<td style="text-align: right;">
0.0001
</td>
<td style="text-align: right;">
0.0875
</td>
</tr>
<tr class="even">
<td style="text-align: left;">
leafHeight = 84
</td>
<td style="text-align: right;">
0.4415
</td>
<td style="text-align: right;">
0.0103
</td>
<td style="text-align: right;">
0.5585
</td>
<td style="text-align: right;">
0.9897
</td>
</tr>
</tbody>
</table>
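<p>
These densities can be computed in the same way as before; a sketch, using the <code>Lower</code> and <code>Upper</code> columns of <code>pd</code> created earlier
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## densities of the not-visited (0) and visited (1) outcomes at the
## limits of the 95% interval on the fitted probability of visitation
db2 <- with(pd, rbind(dbinom(rep(c(0, 1), each = 2), size = 1,
                             prob = c(Lower[1], Upper[1])),
                      dbinom(rep(c(0, 1), each = 2), size = 1,
                             prob = c(Lower[100], Upper[100]))))
dimnames(db2) <- list(paste("leafHeight =", range(pd$leafHeight)),
                      c("NotVisited (lwr)", "NotVisited (upr)",
                        "Visited (lwr)", "Visited (upr)"))
round(db2, 4)</code></pre>
</figure>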
<p>
One more thing we can do with the fitted model is simulate random outcomes from it. Again we do this for the minimum and maximum observed leaf heights, first for the lowest leaf height
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">nrand</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10000</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">table</span><span class="p">(</span><span class="n">rbinom</span><span class="p">(</span><span class="n">nrand</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">Fitted</span><span class="p">[</span><span class="m">1</span><span class="p">])))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">nrand</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> 0 1
0.9977 0.0023 </code></pre>
</figure>
<p>
and then for the largest observed leaf height
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">table</span><span class="p">(</span><span class="n">rbinom</span><span class="p">(</span><span class="n">nrand</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pd</span><span class="p">,</span><span class="w"> </span><span class="n">Fitted</span><span class="p">[</span><span class="m">100</span><span class="p">])))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">nrand</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> 0 1
0.0867 0.9133 </code></pre>
</figure>
<p>
The numbers should look pretty familiar – they are very close to both the posterior densities returned using <code>dbinom()</code> and the fitted probabilities we just looked at. In fact, as <code>nrand</code> tends to infinity, the proportions of the two outcomes will approach those given by <code>dbinom()</code>. As before, though I won't show it in full here, a complete interval would also include the uncertainty in the estimated probability.
</p>
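<p>
A sketch of one way to fold that uncertainty in: draw values of the linear predictor from its approximate Gaussian sampling distribution, back-transform them, and simulate an outcome for each draw (shown here for the largest observed leaf height)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## sketch: propagate uncertainty in the fitted probability by drawing
## linear-predictor values before simulating the binomial outcomes
set.seed(1)
eta <- with(pd, rnorm(nrand, mean = fit[100], sd = se.fit[100]))
table(rbinom(nrand, size = 1, prob = ilink(eta))) / nrand</code></pre>
</figure>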
<p>
In this example, the most useful outputs from the model are all based on the binomial distributions given values of leaf height. The interval given by the extreme 2.5th and 97.5th probability quantiles isn't of much use at all; for the two values of leaf height we looked at, the interval either wasn't an interval at all or it told us no more than we already knew, that leaves either were or were not visited.
</p>
<p>
That said, this binomial GLM example is pretty extreme; the observed data only take values <em>0</em> or <em>1</em> and nothing else. However, this has been a useful exercise to think about what the fitted model represents.
</p>
<p>
In the second part of this post I'll look at a model for a count response, which will start to look a little more interval-like than the one here.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Dixon2005-bb">
<p>
Dixon, P. M., Ellison, A. M., and Gotelli, N. J. (2005). Improving the precision of estimates of the frequency of rare events. <em>Ecology</em> 86, 1114ā1123. doi:<a href="https://doi.org/10.1890/04-0601">10.1890/04-0601</a>.
</p>
</div>
<div id="ref-Gotelli2013-wm">
<p>
Gotelli, N. J., and Ellison, A. M. (2013). <em>A primer of ecological statistics</em>. 2nd ed. Sinauer Associates.
</p>
</div>
</div>
Simultaneous intervals for derivatives of smooths revisited
Gavin L. Simpson
2017-03-21T00:00:00-06:00
2017-03-21T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2017/03/21/simultaneous-intervals-for-derivatives-of-smooths/
<p>
Eighteen months ago <a href="https://www.fromthebottomoftheheap.net/2016/12/15/simultaneous-interval-revisited/">I screwed up</a>! I'd written a <a href="https://www.fromthebottomoftheheap.net/2014/06/16/simultaneous-confidence-intervals-for-derivatives/">post</a> in which I described the use of simulation from the posterior distribution of a fitted GAM to derive simultaneous confidence intervals for the derivatives of a penalized spline. It was a nice post that attracted some interest. It was also wrong. In December I corrected the first part of that mistake by illustrating one approach to compute an actual simultaneous interval, but only for the fitted smoother. At the time I thought that the approach I outlined would translate to the derivatives, but I was being lazy, then Christmas came and went, and I was back to teaching – you know how it goes. Anyway, in this post I hope to finally rectify my past stupidity and show how the approach used to generate simultaneous intervals from the December 2016 post can be applied to the derivatives of a spline.
</p>
<p>
If you haven't read the December 2016 post I suggest you do so, as there I explain this:
</p>
<p>
<span class="math display">[ <span class="math display">\[\begin{align}
\mathbf{\hat{f}_g} &amp;\pm m_{1 - \alpha} \begin{bmatrix}
\widehat{\mathrm{st.dev}} (\hat{f}(g_1) - f(g_1)) \\
\widehat{\mathrm{st.dev}} (\hat{f}(g_2) - f(g_2)) \\
\vdots \\
\widehat{\mathrm{st.dev}} (\hat{f}(g_M) - f(g_M)) \\
\end{bmatrix}
\end{align}\]</span> ]</span>
</p>
<p>
This equation states that the critical value for a 100(1 - <span class="math inline">\(\alpha\)</span>)% simultaneous interval is given by the 100(1 - <span class="math inline">\(\alpha\)</span>)% quantile of the distribution of the maximum absolute standardized deviation of the fitted function from the true function. We don't know this distribution, so we generated realizations from it using simulation, and used the empirical quantiles of the simulated distribution to give the appropriate critical value <span class="math inline">\(m\)</span> with which to calculate the simultaneous interval. In that post I worked my way through some R code to show how you can calculate this for a fitted spline.
</p>
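<p>
For orientation, here is a compressed sketch of that simulation, assuming a fitted GAM <code>m</code> and a data frame of prediction locations <code>newd</code> like those created later in this post; the December 2016 post discusses each of these steps in detail
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## sketch: critical value for a 95% simultaneous interval
Vb <- vcov(m)                                 # Bayesian covariance of the coefs
Cg <- predict(m, newd, type = "lpmatrix")     # prediction matrix
se.fit <- sqrt(rowSums((Cg %*% Vb) * Cg))     # point-wise standard errors
BUdiff <- MASS::mvrnorm(10000, mu = rep(0, nrow(Vb)), Sigma = Vb)
simDev <- Cg %*% t(BUdiff)                    # deviations of fitted from truth
absDev <- abs(sweep(simDev, 1L, se.fit, FUN = "/"))
masd <- apply(absDev, 2L, max)                # max abs standardized deviation
crit <- quantile(masd, prob = 0.95, type = 8) # the critical value m</code></pre>
</figure>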
<p>
To keep this post relatively short, I won't rehash the discussion of the code used to compute the critical value <span class="math inline">\(m\)</span>. I also won't cover in detail how these derivatives are computed. We use finite differences and the general approach is explained in an <a href="/2014/05/15/identifying-periods-of-change-with-gams/">older post</a>. I don't recommend you use the code in that post for real data analysis, however. Whilst I was putting together this post I re-wrote the derivative code, as well as that for computing point-wise and simultaneous intervals, and started a new R package <strong>tsgam</strong>. <strong>tsgam</strong> is <a href="http://github.com/gavinsimpson/tsgam">available on GitHub</a> and we'll use it here. Note this package isn't even at version 0.1 yet, but the code for derivatives and intervals has been through several iterations now and has worked well whenever I have tested it.
</p>
<p>
Assuming you have the <strong>devtools</strong> package installed, you can install <strong>tsgam</strong> using
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">devtools</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"gavinsimpson/tsgam"</span><span class="p">)</span></code></pre>
</figure>
<p>
As example data, I'll again use the strontium isotope data set included in the <strong>SemiPar</strong> package, which is extensively analyzed in the monograph <em>Semiparametric Regression</em> <span class="citation" data-cites="Ruppert2003-pt">(Ruppert et al., 2003)</span>. First, load the packages we'll need as well as the data, which is data set <code>fossil</code>. If you don't have <strong>SemiPar</strong> installed, install it using <code>install.packages("SemiPar")</code> before proceeding
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span><span class="w"> </span><span class="c1"># fit the GAM</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"tsgam"</span><span class="p">)</span><span class="w"> </span><span class="c1"># code for derivatives & intervals</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w"> </span><span class="c1"># package for nice plots</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w"> </span><span class="c1"># simpler theme for the plots</span><span class="w">
</span><span class="n">data</span><span class="p">(</span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">package</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"SemiPar"</span><span class="p">)</span><span class="w"> </span><span class="c1"># load the data</span></code></pre>
</figure>
<p>
The <code>fossil</code> data set includes two variables and is a time series of strontium isotope measurements on samples from a sediment core. The data are shown below using <code>ggplot()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">strontium.ratio</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">scale_x_reverse</span><span class="p">()</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-for-derivatives-of-smooths-plot-fossil-data-1.png" alt="The strontium isotope example data used in the post" />
<figcaption>
The strontium isotope example data used in the post
</figcaption>
</figure>
<p>
The aim of the analysis of these data is to model how the measured strontium isotope ratio changed through time, using a GAM to estimate the clearly non-linear change in the response. As time runs in the opposite direction to sediment age, we should probably model these data on the time scale, especially if we want to investigate residual temporal auto-correlation. This requires creating a new variable <code>negAge</code>, for modelling purposes only
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">fossil</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">negAge</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">-</span><span class="n">age</span><span class="p">)</span></code></pre>
</figure>
<p>
As per the previous post a reasonable GAM for these data is fitted using <strong>mgcv</strong> and <code>gam()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">strontium.ratio</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">negAge</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span></code></pre>
</figure>
<p>
Having fitted the model we should do some evaluation of it, but I'm going to skip that here and move straight to computing the derivative of the fitted spline and a simultaneous interval for it. First we set some constants that we can refer to throughout the rest of the post
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## parameters for testing</span><span class="w">
</span><span class="n">UNCONDITIONAL</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="kc">FALSE</span><span class="w"> </span><span class="c1"># unconditional or conditional on estimating smooth params?</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10000</span><span class="w"> </span><span class="c1"># number of posterior draws</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">500</span><span class="w"> </span><span class="c1"># number of newdata values</span><span class="w">
</span><span class="n">EPS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1e-07</span><span class="w"> </span><span class="c1"># finite difference</span></code></pre>
</figure>
<p>
To facilitate checking that this interval has the correct coverage properties, I'm going to fix the locations where we'll evaluate the derivative, computing the vector of values to predict at just once. Normally you wouldn't need to do this to compute the derivatives and associated confidence intervals – you would just set the number of values <code>n</code> over the range of the predictors you want – and if you have a model with several splines it is probably easier to let <strong>tsgam</strong> handle this part for you.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## where are we going to predict at?</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">negAge</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">negAge</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">negAge</span><span class="p">),</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)))</span></code></pre>
</figure>
<p>
The <code>fderiv()</code> function in <strong>tsgam</strong> computes the first derivative of all splines in the supplied GAM<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>, or you can request derivatives for a specified smooth term. As we have only a single smooth term in the model, we simply pass in the model and the data frame of locations at which to evaluate the derivative
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">fd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fderiv</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">eps</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">EPS</span><span class="p">,</span><span class="w"> </span><span class="n">unconditional</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">UNCONDITIONAL</span><span class="p">)</span></code></pre>
</figure>
<p>
(We set <code>eps = EPS</code> so that we use the same finite-difference shift later in the post when checking the coverage properties, and we don't account for the uncertainty due to estimating the smoothness parameters (<code>unconditional = FALSE</code>); normally you can leave both arguments at their defaults.) The object returned by <code>fderiv()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">str</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">List of 6
$ derivatives :List of 1
$ terms : chr "negAge"
$ model :List of 52
..- attr(*, "class")= chr [1:3] "gam" "glm" "lm"
$ eps : num 1e-07
$ eval :'data.frame': 500 obs. of 1 variable:
$ unconditional: logi FALSE
- attr(*, "class")= chr "fderiv"</code></pre>
</figure>
<p>
contains a component <code>derivatives</code>, holding the evaluated derivatives for all smooth terms, or just those selected. The other components include a copy of the fitted model and some additional parameters that are required for the confidence intervals. Confidence intervals for the derivatives are computed using the <code>confint()</code> method. The <code>type</code> argument specifies whether point-wise or simultaneous intervals are required. For the latter, the number of simulations to draw is required via <code>nsim</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w"> </span><span class="c1"># set the seed to make this repeatable </span><span class="w">
</span><span class="n">sint</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">confint</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"simultaneous"</span><span class="p">,</span><span class="w"> </span><span class="n">nsim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">N</span><span class="p">)</span></code></pre>
</figure>
<p>
To make it easier to work with the results I wrote the <code>confint()</code> method so that it returns the confidence interval as a tidy data frame suitable for plotting with <strong>ggplot2</strong>. <code>sint</code> is a data frame with an identifier for the smooth term to which each row relates (<code>term</code>), plus columns containing the estimated derivative (<code>est</code>) and the lower and upper confidence limits
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">head</span><span class="p">(</span><span class="n">sint</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> term lower est upper
1 negAge -0.000053 -6.1e-06 0.000041
2 negAge -0.000053 -6.1e-06 0.000041
3 negAge -0.000053 -6.1e-06 0.000040
4 negAge -0.000052 -6.1e-06 0.000040
5 negAge -0.000052 -6.1e-06 0.000040
6 negAge -0.000051 -6.0e-06 0.000039</code></pre>
</figure>
<p>
The estimated derivative plus its 95% simultaneous confidence interval are shown below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">sint</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">-</span><span class="n">newd</span><span class="o">$</span><span class="n">negAge</span><span class="p">),</span><span class="w">
</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">est</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_hline</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">yintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lower</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upper</span><span class="p">),</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_x_reverse</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"First derivative"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Age"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-for-derivatives-of-smooths-plot-sint-1.png" alt="Estimated first derivative of the spline fitted to the strontium isotope data. The grey band shows the 95% simultaneous interval." />
<figcaption>
Estimated first derivative of the spline fitted to the strontium isotope data. The grey band shows the 95% simultaneous interval.
</figcaption>
</figure>
<p>
So far so good.
</p>
<p>
Having thought about how to apply the theory outlined in the previous post, it seems that all we need to do to apply it to derivatives is to make the assumption that <em>the estimate of the first derivative is unbiased</em> and hence we can proceed as we did in the previous post by computing <code>BUdiff</code> using a multivariate normal with zero mean vector and the Bayesian covariance matrix of the model coefficients. Where the version for derivatives differs is that we use a prediction matrix for the derivatives instead of for the fitted spline. This prediction matrix is created as follows
</p>
<ol type="1">
<li>
generate a prediction matrix from the current model for the locations in <code>newd</code>,
</li>
<li>
generate a second prediction matrix as before but for slightly shifted locations <code>newd + eps</code>
</li>
<li>
difference these two prediction matrices yielding the prediction matrix for the first differences <code>Xp</code>
</li>
<li>
for each smooth in turn
<ol type="1">
<li>
create a zero matrix, <code>Xi</code>, of the same dimensions as the prediction matrices
</li>
<li>
fill in the columns of <code>Xi</code> that relate to the current smooth using the values of the same columns from <code>Xp</code>
</li>
<li>
multiply <code>Xi</code> by the vector of model coefficients to yield predicted first differences
</li>
<li>
calculate the standard error of these predictions
</li>
</ol>
</li>
</ol>
<p>
The matrix <code>Xi</code> is supplied for each smooth term in the <code>derivatives</code> component of the object returned by <code>fderiv()</code>.
</p>
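<p>
As a minimal sketch of step 4 for our single smooth of <code>negAge</code> (assuming the differenced prediction matrix <code>Xp</code> from steps 1–3 and the fitted model <code>m</code>; the names <code>want</code>, <code>d</code>, and <code>se.d</code> are mine, for illustration), the column-zeroing and standard error calculations might look like
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## step 4, sketched for the single smooth s(negAge); assumes Xp from steps 1-3
Xi <- Xp * 0                                  # zero matrix, same dimensions as Xp
want <- grep("negAge", colnames(Xp))          # columns belonging to this smooth
Xi[, want] <- Xp[, want]                      # fill in only those columns
d <- drop(Xi %*% coef(m))                     # predicted first differences
se.d <- sqrt(rowSums((Xi %*% vcov(m)) * Xi))  # standard errors of the predictions</code></pre>
</figure>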
<p>
Once I'd grokked this one basic assumption about the unbiasedness of the first derivative, the rest of the translation of the method to derivatives fell into place. As we are using finite differences, we may introduce a little bias when estimating the first derivatives, but this can be reduced by making <code>eps</code> smaller, though the default probably suffices.
</p>
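<p>
To make the finite difference idea concrete, here is a toy example (a hypothetical helper, not part of <strong>tsgam</strong>) that approximates a known derivative by a forward difference
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## forward finite difference: f'(x) is approximately (f(x + eps) - f(x)) / eps
fd <- function(f, x, eps = 1e-07) (f(x + eps) - f(x)) / eps
fd(sin, 1)  # close to the true derivative, cos(1) = 0.5403023</code></pre>
</figure>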
<p>
To see the detail of how this is done, look at the source code for <code>tsgam:::simultaneous</code>, which apart from a bit of renaming of objects follows closely the code in the <a href="https://www.fromthebottomoftheheap.net/2016/12/15/simultaneous-interval-revisited/">previous post</a>.
</p>
<p>
Having computed the purported simultaneous interval for the derivatives of the trend, we should do what I didn't do in the original posts about these intervals and go and look at the coverage properties of the generated interval.
</p>
<p>
To do that I'm going to simulate a large number, <code>N</code>, of draws from the posterior distribution of the model. Each of these draws is a fitted spline that includes the uncertainty in the estimated model coefficients. Note that I'm not including a correction here for the uncertainty due to the smoothing parameters being estimated – you can set <code>unconditional = TRUE</code> throughout (or change <code>UNCONDITIONAL</code> above) to include this extra uncertainty if you wish.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">Vb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vcov</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">unconditional</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">UNCONDITIONAL</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">24</span><span class="p">)</span><span class="w">
</span><span class="n">sims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">MASS</span><span class="o">::</span><span class="n">mvrnorm</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">m</span><span class="p">),</span><span class="w"> </span><span class="n">Sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Vb</span><span class="p">)</span><span class="w">
</span><span class="n">X0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lpmatrix"</span><span class="p">)</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">newd</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">EPS</span><span class="w">
</span><span class="n">X1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lpmatrix"</span><span class="p">)</span><span class="w">
</span><span class="n">Xp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="n">X1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">X0</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">EPS</span><span class="w">
</span><span class="n">derivs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Xp</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">sims</span><span class="p">)</span></code></pre>
</figure>
<p>
The code above basically makes a large number of draws from the model posterior and applies the steps of the algorithm outlined above to generate <code>derivs</code>, a matrix containing 10000 draws from the posterior distribution of the model derivatives. Our simultaneous interval should entirely contain about 95% of these posterior draws. Note that a draw here refers to the entire set of evaluations of the first derivative for each posterior draw from the model. The plot below shows 50 such draws (lines)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">matplot</span><span class="p">(</span><span class="n">derivs</span><span class="p">[,</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">)],</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"solid"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-for-derivatives-of-smooths-plot-sample-of-derivs-1.png" alt="50 draws from the posterior distribution of the first derivative of the fitted spline." />
<figcaption>
50 draws from the posterior distribution of the first derivative of the fitted spline.
</figcaption>
</figure>
<p>
and 95% of the 10000 draws (lines) should lie <em>entirely</em> within the simultaneous interval if it has the right coverage properties. Put the other way, only 5% of the draws (lines) should ever venture outside the limits of the interval.
</p>
<p>
To check this is the case, we reuse the <code>inCI()</code> function, which checks whether a draw lies entirely within the interval or not
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">inCI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">upr</span><span class="p">,</span><span class="w"> </span><span class="n">lwr</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">all</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">lwr</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">upr</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
As each <em>column</em> of <code>derivs</code> contains a different draw, we want to apply <code>inCI()</code> to each column in turn
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">fitsInCI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">sint</span><span class="p">,</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">derivs</span><span class="p">,</span><span class="w"> </span><span class="m">2L</span><span class="p">,</span><span class="w"> </span><span class="n">inCI</span><span class="p">,</span><span class="w"> </span><span class="n">upr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upper</span><span class="p">,</span><span class="w"> </span><span class="n">lwr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lower</span><span class="p">))</span></code></pre>
</figure>
<p>
<code>inCI()</code> returns a <code>TRUE</code> if all the points that make up the line representing a single posterior draw lie within the interval and <code>FALSE</code> otherwise, therefore we can sum up the <code>TRUE</code>s (recall that a <code>TRUE == 1</code> and a <code>FALSE == 0</code>) and divide by the number of draws to get an estimate of the coverage properties of the interval. If we do this for our interval
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="nf">sum</span><span class="p">(</span><span class="n">fitsInCI</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">fitsInCI</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0.95</code></pre>
</figure>
<p>
we see that the interval includes 95% of the 10000 draws, which, you'll agree, is pretty close to the desired coverage of 95%.
</p>
<p>
That's it for this post; whilst the signs are encouraging that these simultaneous intervals have the required coverage properties, I've only looked at them for a simple single-term GAM, and only for a response that is conditionally distributed Gaussian. I also haven't looked at anything other than the coverage at an expected 95%. If you do use this in your work, please do check that the interval is working as anticipated. If you discover problems, please let me know either in the comments below or via email. The next task is to start thinking about extending these ideas to work with the wider range of GAMs that <strong>mgcv</strong> can fit, including location-scale models and models with factor-smooth interactions.
</p>
<h2 id="references" class="unnumbered">
References
</h2>
<div id="refs" class="references">
<div id="ref-Ruppert2003-pt">
<p>
Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). <em>Semiparametric regression</em>. Cambridge University Press.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
<code>fderiv()</code> currently works for smooths of a single variable fitted using <code>gam()</code> or <code>gamm()</code>. It hasn't been tested with the location-scale extended families in newer versions of <strong>mgcv</strong> and I doubt it will work with them currently. <a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
Modelling extremes using generalized additive models
Gavin L. Simpson
2017-01-25T00:00:00-06:00
2017-01-25T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2017/01/25/modelling-extremes-with-gams/
<p>
Quite some years ago, whilst working on the EU Sixth Framework project <em>Euro-limpacs</em>, I organized a workshop on statistical methods for analyzing time series data. One of the sessions was on the analysis of extremes, ably given by Paul Northrop (UCL Department of Statistical Science). That intro certainly whetted my appetite, but I never quite found the time to dig into the arcane world of extreme value theory. Two recent events rekindled my interest in extremes; Simon Wood quietly introduced into his <strong>mgcv</strong> package a family function for the generalized extreme value distribution (GEV), and I was asked to review a paper on extremes in time series. Since then I've been investigating options for fitting models for extremes to environmental time series, especially those that allow for time-varying effects of covariates on the parameters of the GEV. One of the first things I did was sit down with <strong>mgcv</strong> to get a feel for the <code>gevlss()</code> family function that Simon had added to the package, by repeating an analysis of a classic example data set that had been performed using the <strong>VGAM</strong> package of Thomas Yee.
</p>
<p>
The analysis I wanted to recreate was reported in a 2007 paper by Thomas Yee and Alec Stephenson <span class="citation" data-cites="Yee2007-rz">(Yee and Stephenson, 2007)</span> and concerned a time series of annual maximum sea levels at Fremantle, Western Australia. This example is also used extensively in Stuart Coles' excellent book on statistical modeling of extremes <span class="citation" data-cites="Coles2001-zz">(Coles, 2001)</span>. The data are available from the <strong>ismev</strong> support package for Coles' book in the data set <code>fremantle</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## install.packages("ismev") # if not installed!</span><span class="w">
</span><span class="n">data</span><span class="p">(</span><span class="n">fremantle</span><span class="p">,</span><span class="w"> </span><span class="n">package</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ismev"</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">fremantle</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> Year SeaLevel SOI
1 1897 1.58 -0.67
2 1898 1.71 0.57
3 1899 1.40 0.16
4 1900 1.34 -0.65
5 1901 1.43 0.06
7 1903 1.19 0.47</code></pre>
</figure>
<p>
The data contain 86 observations of the annual maximum sea level (in meters) over the period 1897–1989. The aim of the analysis is to account for any change in the distribution of annual maxima over time and to investigate any relationship with the Southern Oscillation Index, a measure of meteorological phenomena which reflects the development and intensity of El Niño events, and those of its counterpart La Niña, in the south Pacific. The data are shown below using <strong>ggplot</strong>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"cowplot"</span><span class="p">)</span><span class="w"> </span><span class="c1"># install.packages("cowplot") If not installed !</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">p1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">fremantle</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SeaLevel</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_smooth</span><span class="p">(</span><span class="n">se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">p2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">fremantle</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SOI</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SeaLevel</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_smooth</span><span class="p">(</span><span class="n">se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">plot_grid</span><span class="p">(</span><span class="n">p1</span><span class="p">,</span><span class="w"> </span><span class="n">p2</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/modelling-extremes-with-gams-post-1-packages-and-plot-1.png" alt="Time series of annual sea-level maxima at Fremantle, Western Australia (top) and the relationship between annual sea-level maxima and the Southern Oscillation Index" />
<figcaption>
Time series of annual sea-level maxima at Fremantle, Western Australia (top) and the relationship between annual sea-level maxima and the Southern Oscillation Index
</figcaption>
</figure>
<p>
In extreme value analysis, one of the key components is to assess the behaviour of the very large, or small, events/observations, and often the focus is on those that are much more extreme than anything in the observational record. This requires a considerably different approach to the usual statistical methods that focus on the mean of a distribution. Whilst we could approach the analysis of data like those in <code>fremantle</code> from the viewpoint of traditional methods employing the Gaussian distribution, the events of interest, the extreme high sea-level events, are way off in the tails of a distribution fitted by considering (usually) just its mean (and variance). Even small uncertainties in estimation of the distribution can be amplified when we get out into the extreme tails of the Gaussian, complicating inference about extremes and inflating uncertainties.
</p>
<p>
Extreme value theory has developed separate models and limiting distributions that replace the central role that the Gaussian distribution plays in other areas of statistical modeling and inference. Consider again the sea-level data; the sea level would have been measured daily (or roughly daily) at Fremantle in order to produce the annual maximum series we wish to analyze. For a single year, we might denote these daily observations by <span class="math inline">\(Z_1, \ldots, Z_m\)</span> and we'll assume that these are a random sample of sea-level values. The annual maximum is given by
</p>
<p>
<span class="math display">[ Y_m = { Z_1, , Z_m } ]</span>
</p>
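<p>
As a toy illustration of block maxima (a minimal sketch using simulated daily values, not the real Fremantle record), each year's maximum is simply the largest of that year's observations
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## toy block maxima: one annual maximum from each year of simulated daily data
set.seed(1)
daily <- data.frame(year = rep(1897:1899, each = 365),
                    level = rnorm(3 * 365, mean = 1.4, sd = 0.15))
with(daily, tapply(level, year, max))  # the block (annual) maxima</code></pre>
</figure>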
<p>
<span class="math inline">(Y_m)</span> are commonly known as <em>block maxima</em> ā the maxima of a block of random variables <span class="math inline">(Z_m)</span>. Extreme value theory considers the limiting distribution of <span class="math inline">(Y_m)</span> as <span class="math inline">(m)</span> tends to infinity. More simply, we want derive the distribution of annual maximum sea-level values as the number of annual maxima tends to infinity. The limiting distribution for <span class="math inline">(Y_m)</span> is restricted to the class of generalized extreme value distributions (GEV), which have the following form
</p>
<p>
<span class="math display">[ G(y) = { - _{+}^{-1/} } ]</span>
</p>
<p>
where <span class="math inline">\(\mu\)</span>, <span class="math inline">\(\sigma > 0\)</span>, and <span class="math inline">\(\xi\)</span> are the location, (positive) scale, and shape parameters respectively of the distribution. The distribution has support on values <span class="math inline">\(y\)</span> where <span class="math inline">\(1 + \xi (y - \mu) / \sigma > 0\)</span>, which is indicated by the subscript <span class="math inline">\(+\)</span> in the main equation above. <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\xi\)</span> can take any real value in the range <span class="math inline">\(-\infty\)</span> to <span class="math inline">\(+\infty\)</span>, whereas the scale parameter <span class="math inline">\(\sigma\)</span> can be any positive real value.
</p>
<p>
The GEV distribution encompasses the three potential extreme value distributions for block maxima:
</p>
<ol type="I">
<li>
the <strong>Gumbel</strong> distribution,
</li>
<li>
the <strong>Fréchet</strong> distribution, and
</li>
<li>
the <strong>Weibull</strong> distribution.
</li>
</ol>
<p>
These are also known as the Type I, II, and III extreme value distributions. Though I won't write out the equations for each of these distributions, they are all quite similar to the GEV distribution and have parameters <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span>, whilst the Fréchet and Weibull distributions also have a shape parameter <span class="math inline">\(\alpha\)</span>. The distributions differ markedly at the extreme positive end of <span class="math inline">\(y\)</span>, <span class="math inline">\(y_{+}\)</span>; the Weibull is finite, but both the Fréchet and Gumbel distributions are infinite, being distinguished by having polynomially and exponentially decaying density respectively. Each of these distributions can be reached from the GEV
</p>
<ul>
<li>
the Gumbel is reached when <span class="math inline">\(\xi = 0\)</span>,
</li>
<li>
the Fréchet when <span class="math inline">\(\xi\)</span> is <em>positive</em> (<span class="math inline">\(\xi > 0\)</span>), and
</li>
<li>
the Weibull when <span class="math inline">\(\xi\)</span> is <em>negative</em> (<span class="math inline">\(\xi < 0\)</span>)
</li>
</ul>
<p>
Traditionally, researchers had to decide which type of tail behaviour they expected prior to fitting one of the three extreme value distributions. The clear advantage of the GEV is that the choice of distribution is now a parameter that can be included in the model fitting process leading to fewer <em>a priori</em> decisions needing to be made ahead of the analysis.
</p>
<p>
As I mentioned above, the <code>gevlss()</code> family allows each of the parameters <span class="math inline">\(\mu\)</span>, <span class="math inline">\(\sigma\)</span>, and <span class="math inline">\(\xi\)</span> to have its own linear predictor <span class="math inline">\(\eta\)</span>, which may depend on one or more covariates. When setting up this model, therefore, we need to specify not one formula, but three. These are supplied in a list, with only the first having a left hand side term for the response.
</p>
<p>
The first model considered by <span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span> allowed for a smooth trend in <code>Year</code> and a smooth effect of <code>SOI</code> in the linear predictors for <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span>, whilst <span class="math inline">\(\xi\)</span> was modeled as an intercept-only linear predictor. The reason for the simple linear predictor for <span class="math inline">\(\xi\)</span> is that this parameter is exceedingly difficult to estimate from data; in a relatively small data set like the <code>fremantle</code> one there is very little information with which to inform <span class="math inline">\(\xi\)</span>.
</p>
<p>
To specify this model in <code>gam()</code> we need to create a list of three formula objects as follows:
</p>
<div id="cb1" class="sourceCode">
<pre class="sourceCode r"><code class="sourceCode r"><a class="sourceLine" id="cb1-1" title="1"><span class="kw">list</span>(SeaLevel <span class="op">~</span><span class="st"> </span><span class="kw">s</span>(cYear) <span class="op">+</span><span class="st"> </span><span class="kw">s</span>(SOI),</a>
<a class="sourceLine" id="cb1-2" title="2"> <span class="op">~</span><span class="st"> </span><span class="kw">s</span>(cYear) <span class="op">+</span><span class="st"> </span><span class="kw">s</span>(SOI),</a>
<a class="sourceLine" id="cb1-3" title="3"> <span class="op">~</span><span class="st"> </span><span class="dv">1</span>)</a></code></pre>
</div>
<p>
Key points to note here are
</p>
<ul>
<li>
The ordering of the formula components is <span class="math inline">\(\mu\)</span>, <span class="math inline">\(\sigma\)</span>, and <span class="math inline">\(\xi\)</span>,
</li>
<li>
only the first formula, for <span class="math inline">\(\mu\)</span>, has a left hand side specifying the response variable, in this case <code>SeaLevel</code>,
</li>
<li>
the second and third formulas are right-hand sided only and start with a <code>~</code>,
</li>
<li>
intercept-only linear predictors are indicated by the formula <code>~ 1</code>
</li>
</ul>
<p>
This model can be thought of as an extended GLM and, as such, each linear predictor is associated with a link function. The default links for <span class="math inline">\(\mu\)</span>, <span class="math inline">\(\sigma\)</span>, and <span class="math inline">\(\xi\)</span> in the <code>gevlss()</code> family are <code>"identity"</code>, <code>"identity"</code>, and <code>"logit"</code> respectively, although technically the linear predictor for <span class="math inline">\(\sigma\)</span>, <span class="math inline">\(\eta_{\sigma}\)</span>, is for the <em>log scale parameter</em>, and hence the default identity link implies a fixed log link for <span class="math inline">\(\sigma\)</span>. Additionally, the <code>"logit"</code> link for <span class="math inline">\(\xi\)</span> is modified to restrict the range of <span class="math inline">\(\xi\)</span> to (-1, 0.5). To match the model fitted by <span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span>, the identity link is used for all three parameters.
</p>
<p>
Finally, note that the <strong>VGAM</strong> package requires the user to specify the degrees of freedom for each smooth term and the software searches for a smoothing parameter that achieves the required degrees of freedom. <strong>mgcv</strong> takes a different tack; the user specifies the dimension of the basis (the number of basis functions) to use for each smooth term and then <em>it</em> chooses smoothness parameters via penalized likelihood to maximize a log-marginal or log-restricted marginal likelihood. Assuming that the dimension of the basis is sufficiently rich to include the true but unknown smooth function, the <strong>mgcv</strong> approach avoids the user having to state <em>a priori</em> how wiggly each smooth term should be.
</p>
<p>
<span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span> used three degrees of freedom splines for each smooth term. Here I leave the basis dimension at the (essentially arbitrary) default value of 10. It will be instructive to see what smoothness parameters are selected as optimal, how <strong>mgcv</strong> copes with estimating smoothness in a relatively complex setting, and how the estimated smooths compare with those assumed by <span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span>.
</p>
<p>
One final tweak is required; the estimates of the intercept terms for <span class="math inline">\(\mu\)</span> and <span class="math inline">\(\sigma\)</span> would imply extrapolation backwards in time of some 2,000 years. It can help numerical stability when fitting if we centre <code>Year</code> about, say, the middle of the time series, which we do now before proceeding
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">fremantle</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">fremantle</span><span class="p">,</span><span class="w"> </span><span class="n">cYear</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">median</span><span class="p">(</span><span class="n">Year</span><span class="p">))</span></code></pre>
</figure>
<p>
With that out of the way, the model is fitted with relative ease as follows
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">SeaLevel</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">cYear</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">SOI</span><span class="p">),</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">cYear</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">SOI</span><span class="p">),</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fremantle</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gevlss</span><span class="p">(</span><span class="n">link</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">)))</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">m1</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gevlss
Link function: identity identity identity
Formula:
SeaLevel ~ s(cYear) + s(SOI)
~s(cYear) + s(SOI)
~1
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.49567 0.01517 98.577 <2e-16 ***
(Intercept).1 -2.13680 0.08853 -24.135 <2e-16 ***
(Intercept).2 -0.25472 0.08851 -2.878 0.004 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(cYear) 1.000 1.000 15.030 0.000106 ***
s(SOI) 1.366 1.650 13.549 0.000554 ***
s.1(cYear) 2.032 2.546 4.922 0.129164
s.1(SOI) 1.000 1.000 6.461 0.011026 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Deviance explained = NA%
-REML = -41.116 Scale est. = 1 n = 86</code></pre>
</figure>
<p>
The <code>summary()</code> output is similar to that of standard GAMs, except the convention is to append <code>.N</code>, where <code>N</code> is a positive integer, to terms for (confusingly) the second and third linear predictors respectively. The parametric terms are listed first.
</p>
<p>
Of interest here for this model is the estimate of <span class="math inline">\(\xi\)</span>, which is negative, -0.25 (with standard error 0.09, yielding an approximate 95% confidence interval of -0.43 to -0.08), indicating a Weibull-type distribution for the annual sea-level maxima. The values reported by <span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span> are <span class="math inline">\(\xi\)</span> = -0.27, with standard error 0.06.
</p>
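<p>
The quoted interval is just the usual Wald-type construction from the estimate and standard error reported in the summary above
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## approximate 95% confidence interval for xi from the summary output
-0.25472 + c(-1, 1) * qnorm(0.975) * 0.08851  # roughly -0.43 to -0.08</code></pre>
</figure>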
<p>
The smooth terms are listed next, and with the exception of the smooth of <code>Year</code> in <span class="math inline">\(\eta_{\sigma}\)</span>, all the estimated smooths have been penalized to (effectively) linear functions. The partial effect of each smooth can be plotted using the <code>plot()</code> method for <code>gam</code> models
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">pages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">scheme</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">seWithMean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/modelling-extremes-with-gams-post-1-plot-m1-smooths-1.png" alt="Fitted smooths for model m1 which uses penalized splines for the smooths of Year and SOI in the linear predictors for the location and scale parameters of the GEV distribution" />
<figcaption>
Fitted smooths for model <code>m1</code> which uses penalized splines for the smooths of <code>Year</code> and <code>SOI</code> in the linear predictors for the location and scale parameters of the GEV distribution
</figcaption>
</figure>
<p>
As reported in <span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span>, the fitted smooth of <code>Year</code> in <span class="math inline">\(\eta_{\sigma}\)</span> (lower left panel) is somewhat non-linear, with a partial effect of decreasing variance in sea levels through c. 1945 and increasing variance thereafter. <span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span> suggest that this smooth may be replaced by a piece-wise linear function with a knot around 1945. The authors also simplified the model by replacing the smooths for all the other variables with linear parametric terms. We will investigate this model next.
</p>
<p>
I haven't quite worked out how to get <code>gam()</code> to fit a piece-wise linear function yet, but the approach below is pretty close. The following model uses the new b-spline basis in <strong>mgcv</strong>, which allows a lot of control over how the basis is set up. In base R, a piece-wise linear basis with an interior knot at 1945 would be created using <code>splines::bs(Year, degree = 1, knots = 1945)</code> (see the sketch below), but then as far as <code>gam()</code> is concerned, the resulting basis functions are simply two continuous covariates that are treated as linear parametric terms.
</p>
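<p>
For reference, a minimal sketch of that base-R route, assuming the <code>fremantle</code> data used throughout this post; note that the resulting columns are unpenalized, so <code>gam()</code> would do no smoothing of them:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## a degree-1 (piece-wise linear) basis with an interior knot at 1945
library("splines")
bs_year <- with(fremantle, bs(Year, degree = 1, knots = 1945))
dim(bs_year) # two basis-function columns, one row per observation
## these columns would enter the model formula as ordinary linear
## parametric terms, outside mgcv's smoothing and plotting machinery</code></pre>
</figure>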
<p>
We can use the new b-spline basis to achieve something similar to (the same as?) <code>splines::bs</code> if we set the knot locations explicitly and use <code>m = 1</code> (for linear splines) and basis dimension <code>k = 3</code>. If you are setting the knots manually, then for the b-spline basis in <strong>mgcv</strong> you need to specify <code>k + m + 1</code> (5) knots, and the middle <code>k - m + 1</code> (3) knots should include all the covariate values. I'm not sure what determines where the two exterior knots should be located; in the code below I just place them at +/- 10 years from the extremes of the data. The knot locations are then specified as a list with component <code>cYear</code> (to match the covariate name), and, as we're modeling with the centred <code>Year</code>, I centre the knot locations using the median year as before.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">knots</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">fremantle</span><span class="p">,</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">cYear</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">Year</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="m">1945</span><span class="p">,</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Year</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">))</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">median</span><span class="p">(</span><span class="n">Year</span><span class="p">)))</span></code></pre>
</figure>
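<p>
A quick sanity check that we have the right number of knot locations and that the middle three span the centred covariate:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">knots$cYear
length(knots$cYear) # should be 5 = k + m + 1</code></pre>
</figure>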
<p>
The GAM can then be specified as before with three formulas. The type of smooth for <code>cYear</code> in <span class="math inline">\(\eta_{\sigma}\)</span> is specified via <code>bs = "bs"</code>, and the remaining parameters of the basis are as described above. The list of knots we just created is passed to the <code>knots</code> argument.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">SeaLevel</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">cYear</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">SOI</span><span class="p">,</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">cYear</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"bs"</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">SOI</span><span class="p">,</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fremantle</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gevlss</span><span class="p">(</span><span class="n">link</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">)),</span><span class="w">
</span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">m2</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gevlss
Link function: identity identity identity
Formula:
SeaLevel ~ cYear + SOI
~s(cYear, bs = "bs", m = 1, k = 3) + SOI
~1
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.5010226 0.0153061 98.067 < 2e-16 ***
cYear 0.0019503 0.0005139 3.795 0.000148 ***
SOI 0.0682778 0.0175807 3.884 0.000103 ***
(Intercept).1 -2.1230063 0.0882978 -24.044 < 2e-16 ***
SOI.1 0.2894395 0.1145038 2.528 0.011479 *
(Intercept).2 -0.2543328 0.0885032 -2.874 0.004057 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s.1(cYear) 1.465 2 4.875 0.036 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Deviance explained = NA%
-REML = -39.611 Scale est. = 1 n = 86</code></pre>
</figure>
<p>
The summary output indicates significant linear parametric effects of <code>cYear</code> and <code>SOI</code> in <span class="math inline">\(\eta_{\mu}\)</span>, and of <code>SOI</code> in <span class="math inline">\(\eta_{\sigma}\)</span>. There is now some evidence of an effect of <code>SOI</code> on the variance of the block maxima, although we would be right to treat this result with caution as the piece-wise linear structure was only guessed at after fitting the more general smooth term, which was not statistically significant. <span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span> performed an informal deviance test between the two models, which we repeat here
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">lldif</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">unclass</span><span class="p">(</span><span class="n">logLik</span><span class="p">(</span><span class="n">m1</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">logLik</span><span class="p">(</span><span class="n">m2</span><span class="p">))</span><span class="w">
</span><span class="n">dfdif</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">df.residual</span><span class="p">(</span><span class="n">m2</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">df.residual</span><span class="p">(</span><span class="n">m1</span><span class="p">)</span><span class="w">
</span><span class="n">pchisq</span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">lldif</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dfdif</span><span class="p">,</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0.3693541
attr(,"df")
[1] 9.1958</code></pre>
</figure>
<p>
the results of which match those published and suggest that the simpler model with the piece-wise linear smooth of <code>Year</code> in <span class="math inline">\(\eta_{\sigma}\)</span> is sufficient to describe the effect on the variance of the sea-level maxima.
</p>
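<p>
An alternative, equally informal, comparison of the two models is via AIC; a one-line sketch (both models were fitted with REML, so treat this as a rough guide only):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## smaller values indicate the preferred model
AIC(m1, m2)</code></pre>
</figure>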
<p>
The fitted piece-wise linear smooth can be plotted using the <code>plot()</code> method as before. To get the linear terms plotted we need to use the <code>all.terms = TRUE</code> option
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">m2</span><span class="p">,</span><span class="w"> </span><span class="n">pages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">scheme</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">seWithMean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">all.terms</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/modelling-extremes-with-gams-post-1-plot-m2-smooths-1.png" alt="Fitted smooths and parametric terms for model m2 which uses a piece-wise linear spline for the effect of Year on the scale parameter" />
<figcaption>
Fitted smooths and parametric terms for model <code>m2</code> which uses a piece-wise linear spline for the effect of <code>Year</code> on the scale parameter
</figcaption>
</figure>
<p>
This plot is a little more clunky than the previous one as the linear terms are plotted via calls to <code>termplot()</code>, and the way this is achieved in <code>plot.gam()</code> doesn't allow for separate y-axis limits for the linear terms (<code>scale = 0</code>); the <code>scheme</code> argument does not affect these plots either.
</p>
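<p>
One partial workaround, sketched below, is to draw only the penalized smooth via the <code>select</code> argument of <code>plot.gam()</code>, and handle the parametric terms separately:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## select picks out a single smooth by its position in the model;
## the piece-wise linear smooth of cYear is the only smooth in m2
plot(m2, select = 1, scheme = 1, seWithMean = FALSE)</code></pre>
</figure>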
<p>
If we wanted an entirely data-driven approach to fitting the smooth of <code>Year</code> in <span class="math inline">\(\eta_{\sigma}\)</span>, and wanted to crack that particular nut with an industrial-sized wrecking ball, we could use the adaptive spline basis by changing the basis type for the smooth to <code>bs = "ad"</code> as follows (note this takes a while to fit)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">SeaLevel</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">cYear</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">SOI</span><span class="p">,</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">cYear</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ad"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">SOI</span><span class="p">,</span><span class="w">
</span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fremantle</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gevlss</span><span class="p">(</span><span class="n">link</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">,</span><span class="w"> </span><span class="s2">"identity"</span><span class="p">)))</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">m3</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gevlss
Link function: identity identity identity
Formula:
SeaLevel ~ cYear + SOI
~s(cYear, bs = "ad") + SOI
~1
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.5025023 0.0154310 97.369 < 2e-16 ***
cYear 0.0019770 0.0005175 3.820 0.000133 ***
SOI 0.0689087 0.0175007 3.937 8.23e-05 ***
(Intercept).1 -2.1203031 0.0899383 -23.575 < 2e-16 ***
SOI.1 0.2917915 0.1132973 2.575 0.010011 *
(Intercept).2 -0.2677254 0.0935770 -2.861 0.004223 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s.1(cYear) 1.816 2.046 5.86 0.0563 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Deviance explained = NA%
-REML = -40.975 Scale est. = 1 n = 86</code></pre>
</figure>
<p>
Again, there is some evidence of a trend in the variance of the sea-level maxima; the higher <em>p</em>-value here likely reflects the additional uncertainty arising from having to deduce the shape and varying wiggliness of the spline from the data directly. The resulting smooth is largely indistinguishable from the piece-wise linear one in <code>m2</code>, except for the smooth transition around 1945.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">m3</span><span class="p">,</span><span class="w"> </span><span class="n">pages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">scheme</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">seWithMean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">all.terms</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/modelling-extremes-with-gams-post-1-plot-m3-smooths-1.png" alt="Fitted smooths and parametric terms for model m3 which uses an adaptive spline for the effect of Year on the scale parameter" />
<figcaption>
Fitted smooths and parametric terms for model <code>m3</code> which uses an adaptive spline for the effect of <code>Year</code> on the scale parameter
</figcaption>
</figure>
<p>
My attempt to replicate the analysis of <span class="citation" data-cites="Yee2007-rz">Yee and Stephenson (2007)</span> was largely devoid of trouble, despite the <code>gevlss()</code> family being both new and described by Simon Wood as "somewhat experimental". The main difficulty was in trying to get a piece-wise linear spline within the <strong>mgcv</strong> framework, largely because doing it via <code>splines::bs()</code> makes it much more difficult to plot the partial effect of the overall function with the easily accessible tools that <strong>mgcv</strong> provides.
</p>
<p>
One area where <strong>mgcv</strong> is lacking in relation to <strong>VGAM</strong> for fitting GEV models is in the array of support functions that go with the fitted models: <strong>VGAM</strong> has lots of plot types specific to extreme value models that help with interpreting and checking the fitted model. In a future post I may try to tackle some of this using <strong>mgcv</strong>, if I find the time.
</p>
<p>
This is hopefully the first of several posts on modeling block maxima using <strong>mgcv</strong> and GAMs, so if you have any comments, suggestions, or corrections, let me know in the comments below.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Coles2001-zz">
<p>
Coles, S. (2001). <em>An introduction to statistical modeling of extreme values</em>. Springer London. doi:<a href="https://doi.org/10.1007/978-1-4471-3675-0">10.1007/978-1-4471-3675-0</a>.
</p>
</div>
<div id="ref-Yee2007-rz">
<p>
Yee, T. W., and Stephenson, A. G. (2007). Vector generalized linear and additive extreme value models. <em>Extremes</em> 10, 1–19. doi:<a href="https://doi.org/10.1007/s10687-007-0032-4">10.1007/s10687-007-0032-4</a>.
</p>
</div>
</div>
Pangaea and R and open palaeo data
Gavin L. Simpson
2016-12-16T00:00:00-06:00
2016-12-16T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/12/16/pangaea-r-open-palaeo-data/
<p>
For a while now, I've been wanting to experiment with rOpenSci's <strong>pangaear</strong> package <span class="citation" data-cites="pangaear-024">(Chamberlain et al., 2016)</span>, which allows you to search, and download data from, Pangaea, a major data repository for the earth and environmental sciences. Earlier in the year, as a member of the editorial board of <a href="http://www.nature.com/sdata/">Scientific Data</a>, Springer Nature's open data journal, I was handling a data descriptor submission that described a new 2,200-year foraminiferal δ<sup>18</sup>O record from the Gulf of Taranto in the Ionian Sea <span class="citation" data-cites="Taricco2016-pv">(Taricco et al., 2016)</span>. The data descriptor was recently <a href="http://doi.org/10.1038/sdata.2016.42">published</a>, and as part of the submission Carla Taricco deposited the data set in Pangaea. So, what better opportunity to test out <strong>pangaear</strong>? (Oh, and to fit a GAM to the data while I'm at it!)
</p>
<p>
The post makes use of the following packages: <strong>pangaear</strong> (obviously), <strong>mgcv</strong> and <strong>ggplot2</strong> for modelling and plotting, and <strong>tibble</strong> because <strong>pangaear</strong> returns search results and data sets in tibbles that I need to manipulate before I can fit a GAM to the δ<sup>18</sup>O record.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"pangaear"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"tibble"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span></code></pre>
</figure>
<p>
To download a data set from Pangaea you need to know the DOI of the deposit. If you don't know the Pangaea DOI, you can search the data records held by Pangaea for specific terms. In <strong>pangaear</strong>, searching is done using the <code>pg_search()</code> function. To find the data set I want, I'm going to search for records that have the string <code>"Taricco"</code> in the citation.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">recs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pg_search</span><span class="p">(</span><span class="n">query</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"citation:Taricco"</span><span class="p">)</span><span class="w">
</span><span class="n">recs</span><span class="o">$</span><span class="n">citation</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] "Versteegh, GJM; de Leeuw, JW; Taricco, C et al. (2007): Alkenone-derived UK'37 data and sea surface temperatures (SST) of sediment cores from the Gulf of Taranto (Italy)"
[2] "Taricco, C; Alessio, S; Rubinetti, S et al. (2016): A foraminiferal d18O record of sediment core GT90-3 covering the last 2,200 years"
[3] "Versteegh, GJM; de Leeuw, JW; Taricco, C et al. (2007): Alkenone-derived UK'37 data and sea surface temperatures (SST) of sediment core GT89-3"
[4] "Versteegh, GJM; de Leeuw, JW; Taricco, C et al. (2007): Alkenone-derived UK'37 data and sea surface temperatures (SST) of a combined sediment core"
[5] "Versteegh, GJM; de Leeuw, JW; Taricco, C et al. (2007): Alkenone-derived UK'37 data and sea surface temperatures (SST) of sediment core GT91-1"
[6] "Versteegh, GJM; de Leeuw, JW; Taricco, C et al. (2007): Alkenone-derived UK'37 data and sea surface temperatures (SST) of sediment core GT90-3" </code></pre>
</figure>
<p>
Assuming that the query didn't time out (Pangaea can be a little slow to respond on occasion, so you might find increasing the timeout on the query helps), <code>recs</code> should contain 6 records with <code>"Taricco"</code> in the citation. The one we want is the second entry.
</p>
<p>
To download the data object(s) associated with a record in Pangaea, we use the <code>pg_data()</code> function, supplying it with a single DOI.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pg_data</span><span class="p">(</span><span class="n">doi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">recs</span><span class="o">$</span><span class="n">doi</span><span class="p">[</span><span class="m">2</span><span class="p">])</span><span class="w"> </span><span class="c1"># doi = "10.1594/PANGAEA.857573"</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Downloading 1 datasets from 10.1594/PANGAEA.857573</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Processing 1 files</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">res</span><span class="w">
</span><span class="n">str</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="m">1</span><span class="p">]],</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[[1]]
<Pangaea data> 10.1594/PANGAEA.857573
# A tibble: 560 × 4
`Depth [m]` `Age [a AD]` `Age [ka BP]` `G. ruber d18O [per mil PDB]`
<dbl> <dbl> <dbl> <dbl>
1 1.4000 -188.20 2.13820 0.742
2 1.3975 -184.33 2.13433 0.290
3 1.3950 -180.46 2.13046 0.706
4 1.3925 -176.59 2.12659 0.356
5 1.3900 -172.72 2.12272 0.558
6 1.3875 -168.85 2.11885 0.746
7 1.3850 -164.98 2.11498 0.346
8 1.3825 -161.11 2.11111 0.554
9 1.3800 -157.24 2.10724 0.510
10 1.3775 -153.37 2.10337 0.543
# ... with 550 more rows
List of 4
$ doi : chr "10.1594/PANGAEA.857573"
$ citation:List of 1
..- attr(*, "class")= chr "citation"
$ meta :List of 1
..- attr(*, "class")= chr "meta"
$ data :Classes 'tbl_df', 'tbl' and 'data.frame': 560 obs. of 4 variables:
- attr(*, "class")= chr "pangaea"</code></pre>
</figure>
<p>
In Pangaea, a DOI might refer to a collection of data objects, in which case the object returned by <code>pg_data()</code> would be a list with as many components as objects in the collection. In this instance there is but a single data object associated with the requested DOI, but for consistency it is returned in a list with a single component.
</p>
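<p>
A quick check of that structure:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## one component per data object in the collection
length(res) # 1 here
class(res[[1]]) # "pangaea"</code></pre>
</figure>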
<p>
Rather than work with the <code>pangaea</code> object directly, for modelling or plotting it is, for the moment at least, going to be simpler if we extract out the data object, which is stored in the <code>$data</code> component. We'll also want to tidy up those variable/column names
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">foram <- res[[1]]$data
names(foram) <- c("Depth", "Age_AD", "Age_kaBP", "d18O")
foram</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"># A tibble: 560 Ć 4
Depth Age_AD Age_kaBP d18O
<dbl> <dbl> <dbl> <dbl>
1 1.4000 -188.20 2.13820 0.742
2 1.3975 -184.33 2.13433 0.290
3 1.3950 -180.46 2.13046 0.706
4 1.3925 -176.59 2.12659 0.356
5 1.3900 -172.72 2.12272 0.558
6 1.3875 -168.85 2.11885 0.746
7 1.3850 -164.98 2.11498 0.346
8 1.3825 -161.11 2.11111 0.554
9 1.3800 -157.24 2.10724 0.510
10 1.3775 -153.37 2.10337 0.543
# ... with 550 more rows</code></pre>
</figure>
<p>
Now that's done, we can take a look at the data set
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ylabel</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">expression</span><span class="p">(</span><span class="n">delta</span><span class="o">^</span><span class="p">{</span><span class="m">18</span><span class="p">}</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">O</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s2">"[ā° VPDB]"</span><span class="p">)</span><span class="w">
</span><span class="n">xlabel</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"Age [ka BP]"</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">foram</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">d18O</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Age_kaBP</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_x_reverse</span><span class="p">(</span><span class="n">sec.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sec_axis</span><span class="p">(</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1950</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">.</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">1000</span><span class="p">),</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Age [AD]"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_y_reverse</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ylabel</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xlabel</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-revisited-plot-data-1.png" alt="The Ī“^18^O record of Taricco et al (2016)" />
<figcaption>
The δ<sup>18</sup>O record of Taricco <em>et al</em> (2016)
</figcaption>
</figure>
<p>
Notice that the x-axis has been reversed on this plot so that as we move from left to right the observations become younger, as is standard for a time series. In the code block above I've used <code>sec_axis()</code> to add an AD scale to the x-axis. This is a new feature in version 2.2.0 of <strong>ggplot2</strong>, which allows a secondary axis that is a one-to-one transformation of the main scale. This isn't quite right here as the two scales don't map in a fully one-to-one fashion; because there is no year 0 AD (or 0 <abbr title="Before Common Era">BCE</abbr>), the scale will be a year out for the BCE period.
</p>
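<p>
The forward transformation used for the secondary axis is simple arithmetic, with 1950 AD taken as the "present" of the BP scale; a small sketch (<code>bp2ad</code> is a hypothetical helper name):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## convert thousands of years before present (ka BP) to years AD
bp2ad <- function(x) 1950 - (x * 1000)
bp2ad(c(2.1382, 0))
## [1] -188.2 1950.0
## note "-188 AD" is really 189 BCE: the off-by-one described above</code></pre>
</figure>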
<p>
Note too that the y-axis has also been reversed, to match the published versions of the data. This is done in those publications because δ<sup>18</sup>O has an interpretation as temperature, with lower δ<sup>18</sup>O indicating higher temperatures. As is common for data from proxies that have a temperature interpretation, the values are plotted in a way that <em>up</em> on the plot means <em>warmer</em> and <em>down</em> means colder.
</p>
<p>
To model the data in the same time-ordered way using the year BP variable, we need to create a variable that is the negative of <code>Age_kaBP</code>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">foram</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">foram</span><span class="p">,</span><span class="w"> </span><span class="n">add_column</span><span class="p">(</span><span class="n">foram</span><span class="p">,</span><span class="w"> </span><span class="n">Age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">Age_kaBP</span><span class="p">))</span></code></pre>
</figure>
<p>
Note that we don't want to use the <code>Age_AD</code> scale for this as it has the problem of a discontinuity at 0 AD (which doesn't exist).
</p>
<p>
Now we can fit a GAM to the δ<sup>18</sup>O record
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">d18O</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Age</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ad"</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">foram</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w"> </span><span class="n">select</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<p>
In this instance I used an adaptive spline basis, <code>bs = "ad"</code>, which allows the degree of wiggliness to vary along the fitted function. With a relatively large data set like this, which has over 500 observations, using an adaptive smoother can provide a better fit to the observations, and it is especially useful in situations where it is plausible that the response will vary more over some time periods than others. Adaptive smooths aren't going to work well in short time series; there just isn't the information available to estimate what can, in effect, be thought of as several separate splines over small chunks of the data. That said, I've had success with data sets of about 100–200 observations. Also note that fitting an adaptive smoother requires cranking the CPU over a lot more calculations; be aware of that if you throw a very large data set at it.
</p>
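<p>
To get a feel for the extra computation, a rough timing sketch comparing the adaptive fit with a standard thin-plate spline of the same basis dimension:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## adaptive basis, as fitted above
system.time(gam(d18O ~ s(Age, k = 100, bs = "ad"),
                data = foram, method = "REML"))
## default thin-plate regression spline of the same dimension
system.time(gam(d18O ~ s(Age, k = 100),
                data = foram, method = "REML"))</code></pre>
</figure>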
<p>
Also note that the model was fitted using REML; in most cases this is the default you want to be using, as GCV can undersmooth in some circumstances. The double penalty approach of <span class="citation" data-cites="Marra2011-sf">Marra and Wood (2011)</span> is used here too (<code>select = TRUE</code>), which in this instance is being used to apply a bit of shrinkage to the fitted trend; it's good to be a little conservative at times.
</p>
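<p>
If you want to see what the extra penalty is doing, one sketch is to refit without it and compare; <code>m0</code> is a hypothetical name for the unshrunk fit:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## same model without the double penalty; compare the effective degrees
## of freedom and AIC with those of the shrunk fit m
m0 <- gam(d18O ~ s(Age, k = 100, bs = "ad"), data = foram, method = "REML")
AIC(m0, m)</code></pre>
</figure>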
<p>
The model diagnostics look OK for this model and the check of sufficient dimensionality in the basis doesn't indicate anything to worry about (partly because we used a large basis in the first place: <code>k' = 99 = k - 1</code>, one degree of freedom being absorbed by the identifiability constraint)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">gam.check</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="w">
</span><span class="c1">## RStudio users might need</span><span class="w">
</span><span class="c1">## layout(matrix(1:4, ncol = 2, byrow = TRUE))</span><span class="w">
</span><span class="c1">## gam.check(m)</span><span class="w">
</span><span class="c1">## layout(1)</span><span class="w">
</span><span class="c1">## to see all the plots on one device</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Method: REML Optimizer: outer newton
full convergence after 26 iterations.
Gradient range [-0.0003244869,0.000148452]
(score -128.435 & scale 0.03324845).
Hessian positive definite, eigenvalue range [6.249446e-06,279.6116].
Model rank = 100 / 100
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(Age) 99.000 15.539 0.993 0.44</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-revisited-gam-check-1.png" alt="Diagnostic plots for the fitted GAM" />
<figcaption>
Diagnostic plots for the fitted GAM
</figcaption>
</figure>
<p>
and the fitted trend is <em>inconsistent</em> with a null-model of no trend
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
d18O ~ s(Age, k = 100, bs = "ad")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.455329 0.007705 59.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Age) 15.54 99 3.357 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.373 Deviance explained = 39%
-REML = -128.44 Scale est. = 0.033248 n = 560</code></pre>
</figure>
<p>
There is a lot of variation about the fitted trend, but a model with about 15 degrees of freedom explains about 40% of the variance in the data set, which is pretty good.
</p>
<p>
While we could use the provided <code>plot()</code> method for <code>"gam"</code> objects to draw the fitted function, I now find myself preferring plotting with <strong>ggplot2</strong>. To recreate the sort of plot that <code>plot.gam()</code> would produce, we first need to predict for a fine grid of values, here 200 values, over the observed time interval. <code>predict.gam()</code> is used to generate predictions and standard errors; the standard errors requested here use a new addition to <strong>mgcv</strong> which includes the extra uncertainty in the model that arises because we are also estimating the smoothness parameters (the parameters that control the degree of wiggliness in the spline). This is achieved through the use of <code>unconditional = TRUE</code> in the call to <code>predict()</code>. The standard errors you get with the default, <code>unconditional = FALSE</code>, assume that the smoothness parameters, and therefore the amount of wiggliness, are known before fitting, which is rarely the case. This doesn't make much difference in this example, but I thought I'd mention it as it is a relatively new addition to <strong>mgcv</strong>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">foram</span><span class="p">,</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">Age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">-</span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">Age_kaBP</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Age_kaBP</span><span class="p">),</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)))</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">unconditional</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)))</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span><span class="w">
</span><span class="n">Fitted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">,</span><span class="w">
</span><span class="n">Upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">),</span><span class="w">
</span><span class="n">Lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">),</span><span class="w">
</span><span class="n">Age_kaBP</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">Age</span><span class="p">)</span></code></pre>
</figure>
<p>
The code above uses these standard errors to create an approximate 95% point-wise confidence interval around the fitted function, and prepares this in tidy format for plotting with <code>ggplot()</code>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">foram</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">d18O</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Age_kaBP</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Age_kaBP</span><span class="p">,</span><span class="w"> </span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Lower</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Upper</span><span class="p">),</span><span class="w">
</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"grey"</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.7</span><span class="p">,</span><span class="w"> </span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Age_kaBP</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Fitted</span><span class="p">),</span><span class="w"> </span><span class="n">inherit.aes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w">
</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_x_reverse</span><span class="p">(</span><span class="n">sec.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sec_axis</span><span class="p">(</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1950</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">.</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">1000</span><span class="p">),</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Age [AD]"</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_y_reverse</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ylabel</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xlabel</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-revisited-final-plot-1.png" alt="Observed Ī“^18^O values with the fitted trend and 95% point-wise confidence interval superimposed" />
<figcaption>
Observed δ<sup>18</sup>O values with the fitted trend and 95% point-wise confidence interval superimposed
</figcaption>
</figure>
<p>
<span class="citation" data-cites="Taricco2009-eh">Taricco et al.Ā (2009)</span> used singular spectrum analysis (SSA), among other spectral methods, to decompose the Ī“<sup><sup>18</sup></sup>O time series into components of variability with a range of periodicities. A visual comparison with the SSA components and the fitted GAM trend, suggests that the GAM trend maps on to the sum of the long-term trend component plus the ~600 year and (potentially) the 350 year frequency components of the SSA. This does make we wonder a little about how real the higher frequency components identified in the SSA are? No matter how hard I tried (even setting the basis dimension of the GAM to <code>k = 500</code>) I couldnāt get it to be more wiggly than shown in the plots above). Figure 4 of <span class="citation" data-cites="Taricco2009-eh">Taricco et al.Ā (2009)</span> also showed the spectral power for the 4 non-trend components from the SSA. The power associated with the 200-year and the 125-year components is substantially less than that of the two longer-frequency components. The significance of the SSA components was determined using a Monte Carlo approach <span class="citation" data-cites="Allen1996-gc">(Allen and Smith, 1996)</span>, where surrogate time series are generate using AR(1) noise. Itās reasonable to ask whether this is a reasonable null model for these data? Itās also reasonable to ask whether the GAM approach I used above has sufficient statistical power to detect higher-freqency components if they actually exist? This warrants further study.
</p>
<p>
I started this post with some details on why I was prompted to look at this particular data set. Palaeo-scientists have a long record of sharing data (less so in some specific fields: yes, I'm looking at you, and me, palaeolimnologists), but, and perhaps this is just me, I'm seeing more of an open-data culture within palaeoecology and palaeoclimatology. This is great to see, and avenues for publishing, and hence generating traditional academic merit for, the data we generate will only help foster this. With my "editorial board member" hat on, I would encourage people to consider writing a data paper and submitting it to Scientific Data or one of the other data journals that are springing up. But if you can't or don't want to do that, depositing your data in an open repository like Pangaea brings with it many benefits and is something that we, the palaeo community, should be supportive of. I wouldn't have been writing this post if Taricco and co-authors hadn't chosen to make their data openly available.
</p>
<p>
And that brings me on to my final point for this post: having access to an excellent data repository like Pangaea from within a data analysis platform like R makes it so much easier to engage with the literature and ask new and interesting questions. I've highlighted Pangaea here, but other initiatives, like the <a href="http://www.neotomadb.org/">Neotoma</a> database, are doing a great job of making palaeo data available and also deserve our recognition and support; we might take access to these resources for granted, but implementing and maintaining web servers and APIs requires a lot of time, effort, and resources. Also, this post wouldn't have been possible without the work of the wonderful <a href="https://ropensci.org/">rOpenSci</a> community, who make available R packages to query the APIs of online repositories like Pangaea and Neotoma. Thank you!
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Allen1996-gc">
<p>
Allen, M. R., and Smith, L. A. (1996). Monte Carlo SSA: Detecting irregular oscillations in the presence of colored noise. <em>Journal of Climate</em> 9, 3373–3404. doi:<a href="https://doi.org/10.1175/1520-0442(1996)009<3373:MCSDIO>2.0.CO;2">10.1175/1520-0442(1996)009<3373:MCSDIO>2.0.CO;2</a>.
</p>
</div>
<div id="ref-pangaear-024">
<p>
Chamberlain, S., Woo, K., MacDonald, A., Zimmerman, N., and Simpson, G. (2016). <em>Pangaear: Client for the 'Pangaea' database</em>. Available at: <a href="https://CRAN.R-project.org/package=pangaear">https://CRAN.R-project.org/package=pangaear</a>.
</p>
</div>
<div id="ref-Marra2011-sf">
<p>
Marra, G., and Wood, S. N. (2011). Practical variable selection for generalized additive models. <em>Computational Statistics & Data Analysis</em> 55, 2372–2387. doi:<a href="https://doi.org/10.1016/j.csda.2011.02.004">10.1016/j.csda.2011.02.004</a>.
</p>
</div>
<div id="ref-Taricco2016-pv">
<p>
Taricco, C., Alessio, S., Rubinetti, S., Vivaldo, G., and Mancuso, S. (2016). A foraminiferal δ<sup>18</sup>O record covering the last 2,200 years. <em>Scientific Data</em> 3, 160042. doi:<a href="https://doi.org/10.1038/sdata.2016.42">10.1038/sdata.2016.42</a>.
</p>
</div>
<div id="ref-Taricco2009-eh">
<p>
Taricco, C., Ghil, M., Alessio, S., and Vivaldo, G. (2009). Two millennia of climate variability in the central Mediterranean. <em>Climate of the Past</em> 5, 171–181. doi:<a href="https://doi.org/10.5194/cp-5-171-2009">10.5194/cp-5-171-2009</a>.
</p>
</div>
</div>
Simultaneous intervals for smooths revisited
Gavin L. Simpson
2016-12-15T00:00:00-06:00
2016-12-15T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/12/15/simultaneous-interval-revisited/
<div id="refs" class="references">
<div id="ref-Ruppert2003-pt">
<p>
Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). <em>Semiparametric regression</em>. Cambridge University Press.
</p>
</div>
</div>
<p>
Eighteen months ago I <a href="https://www.fromthebottomoftheheap.net/2014/06/16/simultaneous-confidence-intervals-for-derivatives/">wrote a post</a> in which I described the use of simulation from the posterior distribution of a fitted GAM to derive simultaneous confidence intervals for the derivatives of a penalised spline. It was a nice post that attracted some interest. It was also wrong. I have no idea what I was thinking when I thought the intervals described in that post were simultaneous. Here I hope to rectify that past mistake.
</p>
<p>
I'll tackle the issue of simultaneous intervals for the derivatives of a penalised spline in a follow-up post. Here, I demonstrate one way to compute a simultaneous interval for a penalised spline in a fitted GAM. As example data, I'll use the strontium isotope data set included in the <strong>SemiPar</strong> package, which is extensively analyzed in the monograph <em>Semiparametric Regression</em> <span class="citation" data-cites="Ruppert2003-pt">(Ruppert et al., 2003)</span>. First, load the packages we'll need as well as the data, which is data set <code>fossil</code>. If you don't have <strong>SemiPar</strong> installed, install it using <code>install.packages("SemiPar")</code> before proceeding
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">data</span><span class="p">(</span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">package</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"SemiPar"</span><span class="p">)</span></code></pre>
</figure>
<p>
The <code>fossil</code> data set includes two variables and is a time series of strontium isotope measurements on samples from a sediment core. The data are shown below using <code>ggplot()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">strontium.ratio</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-revisited-plot-fossil-data-1.png" alt="The strontium isotope example data used in the post" />
<figcaption>
The strontium isotope example data used in the post
</figcaption>
</figure>
<p>
The aim of the analysis of these data is to model how the measured strontium isotope ratio changed through time, using a GAM to estimate the clearly non-linear change in the response. I won’t cover how the GAM is fitted and what all the options are here, but a reasonable GAM for these data is fitted using <strong>mgcv</strong> and <code>gam()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">strontium.ratio</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span></code></pre>
</figure>
<p>
The essentially arbitrary default for <code>k</code>, the basis dimension of the spline, is changed to <code>20</code> as there is a modest amount of non-linearity in the strontium isotope ratio time series. By using <code>method = "REML"</code>, the penalised spline model is expressed as a linear mixed model with the wiggly bits of the spline treated as random effects, and is estimated using restricted maximum likelihood; <code>method = "ML"</code> would also work here.
</p>
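<p>
To see this mixed-model representation explicitly, we can fit the same model with <code>gamm()</code>, which returns both a <code>gam</code> object and the underlying <code>lme</code> fit. This is only a sketch to illustrate the equivalence; the fitted smooth should match the one from <code>gam()</code> above.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Sketch only: the same penalised spline expressed explicitly as a
## linear mixed model; the wiggly basis functions are treated as
## random effects in the lme component of the returned object
m2 <- gamm(strontium.ratio ~ s(age, k = 20), data = fossil, method = "REML")
summary(m2$lme) # the mixed-model form of the fit</code></pre>
</figure>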
<p>
The fitted model uses ~12 effective degrees of freedom (which wouldn’t have been achievable with the default of <code>k = 10</code>!)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
strontium.ratio ~ s(age, k = 20)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.074e-01 2.435e-06 290527 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(age) 11.52 13.88 62.07 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.891 Deviance explained = 90.3%
-REML = -932.05 Scale est. = 6.2839e-10 n = 106</code></pre>
</figure>
<p>
The fitted spline captures the main variation in strontium isotope ratio values; the output from <code>plot.gam()</code> is shown below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">shade</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">seWithMean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">residuals</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-revisited-gam-plot-1.png" alt="The fitted penalised spline with approximate 95% point-wise confidence interval, as produced with plot.gam()" />
<figcaption>
The fitted penalised spline with approximate 95% point-wise confidence interval, as produced with <code>plot.gam()</code>
</figcaption>
</figure>
<p>
The confidence interval shown around the fitted spline is a 95% Bayesian credible interval. For reasons that don’t need to concern us right now, this interval has a surprising frequentist interpretation as a 95% <em>“across the function”</em> interval <span class="citation" data-cites="Nychka1988-rz Marra2012-bq">(Marra and Wood, 2012; Nychka, 1988)</span>; under repeated resampling from the population 95% of such confidence intervals will contain the true function. Such “across the function” intervals are quite intuitive, but, as we’ll see shortly, they don’t reflect the uncertainty in the fitted function; far fewer than 95% of splines drawn from the posterior distribution of the fitted GAM would lie within the confidence interval shown in the plot above.
</p>
<p>
How to compute a simultaneous interval for a spline is a well studied problem and a number of solutions have been proposed in the literature. Here I follow <span class="citation" data-cites="Ruppert2003-pt">Ruppert et al. (2003)</span> and use a simulation-based approach to generate a simultaneous interval. We proceed by considering a simultaneous confidence interval for a function <span class="math inline">(f(x))</span> at a set of <span class="math inline">(M)</span> locations in <span class="math inline">(x)</span>; we’ll refer to these locations, following the notation of <span class="citation" data-cites="Ruppert2003-pt">Ruppert et al. (2003)</span>, by
</p>
<p>
<span class="math display">[ = (g_1, g_2, , g_M) ]</span>
</p>
<p>
The true function over <span class="math inline">(\mathbf{g})</span>, <span class="math inline">(\mathbf{f_g})</span>, is defined as the vector of evaluations of <span class="math inline">(f)</span> at each of the <span class="math inline">(M)</span> locations
</p>
<p>
<span class="math display">[ <span class="math display">\[\begin{align}
\mathbf{f_g} &amp;\equiv \begin{bmatrix}
f(g_1) \\
f(g_2) \\
\vdots \\
f({g_M}) \\
\end{bmatrix}
\end{align}\]</span>
</p>
<p>
and the corresponding estimate of the true function given by the fitted GAM as <span class="math inline">(\mathbf{\hat{f}_g})</span>. The difference between the true function and our unbiased estimator is given by
</p>
<p>
<span class="math display">[ <span class="math display">\[\begin{align}
\mathbf{\hat{f}_g} - \mathbf{f_g} &amp;= \mathbf{C_g} \begin{bmatrix}
\boldsymbol{\hat{\beta}} - \boldsymbol{\beta} \\
\mathbf{\hat{u}} - \mathbf{u} \\
\end{bmatrix}
\end{align}\]</span>
</p>
<p>
where <span class="math inline">()</span> is the evaluation of the basis functions at the locations <span class="math inline">()</span>, and the thing in square brackets is the bias in the estimated model coefficients, which we assume to be mean 0 and follows, approximately, a multivariate normal distribution with mean vector <span class="math inline">()</span> and covariance matrix <span class="math inline">()</span>
</p>
<p>
<span class="math display">[
<span class="math display">\[\begin{bmatrix}
\boldsymbol{\hat{\beta}} - \boldsymbol{\beta} \\
\mathbf{\hat{u}} - \mathbf{u} \\
\end{bmatrix}\]</span>
N (, ) ]</span>
</p>
<p>
Having got those definitions out of the way, the 100(1 - <span class="math inline">(\alpha)</span>)% simultaneous confidence interval is
</p>
<p>
<span class="math display">[ <span class="math display">\[\begin{align}
\mathbf{\hat{f}_g} &amp;\pm m_{1 - \alpha} \begin{bmatrix}
\widehat{\mathrm{st.dev}} (\hat{f}(g_1) - f(g_1)) \\
\widehat{\mathrm{st.dev}} (\hat{f}(g_2) - f(g_2)) \\
\vdots \\
\widehat{\mathrm{st.dev}} (\hat{f}(g_M) - f(g_M)) \\
\end{bmatrix}
\end{align}\]</span>
</p>
<p>
where <span class="math inline">(m_{1 - })</span> is the 1 - <span class="math inline">()</span> quantile of the random variable
</p>
<p>
<span class="math display">[ <em>{x } | | </em>{1 M} | | ]</span>
</p>
<p>
Yep, that was <em>exactly</em> my reaction when I first read this section of <span class="citation" data-cites="Ruppert2003-pt">Ruppert et al. (2003)</span>!
</p>
<p>
Let’s deal with the left-hand side of the equation first. The <span class="math inline">(\sup)</span> refers to the <em>supremum</em> or the <em>least upper bound</em>: the least value that is <em>greater</em> than (or equal to) every value in the set under consideration, here the absolute standardized deviations over <span class="math inline">(\mathcal{X})</span>, the set of all values from which we observed our subset <span class="math inline">(x)</span>. Often this is simply the maximum value of the set. This is what is indicated by the right-hand side of the equation; we want the maximum (absolute) value of the ratio over all values in <span class="math inline">(\mathbf{g})</span>.
</p>
<p>
The fractions on both sides of the equation correspond to the standardized deviation between the true function and the model estimate, and we consider the <em>maximum absolute</em> standardized deviation. We don’t usually know the distribution of the maximum absolute standardized deviation, but we need it to access its quantiles. However, we can closely approximate the distribution via simulation. The difference here is that rather than simulating from the posterior of the model as we have done in earlier posts on this blog, this time we simulate from the multivariate normal distribution with mean vector <span class="math inline">(\mathbf{0})</span> and covariance matrix <span class="math inline">(\mathbf{V_b})</span>, the Bayesian covariance matrix of the fitted model. For each simulation we find the maximum absolute standardized deviation of the fitted function from the true function over the grid of <span class="math inline">(x)</span> values we are considering. Then we collect all these maxima, sort them, and either take the 1 - <span class="math inline">(\alpha)</span> probability quantile of the maxima, or the maximum with rank <span class="math inline">(\lceil (1 - \alpha) N \rceil)</span>.
</p>
<p>
OK, that’s enough words and crazy equations. Implementing this in R is going to be easier than those equations might suggest. I’ll run through the code we need line by line. First we define a simple function to generate random values from a multivariate normal: this is in the manual for <strong>mgcv</strong> and saves us loading another package just for this:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">rmvn</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="p">,</span><span class="w"> </span><span class="n">sig</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1">## MVN random deviates</span><span class="w">
</span><span class="n">L</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mroot</span><span class="p">(</span><span class="n">sig</span><span class="p">)</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">L</span><span class="p">)</span><span class="w">
</span><span class="n">t</span><span class="p">(</span><span class="n">mu</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">L</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">rnorm</span><span class="p">(</span><span class="n">m</span><span class="o">*</span><span class="n">n</span><span class="p">),</span><span class="w"> </span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
Next we extract a few things that we need from the fitted GAM
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">Vb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vcov</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">fossil</span><span class="p">,</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">age</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">age</span><span class="p">),</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)))</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">se.fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">se.fit</span></code></pre>
</figure>
<p>
The first is the Bayesian covariance matrix of the model coefficients, <span class="math inline">(\mathbf{V_b})</span>. This <span class="math inline">(\mathbf{V_b})</span> is conditional upon the smoothing parameter(s). If you want a version that adjusts for the smoothing parameters being estimated rather than known values, add <code>unconditional = TRUE</code> to the <code>vcov()</code> call. Second, we define our grid of <span class="math inline">(x)</span> values over which we want a confidence band. Then we generate predictions and standard errors from the model for the grid of values. The last line just extracts out the standard errors of the fitted values for use later.
</p>
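<p>
For example, a minimal variant of the extraction above that uses the corrected covariance matrix would be:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## version of the Bayesian covariance matrix that accounts for the
## smoothing parameters having been estimated rather than known
Vb2 <- vcov(m, unconditional = TRUE)</code></pre>
</figure>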
<p>
Now we are ready to generate simulations of the maximum absolute standardized deviation of the fitted model from the true model. We set the pseudo-random seed to make the results reproducible and specify the number of simulations to generate.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10000</span></code></pre>
</figure>
<p>
Next, we want <code>N</code> draws from <span class="math inline">(
<span class="math display">\[\begin{bmatrix} \boldsymbol{\hat{\beta}} - \boldsymbol{\beta} \\ \mathbf{\hat{u}} - \mathbf{u} \\ \end{bmatrix}\]</span>
)</span>, which is approximately distributed multivariate normal with mean vector <span class="math inline">(\mathbf{0})</span> and covariance matrix <code>Vb</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">BUdiff</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rmvn</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">Vb</span><span class="p">)),</span><span class="w"> </span><span class="n">sig</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Vb</span><span class="p">)</span></code></pre>
</figure>
<p>
Now we calculate <span class="math inline">((x) - f(x))</span>, which is given by <span class="math inline">(
<span class="math display">\[\begin{bmatrix} \boldsymbol{\hat{\beta}} - \boldsymbol{\beta} \\ \mathbf{\hat{u}} - \mathbf{u} \\ \end{bmatrix}\]</span>
)</span> evaluated at the grid of <span class="math inline">(x)</span> values
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">Cg</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lpmatrix"</span><span class="p">)</span><span class="w">
</span><span class="n">simDev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Cg</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">BUdiff</span><span class="p">)</span></code></pre>
</figure>
<p>
The first line evaluates the basis functions at <span class="math inline">(\mathbf{g})</span> and the second line computes the deviations between the fitted and true functions at those locations. Then we find the absolute values of the standardized deviations from the true model. Here we do this in a single step for all simulations using <code>sweep()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">absDev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">abs</span><span class="p">(</span><span class="n">sweep</span><span class="p">(</span><span class="n">simDev</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="p">,</span><span class="w"> </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"/"</span><span class="p">))</span></code></pre>
</figure>
<p>
The maximum of the absolute standardized deviations at the grid of <span class="math inline">(x)</span> values for each simulation is computed via an <code>apply()</code> call
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">masd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">absDev</span><span class="p">,</span><span class="w"> </span><span class="m">2L</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="p">)</span></code></pre>
</figure>
<p>
The last step is to find the critical value used to scale the standard errors to yield the simultaneous interval; here we calculate the critical value for a 95% simultaneous confidence interval/band
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">crit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">quantile</span><span class="p">(</span><span class="n">masd</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.95</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8</span><span class="p">)</span></code></pre>
</figure>
<p>
The critical value estimated above is 3.205. Intervals generated using this value will be roughly 1.6 times wider than the point-wise interval shown above, which used a multiplier of 2 on the standard error (3.205 / 2 ≈ 1.6).
</p>
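<p>
As a quick sanity check on that statement, compare the simultaneous critical value with the point-wise multiplier of 2:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## ratio of simultaneous to point-wise interval half-widths
unname(crit) / 2 # ~1.6</code></pre>
</figure>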
<p>
Now that we have the critical value, we can calculate the simultaneous confidence interval. In the code block below I first add the grid of values (<code>newd</code>) to the fitted values and standard errors at those new values and then augment this with upper and lower limits for a 95% simultaneous confidence interval (<code>uprS</code> and <code>lwrS</code>), as well as the usual 95% point-wise intervals for comparison (<code>uprP</code> and <code>lwrP</code>). Then I plot the two intervals:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">pred</span><span class="p">),</span><span class="w"> </span><span class="n">newd</span><span class="p">),</span><span class="w">
</span><span class="n">uprP</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">),</span><span class="w">
</span><span class="n">lwrP</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">),</span><span class="w">
</span><span class="n">uprS</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">),</span><span class="w">
</span><span class="n">lwrS</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se.fit</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwrS</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">uprS</span><span class="p">),</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwrP</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">uprP</span><span class="p">),</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Strontium isotope ratio"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Age [Ma BP]"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-revisited-confidence-intervals-1.png" alt="Comparison of point-wise and simultaneous 95% confidence intervals for the fitted GAM" />
<figcaption>
Comparison of point-wise and simultaneous 95% confidence intervals for the fitted GAM
</figcaption>
</figure>
<p>
Finally, I’m going to look at the coverage properties of the interval we just created, which is something I should have done in the older post as it would have shown, as we’ll see, that the old interval I wrote about wasn’t even close to having the correct coverage properties.
</p>
<p>
Start by drawing a large sample from the posterior distribution of the fitted model. Note that this time, we’re simulating from a multivariate normal with mean vector given by the estimated model coefficients
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">sims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rmvn</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">m</span><span class="p">),</span><span class="w"> </span><span class="n">sig</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Vb</span><span class="p">)</span><span class="w">
</span><span class="n">fits</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Cg</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">sims</span><span class="p">)</span></code></pre>
</figure>
<p>
<code>fits</code> now contains N = 10<sup>4</sup> draws from the model posterior. Before we look at how many of the 10<sup>4</sup> samples from the posterior are entirely contained within the simultaneous interval, choose 30 at random and stack them in so-called tidy form for use with <code>ggplot()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">nrnd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">30</span><span class="w">
</span><span class="n">rnd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">nrnd</span><span class="p">)</span><span class="w">
</span><span class="n">stackFits</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">stack</span><span class="p">(</span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">fits</span><span class="p">[,</span><span class="w"> </span><span class="n">rnd</span><span class="p">]))</span><span class="w">
</span><span class="n">stackFits</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">stackFits</span><span class="p">,</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">newd</span><span class="o">$</span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">rnd</span><span class="p">)))</span></code></pre>
</figure>
<p>
What we’ve done in this post can be summarized in the figure below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fit</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwrS</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">uprS</span><span class="p">),</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwrP</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">uprP</span><span class="p">),</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">(</span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stackFits</span><span class="p">,</span><span class="w"> </span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">values</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ind</span><span class="p">),</span><span class="w">
</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.4</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"grey20"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Strontium isotope ratio"</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Age [Ma BP]"</span><span class="p">,</span><span class="w">
</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Point-wise & Simultaneous 95% confidence intervals for fitted GAM"</span><span class="p">,</span><span class="w">
</span><span class="n">subtitle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sprintf</span><span class="p">(</span><span class="s2">"Each line is one of %i draws from the Bayesian posterior distribution of the model"</span><span class="p">,</span><span class="w"> </span><span class="n">nrnd</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-intervals-revisited-plot-intervals-and-posterior-draws-1.png" alt="Summary plot showing 30 random draws from the model posterior and approximate 95% simultaneous and point-wise confidence intervals for the the fitted GAM" />
<figcaption>
Summary plot showing 30 random draws from the model posterior and approximate 95% simultaneous and point-wise confidence intervals for the fitted GAM
</figcaption>
</figure>
<p>
It shows the fitted model and the 95% simultaneous and point-wise confidence intervals, and is augmented with 30 draws from the posterior distribution of the GAM. As you can see, many of the lines lie outside the point-wise confidence interval. The situation is quite different with the simultaneous interval; only a couple of the posterior draws go outside of the 95% simultaneous interval, which is what we’d expect for a 95% interval. So that’s encouraging!
</p>
<p>
As a final check we’ll look at the proportion of all the posterior simulations that lie entirely within the simultaneous interval. To facilitate this we create a little wrapper function, <code>inCI()</code>, which returns <code>TRUE</code> if all the evaluation points <span class="math inline">(\mathbf{g})</span> lie within the stated interval and <code>FALSE</code> otherwise. This is then applied to each posterior simulation (column of <code>fits</code>) and we do this for the simultaneous intervals and the point-wise version. The final two lines work out what proportion of the posterior simulations lie within the two confidence intervals.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">inCI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">upr</span><span class="p">,</span><span class="w"> </span><span class="n">lwr</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">all</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">lwr</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">upr</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">fitsInPCI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">fits</span><span class="p">,</span><span class="w"> </span><span class="m">2L</span><span class="p">,</span><span class="w"> </span><span class="n">inCI</span><span class="p">,</span><span class="w"> </span><span class="n">upr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">uprP</span><span class="p">,</span><span class="w"> </span><span class="n">lwr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">lwrP</span><span class="p">)</span><span class="w">
</span><span class="n">fitsInSCI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">fits</span><span class="p">,</span><span class="w"> </span><span class="m">2L</span><span class="p">,</span><span class="w"> </span><span class="n">inCI</span><span class="p">,</span><span class="w"> </span><span class="n">upr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">uprS</span><span class="p">,</span><span class="w"> </span><span class="n">lwr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">lwrS</span><span class="p">)</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">fitsInPCI</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">fitsInPCI</span><span class="p">)</span><span class="w"> </span><span class="c1"># Point-wise</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">fitsInSCI</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">fitsInSCI</span><span class="p">)</span><span class="w"> </span><span class="c1"># Simultaneous</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0.3028
[1] 0.9526</code></pre>
</figure>
<p>
As you can see, the point-wise confidence interval includes just a small proportion of the posterior simulations, but the simultaneous interval contains approximately the right number of simulations for a 95% interval.
</p>
<p>
So how bad are the intervals I created in the old post? They should be as bad as the 95% point-wise interval, and they are
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">oldCI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">fits</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">,</span><span class="w"> </span><span class="n">quantile</span><span class="p">,</span><span class="w"> </span><span class="n">probs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">))</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">lwrOld</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">oldCI</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="n">uprOld</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">oldCI</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="p">])</span><span class="w">
</span><span class="n">fitsInOldCI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">fits</span><span class="p">,</span><span class="w"> </span><span class="m">2L</span><span class="p">,</span><span class="w"> </span><span class="n">inCI</span><span class="p">,</span><span class="w"> </span><span class="n">upr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">uprOld</span><span class="p">,</span><span class="w"> </span><span class="n">lwr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">lwrOld</span><span class="p">)</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">fitsInOldCI</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">fitsInOldCI</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 0.2655</code></pre>
</figure>
<p>
So, there we have it – a proper 95% simultaneous confidence interval for a penalised spline. Now I just need to go back to that old post and strike out all reference to <em>simultaneous</em>…
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Marra2012-bq">
<p>
Marra, G., and Wood, S. N. (2012). Coverage properties of confidence intervals for generalized additive model components. <em>Scandinavian Journal of Statistics</em> 39, 53–74. doi:<a href="https://doi.org/10.1111/j.1467-9469.2011.00760.x">10.1111/j.1467-9469.2011.00760.x</a>.
</p>
</div>
<div id="ref-Nychka1988-rz">
<p>
Nychka, D. (1988). Bayesian confidence intervals for smoothing splines. <em>Journal of the American Statistical Association</em> 83, 1134–1143. doi:<a href="https://doi.org/10.1080/01621459.1988.10478711">10.1080/01621459.1988.10478711</a>.
</p>
</div>
<div id="ref-Ruppert2003-pt">
<p>
Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). <em>Semiparametric regression</em>. Cambridge University Press.
</p>
</div>
</div>
ISEC 2016 Talk
Gavin L. Simpson
2016-07-02T00:00:00-06:00
2016-07-02T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/07/02/isec-2016-talk/
<p>
My ISEC 2016 talk, <em>Estimating temporal change in mean and variance of community composition via location, scale additive models</em>, describes some of my recent research into methods to analyse palaeoenvironmental time series from sediment cores.
</p>
<p>
Using data from two varved lakes
</p>
<ul>
<li>
Lake 227, Experimental Lakes Area, Ontario, Canada, and
</li>
<li>
Baldeggersee, Switzerland,
</li>
</ul>
<p>
I use location scale generalised additive models to simultaneously model the mean (trend) and the variance of time series of fossil algal pigments and diatom counts.
</p>
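<p>
As a rough sketch of the idea only (not the actual models from the talk; the data and variable names below are simulated, hypothetical stand-ins), such a model can be fitted in <strong>mgcv</strong> with the <code>gaulss()</code> family, where one linear predictor models the mean and a second models the standard deviation:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Minimal sketch of a location-scale GAM; `pigment` and `year` are
## hypothetical, simulated stand-ins for the real data from the talk
library("mgcv")
set.seed(1)
year <- 1:200
pigment <- rnorm(200, mean = 10 + 0.02 * year, sd = exp(0.5 + 0.005 * year))
df <- data.frame(pigment = pigment, year = year)
## first formula: the mean; second: the (log-scale) standard deviation
m_ls <- gam(list(pigment ~ s(year), ~ s(year)), family = gaulss(), data = df)
summary(m_ls)</code></pre>
</figure>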
<p>
These techniques may be applied to data from less ideal situations, where observations are irregularly sampled in time and have varying sample intervals/effects of time averaging.
</p>
<p>
The slide deck can be downloaded from <a href="https://doi.org/10.6084/m9.figshare.3470144.v1">Figshare</a>.
</p>
<div style="margin-left: auto; margin-right: auto; width: 700px; height: 716px;">
<p>
<iframe src="https://widgets.figshare.com/articles/3470144/embed?show_title=1" width="700" height="716" frameborder="0">
</iframe>
</p>
</div>
Rootograms
Gavin L. Simpson
2016-06-07T00:00:00-06:00
2016-06-07T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/06/07/rootograms/
<p>
Assessing the fit of a count regression model is not necessarily a straightforward enterprise; often we just look at residuals, which invariably contain patterns of some form due to the discrete nature of the observations, or we plot observed versus fitted values as a scatter plot. Recently, while perusing the latest statistics offerings on arXiv I came across <span class="citation" data-cites="Kleiber2016-pt">Kleiber and Zeileis (2016)</span> who propose the <em>rootogram</em> as an improved approach to the assessment of fit of a count regression model. <a href="http://arxiv.org/abs/1605.01311">The paper</a> is illustrated using R and the authors’ <strong>countreg</strong> package (currently on R-Forge only). Here, I thought I’d take a quick look at the rootogram with some simulated species abundance data.
</p>
<div id="refs" class="references">
<div id="ref-Kleiber2016-pt">
<p>
Kleiber, C., and Zeileis, A. (2016). Visualizing count data regressions using rootograms. Available at: <a href="http://arxiv.org/abs/1605.01311">http://arxiv.org/abs/1605.01311</a>.
</p>
</div>
</div>
<p>
Start by simulating some data to work with. Here I use my <strong>coenocliner</strong> package, and simulate three data sets, each of which uses the same environmental gradient, but with counts drawn from the following distributions
</p>
<ol type="1">
<li>
Poisson
</li>
<li>
Negative binomial
</li>
<li>
Zero-inflated negative binomial
</li>
</ol>
<p>
To follow along here you’ll need the latest version of <strong>coenocliner</strong> from CRAN (>= 0.2-2), as a bug crept into my code when changing between parameterizations of the negative binomial.
</p>
<p>
Load <strong>coenocliner</strong> and set up
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"coenocliner"</span><span class="p">)</span><span class="w">
</span><span class="c1">## parameters for simulating</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">locs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="c1"># environmental locations</span><span class="w">
</span><span class="n">A0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">90</span><span class="w"> </span><span class="c1"># maximal abundance</span><span class="w">
</span><span class="n">mu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="c1"># position on gradient of optima</span><span class="w">
</span><span class="n">alpha</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1.5</span><span class="w"> </span><span class="c1"># parameter of beta response</span><span class="w">
</span><span class="n">gamma</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="c1"># parameter of beta response</span><span class="w">
</span><span class="n">r</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">6</span><span class="w"> </span><span class="c1"># range on gradient species is present</span><span class="w">
</span><span class="n">pars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mu</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gamma</span><span class="p">,</span><span class="w"> </span><span class="n">A0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">A0</span><span class="p">)</span><span class="w">
</span><span class="n">nb.alpha</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1.5</span><span class="w"> </span><span class="c1"># overdispersion parameter 1/theta</span><span class="w">
</span><span class="n">zprobs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0.3</span><span class="w"> </span><span class="c1"># prob(y == 0) in binomial model</span></code></pre>
</figure>
<p>
Now we can simulate counts for the 100 locations along the gradient for each of the three count models
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pois</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"beta"</span><span class="p">,</span><span class="w"> </span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w"> </span><span class="n">countModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"poisson"</span><span class="p">)</span><span class="w">
</span><span class="n">nb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"beta"</span><span class="p">,</span><span class="w"> </span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w"> </span><span class="n">countModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"negbin"</span><span class="p">,</span><span class="w">
</span><span class="n">countParams</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nb.alpha</span><span class="p">))</span><span class="w">
</span><span class="n">zinb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"beta"</span><span class="p">,</span><span class="w"> </span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w"> </span><span class="n">countModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ZINB"</span><span class="p">,</span><span class="w">
</span><span class="n">countParams</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nb.alpha</span><span class="p">,</span><span class="w"> </span><span class="n">zprobs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">zprobs</span><span class="p">))</span></code></pre>
</figure>
<p>
and combine them into a data frame with the gradient locations
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">cbind.data.frame</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">pois</span><span class="p">,</span><span class="w"> </span><span class="n">nb</span><span class="p">,</span><span class="w"> </span><span class="n">zinb</span><span class="p">),</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="s2">"x"</span><span class="p">,</span><span class="w"> </span><span class="s2">"yPois"</span><span class="p">,</span><span class="w"> </span><span class="s2">"yNegBin"</span><span class="p">,</span><span class="w"> </span><span class="s2">"yZINB"</span><span class="p">))</span></code></pre>
</figure>
<p>
To each of these I'm going to fit a Poisson GLM. Because we know the true data-generating process in each case, the resulting rootograms will show how they can facilitate model evaluation, and what happens when the wrong model, in this case a Poisson GLM, is fitted to the data
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">glm.pois</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="n">yPois</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">poisson</span><span class="p">)</span><span class="w">
</span><span class="n">glm.nb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="n">yNegBin</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">poisson</span><span class="p">)</span><span class="w">
</span><span class="n">glm.zinb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="n">yZINB</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">poisson</span><span class="p">)</span></code></pre>
</figure>
<p>
In each case, a Poisson GLM was fitted even though we know that for <code>yNegBin</code> and <code>yZINB</code> the data-generating process was not Poisson.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>
</p>
<p>
Next, generate rootograms for each of these models. I start by loading the <em>countreg</em> package as well as <strong>ggplot2</strong>, as I'll plot the rootograms using the latter rather than base graphics. If you don't have <em>countreg</em> installed, install it from R-Forge using <code>install.packages("countreg", repos = "http://R-Forge.R-project.org")</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"countreg"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span></code></pre>
</figure>
<p>
Rootograms are calculated using the <code>rootogram()</code> function. You can provide the observed and expected (given the model) counts as arguments to <code>rootogram()</code> or, most usefully for our purposes, a fitted count model object from which the relevant values will be extracted. <code>rootogram()</code> knows about <code>glm</code>, <code>gam</code>, <code>gamlss</code>, <code>hurdle</code>, and <code>zeroinfl</code> objects at the time of writing.
</p>
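<p>
To get a feel for what <code>rootogram()</code> computes from a fitted model, here is a minimal sketch in base R of the observed and expected frequencies for each count bin, using the Poisson GLM fitted above. The variable names (<code>lambda</code>, <code>bins</code>, <code>expd</code>, <code>obsd</code>) are mine for illustration; they are not part of <em>countreg</em>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## illustrative sketch only; the names here are not part of countreg
lambda <- fitted(glm.pois)              # per-observation Poisson means
bins <- 0:max(df$yPois)                 # count bins 0, 1, 2, ...
## expected frequency of bin k: sum over observations of Pr(y_i == k)
expd <- sapply(bins, function(k) sum(dpois(k, lambda)))
## observed frequency of each bin
obsd <- as.numeric(table(factor(df$yPois, levels = bins)))
## a hanging rootogram draws bars of height sqrt(obsd) hanging from the
## curve sqrt(expd), so the base of each bar sits at sqrt(expd) - sqrt(obsd)
head(data.frame(bin = bins, observed = obsd, expected = round(expd, 2)))</code></pre>
</figure>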
<p>
Three different kinds of rootograms are discussed in the paper
</p>
<ol type="1">
<li>
Standing,
</li>
<li>
Hanging, and
</li>
<li>
Suspended.
</li>
</ol>
<p>
<span class="citation" data-cites="Kleiber2016-pt">Kleiber and Zeileis (2016)</span> recommend <em>hanging</em> or <em>suspended</em> rootograms, for reasons I'll mention shortly. Which type of rootogram is produced is controlled via argument <code>style</code>. The final option I use below is <code>plot = FALSE</code>, which suppresses plotting of the rootogram as I want to do that later using <em>ggplot</em>.
</p>
<p>
Generate the three rootograms
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">root.pois</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rootogram</span><span class="p">(</span><span class="n">glm.pois</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"hanging"</span><span class="p">,</span><span class="w"> </span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">root.nb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rootogram</span><span class="p">(</span><span class="n">glm.nb</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"hanging"</span><span class="p">,</span><span class="w"> </span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">root.zinb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rootogram</span><span class="p">(</span><span class="n">glm.zinb</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"hanging"</span><span class="p">,</span><span class="w"> </span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<p>
and gather them into an object for plotting; notice I'm using the <code>autoplot()</code> method to generate <em>ggplot2</em> plot objects, and adjusting the limits to make the plots comparable. The resulting figure is shown below the code
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ylims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ylim</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">7</span><span class="p">)</span><span class="w"> </span><span class="c1"># common scale for comparison</span><span class="w">
</span><span class="n">plot_grid</span><span class="p">(</span><span class="n">autoplot</span><span class="p">(</span><span class="n">root.pois</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylims</span><span class="p">,</span><span class="w"> </span><span class="n">autoplot</span><span class="p">(</span><span class="n">root.nb</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylims</span><span class="p">,</span><span class="w">
</span><span class="n">autoplot</span><span class="p">(</span><span class="n">root.zinb</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylims</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"auto"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/rootograms-plot-rootograms-1.png" alt="Hanging rootograms for a Poisson GLM fitted to simulated Poisson (a), negative binomial (b), and zero-inflated negative binomial (c) count data" />
<figcaption>
Hanging rootograms for a Poisson GLM fitted to simulated Poisson (a), negative binomial (b), and zero-inflated negative binomial (c) count data
</figcaption>
</figure>
<p>
Looking first at panel <strong>a</strong> we see the main features of the rootogram:
</p>
<ul>
<li>
<em>expected</em> counts, given the model, are shown by the thick red line,
</li>
<li>
<em>observed</em> counts are shown as bars, which in a <em>hanging</em> rootogram are shown hanging from the red line of expected counts,
</li>
<li>
on the <em>x</em>-axis we have the count bins (0, 1, 2, etc.),
</li>
<li>
on the <em>y</em>-axis we have the square root of the observed or expected count; the square-root transformation allows departures from expectations to be seen even at small frequencies (see the short illustration after this list), and
</li>
<li>
a reference line is drawn at a height of 0.
</li>
</ul>
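<p>
As a quick illustration of why the square-root scale helps (toy numbers of my own, not from the paper): a departure of three counts carries much more evidence against the model in a rare bin than in a common one, and the square-root transformation, which approximately stabilises the variance of counts, scales the bars accordingly
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## toy numbers: the same discrepancy of 3 counts on the square-root scale
sqrt(4)   - sqrt(1)    # rare bin:   observed 4 vs expected 1    -> 1.00
sqrt(104) - sqrt(101)  # common bin: observed 104 vs expected 101 -> ~0.15</code></pre>
</figure>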
<p>
Because this is a <em>hanging</em> rootogram, we can think of the rootogram as relating to the <em>fitted</em> counts: if a bar doesn't reach the zero line then the model <em>over predicts</em> a particular count bin, and if the bar exceeds the zero line it <em>under predicts</em>.
</p>
<p>
For the Poisson GLM fitted to counts generated from a Poisson distribution (panel a) we see generally good agreement between the expected and observed counts, with a small amount of under prediction of some counts between 10 and 20. For the Poisson GLM fitted to the data generated from a negative binomial distribution (panel b) we see a much poorer fit: the zero count is under predicted whilst some low counts are over predicted, and a large number of count bins between 4 and 10 counts are under predicted. Focusing on the bottom of the bars we see an undulating pattern, with runs either above or below the zero reference line, highlighting a general lack of fit in the model.
</p>
<p>
The fit of the Poisson GLM to data generated using a ZINB also shows considerable model lack of fit; strong under prediction of the zero bin and over prediction of the 1 count bin, with perhaps some general over prediction across most bins.
</p>
<p>
It is useful to compare rootograms showing the fits of incorrect and correct models side by side. To that end, I next fit a negative binomial GLM and a ZINB model, using the <code>glm.nb()</code> function from package <strong>MASS</strong> and the <code>zeroinfl()</code> function from package <em>countreg</em> respectively, and create the relevant rootograms
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"MASS"</span><span class="p">)</span><span class="w">
</span><span class="n">glm2.nb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glm.nb</span><span class="p">(</span><span class="n">yNegBin</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">)</span><span class="w">
</span><span class="n">glm2.zinb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">zeroinfl</span><span class="p">(</span><span class="n">yZINB</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">poly</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">dist</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"negbin"</span><span class="p">)</span><span class="w">
</span><span class="c1">## create rootograms</span><span class="w">
</span><span class="n">root2.nb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rootogram</span><span class="p">(</span><span class="n">glm2.nb</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"hanging"</span><span class="p">,</span><span class="w"> </span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">root2.zinb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rootogram</span><span class="p">(</span><span class="n">glm2.zinb</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"hanging"</span><span class="p">,</span><span class="w"> </span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<p>
First, we look at the negative binomial data and compare rootograms of the Poisson and negative binomial model fits
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot_grid</span><span class="p">(</span><span class="n">autoplot</span><span class="p">(</span><span class="n">root.nb</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylims</span><span class="p">,</span><span class="w"> </span><span class="n">autoplot</span><span class="p">(</span><span class="n">root2.nb</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylims</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"auto"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/rootograms-plot-negbin-rootos-1.png" alt="Hanging rootograms for Poisson GLM (a) and negative binomial model (b) fits to the simulated negative binomial count data" />
<figcaption>
Hanging rootograms for Poisson GLM (a) and negative binomial model (b) fits to the simulated negative binomial count data
</figcaption>
</figure>
<p>
The rootogram for the negative binomial GLM fit (panel b) shows much better agreement with the data than that of the Poisson fit (panel a). Departures from expected counts are much smaller and the zero-count bin is much better fitted. Some small deviations from the observed data remain but that is to be expected.
</p>
<p>
Next we compare rootograms for the fits of the Poisson GLM and ZINB model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ylims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ylim</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">8.5</span><span class="p">)</span><span class="w">
</span><span class="n">plot_grid</span><span class="p">(</span><span class="n">autoplot</span><span class="p">(</span><span class="n">root.zinb</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylims</span><span class="p">,</span><span class="w"> </span><span class="n">autoplot</span><span class="p">(</span><span class="n">root2.zinb</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylims</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"auto"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/rootograms-plot-zinb-rootos-1.png" alt="Hanging rootograms for Poisson GLM (a) and zero-inflated negative binomial model (b) fits to the simulated zero-inflated negative binomial count data" />
<figcaption>
Hanging rootograms for Poisson GLM (a) and zero-inflated negative binomial model (b) fits to the simulated zero-inflated negative binomial count data
</figcaption>
</figure>
<p>
The rootogram for the ZINB model (panel b) shows better agreement with the zero-count bin than the Poisson model (panel a), though the fits for the remaining count bins are similar in the two models. In particular, the ZINB model is still over predicting single counts.
</p>
<p>
Suspended rootograms are also recommended by <span class="citation" data-cites="Kleiber2016-pt">Kleiber and Zeileis (2016)</span>. These rootograms show the <em>difference</em> between observed and expected counts, with bars hanging from the zero-line rather than from the expected count line. We can therefore think of this style as conveying information about the model residuals, rather than about the fitted values as the hanging rootogram does. A suspended rootogram is produced using <code>style = "suspended"</code> and an example, for the ZINB model, is shown below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">autoplot</span><span class="p">(</span><span class="n">rootogram</span><span class="p">(</span><span class="n">glm2.zinb</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"suspended"</span><span class="p">,</span><span class="w"> </span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/rootograms-plot-zinb-suspended-1.png" alt="Suspended rootogram for a zero-inflated negative binomial model fitted to the simulated zero-inflated negative binomial count data" />
<figcaption>
Suspended rootogram for a zero-inflated negative binomial model fitted to the simulated zero-inflated negative binomial count data
</figcaption>
</figure>
<p>
Standing rootograms are not recommended by <span class="citation" data-cites="Kleiber2016-pt">Kleiber and Zeileis (2016)</span> as they simply show the expected and observed counts, and the user then has to compare the height of each bar with the expected curve for each bin. By tying the bars to the expected curve or to the zero reference line in hanging or suspended rootograms, the assessment of fit is made by comparing deviations from the reference line rather than by bin-by-bin comparison of observed and expected counts. A standing rootogram, for completeness, is shown below for the ZINB model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">autoplot</span><span class="p">(</span><span class="n">rootogram</span><span class="p">(</span><span class="n">glm2.zinb</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"standing"</span><span class="p">,</span><span class="w"> </span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/rootograms-plot-zinb-standing-1.png" alt="Standing rootogram for a zero-inflated negative binomial model fitted to the simulated zero-inflated negative binomial count data" />
<figcaption>
Standing rootogram for a zero-inflated negative binomial model fitted to the simulated zero-inflated negative binomial count data
</figcaption>
</figure>
<p>
A neat feature of the <em>countreg</em> package is that rootograms can be combined using the <code>c()</code> or <code>cbind()</code> methods, which makes plotting multiple rootograms much simpler than I showed above. For example, to compare the Poisson and negative binomial model fits to the negative binomial counts one could have used
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">autoplot</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">root.nb</span><span class="p">,</span><span class="w"> </span><span class="n">root2.nb</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/rootograms-plot-concatenated-rootograms-1.png" alt="Result of plotting two rootograms that were combined using cbind()" />
<figcaption>
Result of plotting two rootograms that were combined using <code>cbind()</code>
</figcaption>
</figure>
<p>
So, there we go; these are rootograms and they seem like a pretty useful tool for assessing fits of count models. I really recommend having a look at <span class="citation" data-cites="Kleiber2016-pt">Kleiber and Zeileis (2016)</span> as it contains much more discussion and illustration of the proposed rootograms than I could possibly include here. They also have a nice ecological example of data from an investigation into horseshoe crab mating, plus two other examples. Their paper will shortly appear in the journal <em>The American Statistician</em>, although at the time of writing I don't have citation details for that version of the paper.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Kleiber2016-pt">
<p>
Kleiber, C., and Zeileis, A. (2016). Visualizing count data regressions using rootograms. Available at: <a href="http://arxiv.org/abs/1605.01311">http://arxiv.org/abs/1605.01311</a>.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
I'm kind of glossing over the fact that a quadratic function of <em>x</em> is not really the true model here, which is a generalised beta response function. This kind of sets up a follow-up post using a GAM fit…<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
Harvesting more Canadian climate data
Gavin L. Simpson
2016-05-24T00:00:00-06:00
2016-05-24T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/05/24/harvesting-more-canadian-climate-data/
<p>
A <a href="/2015/01/14/harvesting-canadian-climate-data/">while back I wrote</a> some code to download climate data from the Government of Canada's historical climate/weather data website for one of our students. In May this year (2016) the Government of Canada changed their website a little: the API that responds to requests had changed URL and some of the GET parameters had also changed. In fixing those functions I also noted that the original code only downloaded hourly data, and not all useful weather variables are recorded hourly; precipitation, for example, is only available in the daily and monthly data formats. This post updates the earlier one, explaining what changed and how the code has been updated. As an added benefit, the functions can now handle downloading daily and monthly data files as well as the hourly files that the original could handle.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/screenshot-gc-climate-website.jpeg" title="Screenshot of Government of Canada's climate website" alt="Screenshot of Government of Canada's climate website" />
<figcaption>
Screenshot of Government of Canada's climate website
</figcaption>
</figure>
<p>
The <code>genURLS()</code> function now has an extra argument <code>timeframe</code> which allows you to select which type of data to download, defaulting to hourly data:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">genURLS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">,</span><span class="w"> </span><span class="n">timeframe</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"hourly"</span><span class="p">,</span><span class="w"> </span><span class="s2">"daily"</span><span class="p">,</span><span class="w"> </span><span class="s2">"monthly"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">years</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">nyears</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">years</span><span class="p">)</span><span class="w">
</span><span class="n">timeframe</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">match.arg</span><span class="p">(</span><span class="n">timeframe</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">all.equal</span><span class="p">(</span><span class="n">timeframe</span><span class="p">,</span><span class="w"> </span><span class="s2">"hourly"</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">years</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">years</span><span class="p">,</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w">
</span><span class="n">months</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nyears</span><span class="p">)</span><span class="w">
</span><span class="n">ids</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">nyears</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">all.equal</span><span class="p">(</span><span class="n">timeframe</span><span class="p">,</span><span class="w"> </span><span class="s2">"daily"</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">months</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="c1"># this is essentially arbitrary & ignored if daily</span><span class="w">
</span><span class="n">ids</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">nyears</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">years</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">start</span><span class="w"> </span><span class="c1"># again arbitrary, for monthly it just gives you all data</span><span class="w">
</span><span class="n">months</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="c1"># and this is also ignored</span><span class="w">
</span><span class="n">ids</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">id</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">timeframe</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">match</span><span class="p">(</span><span class="n">timeframe</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"hourly"</span><span class="p">,</span><span class="w"> </span><span class="s2">"daily"</span><span class="p">,</span><span class="w"> </span><span class="s2">"monthly"</span><span class="p">))</span><span class="w">
</span><span class="n">URLS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="s2">"http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID="</span><span class="p">,</span><span class="w"> </span><span class="n">id</span><span class="p">,</span><span class="w">
</span><span class="s2">"&Year="</span><span class="p">,</span><span class="w"> </span><span class="n">years</span><span class="p">,</span><span class="w">
</span><span class="s2">"&Month="</span><span class="p">,</span><span class="w"> </span><span class="n">months</span><span class="p">,</span><span class="w">
</span><span class="s2">"&Day=14"</span><span class="p">,</span><span class="w">
</span><span class="s2">"&format=csv"</span><span class="p">,</span><span class="w">
</span><span class="s2">"&timeframe="</span><span class="p">,</span><span class="w"> </span><span class="n">timeframe</span><span class="p">,</span><span class="w">
</span><span class="s2">"&submit=%20Download+Data"</span><span class="c1">## need this stoopid thing as of 11-May-2016</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="nf">list</span><span class="p">(</span><span class="n">urls</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">URLS</span><span class="p">,</span><span class="w"> </span><span class="n">ids</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ids</span><span class="p">,</span><span class="w"> </span><span class="n">years</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">years</span><span class="p">,</span><span class="w"> </span><span class="n">months</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">months</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">URLS</span><span class="p">)))</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
If we wanted all the data for 2014 for the Regina RCS station then we could generate the URLs we'd need to visit as follows
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">regina</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">genURLS</span><span class="p">(</span><span class="m">28011</span><span class="p">,</span><span class="w"> </span><span class="m">2014</span><span class="p">,</span><span class="w"> </span><span class="m">2014</span><span class="p">)</span><span class="w">
</span><span class="nf">length</span><span class="p">(</span><span class="n">regina</span><span class="o">$</span><span class="n">urls</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">regina</span><span class="o">$</span><span class="n">urls</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 12
[1] "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID=28011&Year=2014&Month=1&Day=14&format=csv&timeframe=1&submit=%20Download+Data"
[2] "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID=28011&Year=2014&Month=2&Day=14&format=csv&timeframe=1&submit=%20Download+Data"
[3] "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID=28011&Year=2014&Month=3&Day=14&format=csv&timeframe=1&submit=%20Download+Data"
[4] "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID=28011&Year=2014&Month=4&Day=14&format=csv&timeframe=1&submit=%20Download+Data"
[5] "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID=28011&Year=2014&Month=5&Day=14&format=csv&timeframe=1&submit=%20Download+Data"
[6] "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?stationID=28011&Year=2014&Month=6&Day=14&format=csv&timeframe=1&submit=%20Download+Data"</code></pre>
</figure>
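<p>
For the other two timeframes, far fewer URLs are needed. For example, judging from the <code>genURLS()</code> code above (I'm not showing the downloaded output here), requesting daily data yields one URL per year, each with <code>&timeframe=2</code> in the query string
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## daily data: one URL per year requested, rather than one per month
regina.daily <- genURLS(28011, 2013, 2014, timeframe = "daily")
length(regina.daily$urls)  # 2</code></pre>
</figure>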
<p>
The function that downloads and reads in the data is <code>getData()</code>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">getData</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">stations</span><span class="p">,</span><span class="w"> </span><span class="n">folder</span><span class="p">,</span><span class="w"> </span><span class="n">timeframe</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"hourly"</span><span class="p">,</span><span class="w"> </span><span class="s2">"daily"</span><span class="p">,</span><span class="w"> </span><span class="s2">"monthly"</span><span class="p">),</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">delete</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">timeframe</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">match.arg</span><span class="p">(</span><span class="n">timeframe</span><span class="p">)</span><span class="w">
</span><span class="c1">## form URLS</span><span class="w">
</span><span class="n">urls</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="nf">seq_len</span><span class="p">(</span><span class="n">NROW</span><span class="p">(</span><span class="n">stations</span><span class="p">)),</span><span class="w">
</span><span class="k">function</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">stations</span><span class="p">,</span><span class="w"> </span><span class="n">timeframe</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">genURLS</span><span class="p">(</span><span class="n">stations</span><span class="o">$</span><span class="n">StationID</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w">
</span><span class="n">stations</span><span class="o">$</span><span class="n">start</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w">
</span><span class="n">stations</span><span class="o">$</span><span class="n">end</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">timeframe</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">timeframe</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w"> </span><span class="n">stations</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stations</span><span class="p">,</span><span class="w"> </span><span class="n">timeframe</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">timeframe</span><span class="p">)</span><span class="w">
</span><span class="c1">## check the folder exists and try to create it if not</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="n">folder</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">warning</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"Directory:"</span><span class="p">,</span><span class="w"> </span><span class="n">folder</span><span class="p">,</span><span class="w">
</span><span class="s2">"doesn't exist. Will create it"</span><span class="p">))</span><span class="w">
</span><span class="n">fc</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">dir.create</span><span class="p">(</span><span class="n">folder</span><span class="p">))</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">inherits</span><span class="p">(</span><span class="n">fc</span><span class="p">,</span><span class="w"> </span><span class="s2">"try-error"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">stop</span><span class="p">(</span><span class="s2">"Failed to create directory '"</span><span class="p">,</span><span class="w"> </span><span class="n">folder</span><span class="p">,</span><span class="w">
</span><span class="s2">"'. Check path and permissions."</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## Extract the data from the URLs generation</span><span class="w">
</span><span class="n">URLS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="s1">'[['</span><span class="p">,</span><span class="w"> </span><span class="s2">"urls"</span><span class="p">))</span><span class="w">
</span><span class="n">sites</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="s1">'[['</span><span class="p">,</span><span class="w"> </span><span class="s2">"ids"</span><span class="p">))</span><span class="w">
</span><span class="n">years</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="s1">'[['</span><span class="p">,</span><span class="w"> </span><span class="s2">"years"</span><span class="p">))</span><span class="w">
</span><span class="n">months</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="s1">'[['</span><span class="p">,</span><span class="w"> </span><span class="s2">"months"</span><span class="p">))</span><span class="w">
</span><span class="c1">## filenames to use to save the data</span><span class="w">
</span><span class="n">fnames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste</span><span class="p">(</span><span class="n">sites</span><span class="p">,</span><span class="w"> </span><span class="n">years</span><span class="p">,</span><span class="w"> </span><span class="n">months</span><span class="p">,</span><span class="w"> </span><span class="s2">"data.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"-"</span><span class="p">)</span><span class="w">
</span><span class="n">fnames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">file.path</span><span class="p">(</span><span class="n">folder</span><span class="p">,</span><span class="w"> </span><span class="n">fnames</span><span class="p">)</span><span class="w">
</span><span class="n">nfiles</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">fnames</span><span class="p">)</span><span class="w">
</span><span class="c1">## set up a progress bar if being verbose</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">verbose</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">pb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">txtProgressBar</span><span class="p">(</span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nfiles</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="nf">on.exit</span><span class="p">(</span><span class="n">close</span><span class="p">(</span><span class="n">pb</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">out</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vector</span><span class="p">(</span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"list"</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nfiles</span><span class="p">)</span><span class="w">
</span><span class="n">hourlyNames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Date/Time"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Year"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Month"</span><span class="p">,</span><span class="s2">"Day"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Time"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Data Quality"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Temp Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Dew Point Temp (degC)"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Dew Point Temp Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Rel Hum (%)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Rel Hum Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Wind Dir (10s deg)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Wind Dir Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Wind Spd (km/h)"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Wind Spd Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Visibility (km)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Visibility Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Stn Press (kPa)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Stn Press Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Hmdx"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Hmdx Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Wind Chill"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Wind Chill Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Weather"</span><span class="p">)</span><span class="w">
</span><span class="n">dailyNames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Date/Time"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Year"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Month"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Day"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Data Quality"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Max Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Max Temp Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Min Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Min Temp Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Mean Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Mean Temp Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Heat Deg Days (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Heat Deg Days Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Cool Deg Days (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Cool Deg Days Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Total Rain (mm)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Total Rain Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Total Snow (cm)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Total Snow Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Total Precip (mm)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Total Precip Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Snow on Grnd (cm)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Snow on Grnd Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Dir of Max Gust (10s deg)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Dir of Max Gust Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Spd of Max Gust (10s deg)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Spd of Max Gust Flag"</span><span class="p">)</span><span class="w">
</span><span class="n">monthlyNames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Date/Time"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Year"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Month"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Mean Max Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Mean Max Temp Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Mean Min Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Mean Min Temp Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Mean Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Mean Temp Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Extr Max Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Extr Max Temp Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Extr Min Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Extr Min Temp Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Total Rain (mm)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Total Rain Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Total Snow (cm)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Total Snow Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Total Precip (mm)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Total Precip Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Snow Grnd Last Day (cm)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Snow Grnd Last Day Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Dir of Max Gust (10s deg)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Dir of Max Gust Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Spd of Max Gust (10s deg)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Spd of Max Gust Flag"</span><span class="p">)</span><span class="w">
</span><span class="n">cnames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">switch</span><span class="p">(</span><span class="n">timeframe</span><span class="p">,</span><span class="w"> </span><span class="n">hourly</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">hourlyNames</span><span class="p">,</span><span class="w"> </span><span class="n">daily</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dailyNames</span><span class="p">,</span><span class="w"> </span><span class="n">monthly</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">monthlyNames</span><span class="p">)</span><span class="w">
</span><span class="n">TIMEFRAME</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">match</span><span class="p">(</span><span class="n">timeframe</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"hourly"</span><span class="p">,</span><span class="w"> </span><span class="s2">"daily"</span><span class="p">,</span><span class="w"> </span><span class="s2">"monthly"</span><span class="p">))</span><span class="w">
</span><span class="n">SKIP</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="m">25</span><span class="p">,</span><span class="w"> </span><span class="m">18</span><span class="p">)[</span><span class="n">TIMEFRAME</span><span class="p">]</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">nfiles</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">curfile</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fnames</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="c1">## Have we downloaded the file before?</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="n">curfile</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># No: download it</span><span class="w">
</span><span class="n">dload</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">download.file</span><span class="p">(</span><span class="n">URLS</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">destfile</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">curfile</span><span class="p">,</span><span class="w"> </span><span class="n">quiet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">inherits</span><span class="p">(</span><span class="n">dload</span><span class="p">,</span><span class="w"> </span><span class="s2">"try-error"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># If problem, store failed URL...</span><span class="w">
</span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">URLS</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">verbose</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">setTxtProgressBar</span><span class="p">(</span><span class="n">pb</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="c1"># update progress bar...</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">next</span><span class="w"> </span><span class="c1"># bail out of current iteration</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## Must have downloaded, try to read file</span><span class="w">
</span><span class="c1">## skip first SKIP rows of header stuff</span><span class="w">
</span><span class="c1">## encoding must be latin1 or will fail - may still be problems with character set</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">read.csv</span><span class="p">(</span><span class="n">curfile</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SKIP</span><span class="p">,</span><span class="w"> </span><span class="n">encoding</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"latin1"</span><span class="p">,</span><span class="w"> </span><span class="n">stringsAsFactors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">),</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1">## Did we have a problem reading the data?</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">inherits</span><span class="p">(</span><span class="n">cdata</span><span class="p">,</span><span class="w"> </span><span class="s2">"try-error"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># yes handle read problem</span><span class="w">
</span><span class="c1">## try to fix the problem with dodgy characters</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readLines</span><span class="p">(</span><span class="n">curfile</span><span class="p">)</span><span class="w"> </span><span class="c1"># read all lines in file</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">iconv</span><span class="p">(</span><span class="n">cdata</span><span class="p">,</span><span class="w"> </span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"latin1"</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"UTF-8"</span><span class="p">)</span><span class="w">
</span><span class="n">writeLines</span><span class="p">(</span><span class="n">cdata</span><span class="p">,</span><span class="w"> </span><span class="n">curfile</span><span class="p">)</span><span class="w"> </span><span class="c1"># write the data back to the file</span><span class="w">
</span><span class="c1">## try to read the file again, if still an error, bail out</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">read.csv</span><span class="p">(</span><span class="n">curfile</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SKIP</span><span class="p">,</span><span class="w"> </span><span class="n">encoding</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"UTF-8"</span><span class="p">,</span><span class="w"> </span><span class="n">stringsAsFactors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">),</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">inherits</span><span class="p">(</span><span class="n">cdata</span><span class="p">,</span><span class="w"> </span><span class="s2">"try-error"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># yes, still!, handle read problem</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">delete</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">file.remove</span><span class="p">(</span><span class="n">curfile</span><span class="p">)</span><span class="w"> </span><span class="c1"># remove file if a problem & deleting</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">URLS</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="c1"># record failed URL...</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">verbose</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">setTxtProgressBar</span><span class="p">(</span><span class="n">pb</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="c1"># update progress bar...</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">next</span><span class="w"> </span><span class="c1"># bail out of current iteration</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## Must have (eventually) read file OK, add station data</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind.data.frame</span><span class="p">(</span><span class="n">StationID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">sites</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">NROW</span><span class="p">(</span><span class="n">cdata</span><span class="p">)),</span><span class="w">
</span><span class="n">cdata</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">cdata</span><span class="p">)[</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cnames</span><span class="w">
</span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cdata</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">verbose</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># Update the progress bar</span><span class="w">
</span><span class="n">setTxtProgressBar</span><span class="p">(</span><span class="n">pb</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">out</span><span class="w"> </span><span class="c1"># return</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
The main infelicity is that you have to supply <code>getData()</code> with a data frame containing the station IDs and the start and end years for the data you want to collect. This suited my needs, as we wanted to grab data from 10 stations with different start and end years as required to track station movements. It's not as convenient if you only want to grab the data for a single station, however.
</p>
<p>
<code>getData()</code> gains the same <code>timeframe</code> argument as <code>genURLS()</code>. In addition, to handle the, quite frankly, odd choice of characters used in the various flag columns, I now convert the file encoding from <code>latin1</code> to <code>UTF-8</code> using the <code>iconv()</code> function. Whether this works portably remains to be seen – I'm not that familiar with file encodings. If it doesn't work, an option would be to determine what the user's locale is and from that convert the encoding to the native one.
</p>
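<p>
If you wanted to experiment with that locale-based fallback, a minimal, untested sketch might look like the following; <code>curfile</code> is a placeholder for the path to one of the downloaded CSVs
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## sketch only: convert a downloaded file to the native charset
enc <- localeToCharset(Sys.getlocale("LC_CTYPE"))[1] # native charset, e.g. "UTF-8"
raw <- readLines(curfile, encoding = "latin1")       # read assuming latin1
writeLines(iconv(raw, from = "latin1", to = enc), curfile)</code></pre>
</figure>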
<p>
One thing you'll note quickly if you start downloading data using this function is that the web script the Government of Canada is using on their climate website will quite happily generate a fully-formed file containing no actual data (but with all the headers, hourly time stamps, etc) if you ask it for data outside the window of observations for a given station. There are no errors, just lots of mostly empty files, bar the header and labels.
</p>
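<p>
A cheap guard against these empty-but-valid files, should you want one, is to test whether every data column is empty after reading. Something like this hypothetical helper would do; the metadata column names here are assumptions based on the daily files
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## hypothetical helper: TRUE if every non-metadata column is all NA or ""
emptyFile <- function(df, metaCols = c("StationID", "Date/Time", "Year",
                                       "Month", "Day", "Time")) {
    dataCols <- setdiff(names(df), metaCols)
    all(vapply(df[dataCols],
               function(x) all(is.na(x) | x == ""),
               logical(1)))
}</code></pre>
</figure>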
<p>
One other thing to note is that <code>getData()</code> returns the downloaded data as a list, and no attempt is made to flatten the individual components to a single large data frame. That's because it allows for any failed data downloads (or reads) and records the failed URL instead of the data. This gives you a chance to manually check those URLs to see what the problem might be before re-running the job, which, because we saved all the CSVs, will run very quickly from that local cache.
</p>
<p>
The use of <code>data.frame</code>s internally is showing signs of being a bit of a bottleneck performance-wise; <code>rbind()</code>-ing many stations or files of data takes a long time. I plan on changing the code to use <code>tbl_df</code>s now that Hadley has moved that functionality to the <strong>tibble</strong> package. I am reliably informed that <code>bind_rows()</code> is much quicker.
</p>
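<p>
For what it's worth, a sketch of that swap, assuming you have <strong>dplyr</strong> installed, and remembering to first drop any failed URLs (stored as character vectors) from the list
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("dplyr")
ok  <- !vapply(out, is.character, logical(1)) # out is the list from getData()
met <- bind_rows(out[ok])                     # much faster than do.call("rbind", ...)</code></pre>
</figure>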
<p>
The eagle-eyed among you will notice the dreaded <code>stringsAsFactors = FALSE</code> in the definition of <code>getData()</code>. I'm beginning to see why people who work with messy data find the default <code>stringsAsFactors = TRUE</code> downright abhorrent!
</p>
<p>
To see <code>getData()</code> in action, we'll run a quick job, downloading the 2014 data for two stations
</p>
<ul>
<li>
Regina INTL A (51441)
</li>
<li>
Indian Head CDA (2925)
</li>
</ul>
<p>
First we create a data frame of station information
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">stations</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">StationID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">51441</span><span class="p">,</span><span class="w"> </span><span class="m">2925</span><span class="p">),</span><span class="w">
</span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">2014</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
</span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">2014</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span></code></pre>
</figure>
<p>
Then we pass this to <code>getData()</code> with the path to the folder we wish to cache downloaded CSVs in
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">met</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">getData</span><span class="p">(</span><span class="n">stations</span><span class="p">,</span><span class="w"> </span><span class="n">folder</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"./csv"</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning in getData(stations, folder = "./csv", verbose = FALSE):
Directory: ./csv doesn't exist. Will create it</code></pre>
</figure>
<p>
This will take a few minutes to run, even for just 24 files, as the site is not the quickest to respond to requests (or perhaps they are now throttling my workstation's IP?). Note I turned off the printing of the progress bar here, only because it doesn't play nicely with <strong>knitr</strong>'s capturing of the output. In real use, you'll want to leave the progress bar on (which it is by default) so you can see how long you have to wait until the job is done.
</p>
<p>
Once this has finished, we can quickly determine if there were any failures
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="nf">any</span><span class="p">(</span><span class="n">failed</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">met</span><span class="p">,</span><span class="w"> </span><span class="n">is.character</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] FALSE</code></pre>
</figure>
<p>
If any had failed, the <code>failed</code> logical vector could be used to index into <code>met</code> to extract the URLs that encountered problems, e.g.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">unlist</span><span class="p">(</span><span class="n">met</span><span class="p">[</span><span class="n">failed</span><span class="p">])</span></code></pre>
</figure>
<p>
If there were no problems, then the components of <code>met</code> can be bound into a data frame using <code>rbind()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">met</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="s2">"rbind"</span><span class="p">,</span><span class="w"> </span><span class="n">met</span><span class="p">)</span></code></pre>
</figure>
<p>
The data now looks like this
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">head</span><span class="p">(</span><span class="n">met</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> StationID Date/Time Year Month Day Time Data Quality Temp (degC)
1 51441 2014-01-01 00:00 2014 1 1 00:00 \u0087 -23.3
2 51441 2014-01-01 01:00 2014 1 1 01:00 \u0087 -23.1
3 51441 2014-01-01 02:00 2014 1 1 02:00 \u0087 -22.8
4 51441 2014-01-01 03:00 2014 1 1 03:00 \u0087 -23.3
5 51441 2014-01-01 04:00 2014 1 1 04:00 \u0087 -24.3
6 51441 2014-01-01 05:00 2014 1 1 05:00 \u0087 -24.3
Temp Flag Dew Point Temp (degC) Dew Point Temp Flag Rel Hum (%)
1 -26.3 77
2 -26.1 77
3 -25.8 77
4 -26.3 77
5 -27.1 78
6 -27.0 79
Rel Hum Flag Wind Dir (10s deg) Wind Dir Flag Wind Spd (km/h)
1 13 <NA> 22
2 12 <NA> 26
3 12 <NA> 22
4 13 <NA> 18
5 13 <NA> 14
6 9 <NA> 6
Wind Spd Flag Visibility (km) Visibility Flag Stn Press (kPa)
1 19.3 <NA> 95.38
2 24.1 <NA> 95.38
3 24.1 <NA> 95.39
4 24.1 <NA> 95.47
5 24.1 <NA> 95.56
6 24.1 <NA> 95.60
Stn Press Flag Hmdx Hmdx Flag Wind Chill Wind Chill Flag
1 NA NA -35 NA
2 NA NA -36 NA
3 NA NA -35 NA
4 NA NA -34 NA
5 NA NA -34 NA
6 NA NA -30 NA
Weather
1 Snow,Blowing Snow
2 Snow,Blowing Snow
3 Snow,Blowing Snow
4 Snow,Blowing Snow
5 Snow
6 <NA></code></pre>
</figure>
<p>
Yep, still a bit of a mess; some post-processing is required if you want tidy <code>names</code> etc. The column names are hardcoded but retain the messy names given to them by the Government of Canada's webmaster. Cleaning up afterwards remains advised.
</p>
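<p>
As a hedged illustration of the sort of post-processing I mean, the following would knock the worst of the mess out of the column names; it simply lower-cases them and swaps the awkward characters for underscores
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## sketch: tidy the messy column names in place
nms <- tolower(names(met))
nms <- gsub("[ /()%]+", "_", nms) # spaces, slashes, parens, % -> _
nms <- gsub("_+$", "", nms)       # drop any trailing underscores
names(met) <- nms</code></pre>
</figure>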
<p>
A final note: I could have run this over all the cores in my workstation, or even on all the computers in my small computer cluster, but I didn't, instead choosing to run on a single core overnight to get the data we needed. Please be a good netizen if you do use the functions I've discussed here, as other people will no doubt want to access the Government of Canada's website. Don't flood the site with requests!
</p>
<p>
If you have any suggestions for improvements or changes, let me know in the comments. The latest versions of the <code>genURLS()</code> and <code>getData()</code> functions can be found in this Github <a href="https://gist.github.com/gavinsimpson/8c13e3c5f905fd67cf85">gist</a>.
</p>
A new default plot for multivariate dispersions
Gavin L. Simpson
2016-04-17T00:00:00-06:00
2016-04-17T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/04/17/new-plot-default-for-betadisper/
<p>
This weekend, prompted by a pull request from Michael Friendly, I finally got round to improving the <code>plot</code> method for <code>betadisper()</code> in the <strong>vegan</strong> package. <code>betadisper()</code> is an implementation of Marti Anderson's <span class="smallcaps">Permdisp</span> method, a multivariate analogue of Levene's test for homogeneity of variances. In improving the default plot and allowing customisation of plot features, I was reminded of how much I dislike programming plot functions that use base graphics. But don't worry, this isn't going to degenerate into a <strong>ggplot</strong> love-in nor a <a href="http://varianceexplained.org/r/why-I-use-ggplot2/">David Robinson-esque dig</a> at <a href="http://simplystatistics.org/2016/02/11/why-i-dont-use-ggplot2/">Jeff Leek</a>.
</p>
<p>
The original <code>plot</code> method for <code>betadisper()</code> hardcoded all the linetypes, colours, etc. for features on the plot. I didn't mind this one bit; ordination plots are difficult to programme, and, to get anything half-way publishable, the user will usually need to build a plot up from component parts using the low-level tools we provide. Also, it's kind of a theme in <strong>vegan</strong> to provide a useful, but not necessarily pretty, default plot for our <code>plot</code> methods, whilst allowing for all manner of customisation via lower-level methods like <code>points()</code> and <code>lines()</code>, plus custom tools such as <code>ordiellipse()</code> and <code>ordiarrows()</code>.
</p>
<p>
However, in practice it seems users aren't always satisfied with this situation and expect default plots to be, well, <em>more</em>.
</p>
<p>
In its original incarnation, <code>plot.betadisper()</code> showed data points and group centroids embedded in a principal coordinates-derived Euclidean space, with convex hulls enclosing each group's data points and line segments joining data points with their respective centroid. Centroids were in red, segments blue, and hulls black, all of which were hard-coded. More egregiously, the plot didn't provide any indication of which group was which. I was OK with this as the principal coordinates plot was only really meant as a visualisation of what the method did; other plots and analyses that we provided in <strong>vegan</strong> were needed to assess significance of differences in dispersions etc.
</p>
<p>
There was nothing stopping me, however, from providing a more featureful version with full user control over the various aspects of the plot. Nothing, that is, except a deep reluctance to write – in the first place – and then subsequently maintain a function with a gabillion tortuously named arguments to differentiate the half dozen settings of <code>cex</code> <em>et al.</em> for different features.
</p>
<p>
There's a real trade-off between flexibility and complexity in <code>plot</code> methods like this. The situation is much easier to manage with lower-level functions to draw the individual features of the plot; invariably each lower-level tool requires a smaller subset of parameters, and if you code your function well, you can usually achieve all you need by passing <code>...</code> on to the low-level base graphics functions your function uses. You can't do this with a <code>plot</code> method that combines several lower-level features into a single plot; if you want to allow the user to independently control the colour of three separate plot features, you're going to need three different variations on the argument <code>col</code>. Multiply that by all the parameters you want to allow the user to tweak, and you have the recipe for a mess. Either that, or you need to accept lists of parameters for each feature, which aren't exactly intuitive for casual users.
</p>
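<p>
To make the contrast concrete, here is a sketch of the low-level idiom described above; <code>addCentroids()</code> is a hypothetical helper, but the pattern of forwarding <code>...</code> untouched to a single base graphics call is the real point
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## hypothetical low-level helper: one feature, one graphics call, so
## `...` (col, pch, cex, and friends) can be passed straight through
addCentroids <- function(x, choices = c(1, 2), ...) {
    points(x$centroids[, choices, drop = FALSE], ...)
}</code></pre>
</figure>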
<p>
With the new <code>plot.betadisper()</code> method I took a compromise position, allowing some additional flexibility whilst limiting the argument bloat that is an unfortunate side effect of high-level base graphics <code>plot</code> methods.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## you'll need the development version of vegan from github for this</span><span class="w">
</span><span class="c1">## devtools::install_github("vegandevs/vegan")</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"vegan"</span><span class="p">)</span><span class="w">
</span><span class="n">args</span><span class="p">(</span><span class="n">vegan</span><span class="o">:::</span><span class="n">plot.betadisper</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">function (x, axes = c(1, 2), cex = 0.7, pch = seq_len(ng), col = NULL,
lty = "solid", lwd = 1, hull = TRUE, ellipse = FALSE, ellipse.type = c("sd",
"se"), ellipse.conf = NULL, segments = TRUE, seg.col = "grey",
seg.lty = lty, seg.lwd = lwd, label = TRUE, label.cex = 1,
ylab, xlab, main, sub, ...)
NULL</code></pre>
</figure>
<p>
Michael Friendly <a href="https://github.com/vegandevs/vegan/pull/165">supplied code</a> to allow some of the original plotting parameters to take vectors, one per group, to facilitate their differentiation. I extended this to allow a couple more standard parameters to be set by the user. Rather than have separate settings for convex hulls and confidence ellipses, both use the same general parameters. Only the line segments between data points and their centroid get any special treatment, mainly because they add quite a lot of components to the plot and being able to style them to sit in the background is quite useful.
</p>
<p>
We'll look at the new plot using the main example in <code>?betadisper</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">data</span><span class="p">(</span><span class="n">varespec</span><span class="p">)</span><span class="w"> </span><span class="c1"># load example data </span><span class="w">
</span><span class="n">dis</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vegdist</span><span class="p">(</span><span class="n">varespec</span><span class="p">)</span><span class="w"> </span><span class="c1"># Bray-Curtis distances between samples</span><span class="w">
</span><span class="c1">## First 16 sites grazed, remaining 8 sites ungrazed</span><span class="w">
</span><span class="n">groups</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">16</span><span class="p">),</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">8</span><span class="p">)),</span><span class="w"> </span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"grazed"</span><span class="p">,</span><span class="s2">"ungrazed"</span><span class="p">))</span><span class="w">
</span><span class="n">mod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">betadisper</span><span class="p">(</span><span class="n">dis</span><span class="p">,</span><span class="w"> </span><span class="n">groups</span><span class="p">)</span><span class="w"> </span><span class="c1"># Calculate multivariate dispersions</span></code></pre>
</figure>
<p>
Given <code>mod</code>, the <code>plot</code> method produces a labelled plot with convex hulls and line segments
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/new-plot-default-for-betadisper-default-plot-1.png" alt="The new default plot produced by plot.betadisper()" />
<figcaption>
The new default plot produced by <code>plot.betadisper()</code>
</figcaption>
</figure>
<p>
Also at the <a href="https://github.com/vegandevs/vegan/issues/166">suggestion</a> of Michael Friendly, I added code to draw confidence ellipses, of which there are several flavours
</p>
<ul>
<li>
standard deviation ellipses
</li>
<li>
standard error ellipses
</li>
</ul>
<p>
with the default being to draw a 1 standard deviation ellipse (<code>ellipse.conf</code> controls how many standard deviations or standard errors are drawn, or which 1 - α confidence ellipse is drawn).
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">hull</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">ellipse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/new-plot-default-for-betadisper-plot-with-confidence-intervals-1.png" alt="An alternate plot produced by plot.betadisper() showing 1 standard deviation ellipses about the group medians." />
<figcaption>
An alternate plot produced by <code>plot.betadisper()</code> showing 1 standard deviation ellipses about the group medians.
</figcaption>
</figure>
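<p>
So, for example, a 95% confidence ellipse on the standard error scale can be requested using the arguments shown in the <code>args()</code> listing earlier; something like
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">plot(mod, hull = FALSE, ellipse = TRUE,
     ellipse.type = "se", ellipse.conf = 0.95)</code></pre>
</figure>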
<p>
As a default plot, the new version is a lot nicer and affords the user a reasonable level of flexibility to customise the plot without the number of arguments exploding uncontrollably. The code used to produce this is now a good deal more complex, and because I grafted it on to the existing code it probably isn't as clean or efficient as it could be.
</p>
<p>
The new function also reaffirms my dislike of providing high-level plot functions for a package that uses base graphics. As a means for producing plots, I like base graphics for certain things. However, I'm also comfortable building plots up from low-level parts and can easily write code to quickly produce the plot I want. Clearly, from the emails and questions I receive, not all the users of <code>betadisper()</code> are so able or inclined. Providing a reasonable level of customisation to a higher-level plot using base graphics is an exercise in tediousness and inelegance. It doesn't look <em>nice</em> to add dozens of arguments just to enable the user to tweak a dozen tiny features of the plot. I also find it demotivating writing code like this and the accompanying documentation.
</p>
<p>
In this regard, <strong>ggplot</strong> is a much better system for producing customisable higher-level plots. All of the code for handling grouping, colours, line types, etc. is built into aesthetics and geoms, and a theme or customised palette or scale (such as the increasingly popular one supplied by the <strong>viridis</strong> package) allows a concise and principled way of changing the look and feel of a plot that transfers across <em>all</em> plots created using <strong>ggplot</strong>. If you want to customise <code>plot.betadisper</code>'s output, you need to learn the half dozen particular arguments that I chose to implement. Yet once learned, are these skills useful elsewhere? If you're lucky, you can expect some semblance of consistency across a package, but beyond that, the user ends up having to learn the particulars of the plotting functions in each of the packages they end up using.
</p>
<p>
This is wasted effort and a considerable obstacle to overcome as a new R user. It's taken me a while – largely because on its own <strong>ggplot</strong> lacks features needed for every-day use by an academic – to realise this, but I'm glad I have. If anything, whilst I am pleased with the changes made to <code>plot.betadisper()</code>, my resolve to spend more time working on <strong>ggvegan</strong> over the summer has strengthened as a direct result of writing this base graphics code.
</p>
<p>
I never expected to find myself writing that…
</p>
LOESS revisited
Gavin L. Simpson
2016-04-10T00:00:00-06:00
2016-04-10T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/04/10/loess-revisited/
<p>
It's fair to say I have gotten a bee<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> in my bonnet about how palaeolimnologists handle time. For a group of people for whom time is everything, we sure do a poor job (in general) of dealing with it when it comes time to analyse our data. In many instances, "poor job" means making no attempt at all to account for the special nature of the time series. LOESS comes in for particular criticism because it is widely used by palaeolimnologists despite not being particularly suited to the task. Why this is so is perhaps down to its promotion in influential books, papers, and software. I am far from innocent in this regard, having taught LOESS and its use for many years on the now-defunct ECRC Numerical Course. Here I want to look at further problems with our use of LOESS, and will argue that we need to consign it to the trash can for all but exploratory analyses. I will begin the case for the prosecution with one of my own transgressions.
</p>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
an entire hive is perhaps more apt!<a href="#fnref1" class="footnote-back">ā©</a>
</p>
</li>
</ol>
</section>
<p>
Itās fair to say I have gotten a bee<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> in my bonnet about how palaeolimnologists handle time. For a group of people for whom time is everything, we sure do a poor job (in general) of dealing with it in when it comes time to analyse our data. In many instances, āpoor jobā means making no attempt at all to account for the special nature of the time series. LOESS comes in for particular criticism because it is widely used by palaeolimnologists despite not being particularly suited to the task. Why this is so is perhaps due to itās promotion in influential books, papers, and software. I am far from innocent in this regard having taught LOESS and itās use for many years on the now-defunct ECRC Numerical Course. Here I want to look at further problems with our use of LOESS, and will argue that we need to resign it to the trash can for all but exploratory analyses. I will begin the case for the prosecution with one of my own transgressions.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/dper-concensus-reconstruction.png" alt="Consensus reconstruction based upon the reconstructed pH values for the fossil samples in the Round Loch of Glenhead (RLGH3) core of all three reconstruction methods; one-component weighted averaging partial least squares model (WAPLS(1)), maximum likelihood (ML) and modern analogue technique (MAT)). The consensus reconstruction has been generated using a LOESS smoother fitted to the inferred pH values as a function of sample age with a span of 0.1. Reproduced from Figure 19.3 from Simpson and Hall (2012)." />
<figcaption>
Consensus reconstruction based upon the reconstructed pH values for the fossil samples in the Round Loch of Glenhead (RLGH3) core from all three reconstruction methods: one-component weighted averaging partial least squares (WAPLS(1)), maximum likelihood (ML), and the modern analogue technique (MAT). The consensus reconstruction has been generated using a LOESS smoother fitted to the inferred pH values as a function of sample age with a span of 0.1. Reproduced from Figure 19.3 from <span class="citation" data-cites="Simpson2012-zi">Simpson and Hall (2012)</span>.
</figcaption>
</figure>
<p>
The figure above comes from one of the chapters I wrote in the infamous Numerical Methods book in the Developments in Paleoenvironmental Research series <span class="citation" data-cites="Simpson2012-zi">(Simpson and Hall, 2012)</span>. The aim here was to show the common pattern in reconstructed pH using three different calibration methods. In my defence, this was intended as a diagnostic plot, but this may not have been clear in the text. I'd certainly be embarrassed if anyone took this usage to be any indication of how to go about hypothesis "testing" on the trend in reconstructed pH.
</p>
<p>
There are two things wrong with this plot/usage
</p>
<ol type="1">
<li>
The usual problem: no justification for the span used (0.1)
</li>
<li>
Failure to account for between-method variation
</li>
</ol>
<p>
If you are doing an exploratory analysis, the choice of span is somewhat arbitrary; it doesn't really matter what you use, and you might use several spans to get a feeling for potential features that may be present in the data. However, if you are planning on using the trends or features identified in this exploratory analysis to support some idea or hypothesis, then you're going to get into a world of trouble.
</p>
<p>
First, there's the potential for over-fitting. This is actually quite high with palaeo data, as all but the most skeletal of sequences will have some amount of autocorrelation; <a href="/2012/07/24/whats-wrong-with-loess-for-palaeo-data/">something I've covered before</a>.
</p>
<p>
Second, just because you get a particular "fit" using this span, it doesn't mean the identified features in the smoother are significant. Can they be distinguished from the noisy background? Answering this question requires estimates of the uncertainty in the fitted function and calculation of the derivatives of the fitted smooth curve. The first derivative of the fitted smooth is equivalent to the slope (or coefficient) of a simple linear regression. In this model, we assess whether the estimated slope is consistent with the null hypothesis of a trend of <strong>0</strong> using a <em>t</em> statistic, which is the value of the slope estimate divided by its uncertainty (the standard error). Conceptually, we can think of this as forming a 100(1 - α)% confidence interval and asking if 0 (the null hypothesis slope value) is included within this interval.
</p>
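<p>
If the linear regression analogy helps, here is a toy version with simulated data; the <em>t</em> statistic is just the estimate divided by its standard error, and <code>confint()</code> gives the interval we check 0 against
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">set.seed(1)
x <- 1:50
y <- 0.05 * x + rnorm(50)          # noisy linear trend
m <- lm(y ~ x)
coef(summary(m))["x", "t value"]   # slope estimate / standard error
confint(m, "x", level = 0.95)      # does this interval contain 0?</code></pre>
</figure>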
<p>
The equivalent for smoothers and splines is to compute the first derivative of the fitted smooth. Doing this analytically is often not straightforward, but we can use the method of finite differences to <a href="/2014/05/15/identifying-periods-of-change-with-gams/">approximate the first derivative of the smoother</a>. Using <a href="/2014/05/15/identifying-periods-of-change-with-gams/">standard errors of the derivative</a> or <a href="/2014/06/16/simultaneous-confidence-intervals-for-derivatives/">posterior simulation</a> we can compute confidence intervals on the derivatives and thus determine where along the curve there is sufficient evidence to reject the null hypothesis of no trend.
</p>
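<p>
A bare-bones sketch of the finite-difference step, reusing the toy data from the chunk above but refitted with a penalised spline, might be
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("mgcv")
m   <- gam(y ~ s(x), method = "REML")
eps <- 1e-4                                 # small shift in x
f0  <- predict(m, data.frame(x = x))
f1  <- predict(m, data.frame(x = x + eps))
deriv1 <- (f1 - f0) / eps                   # approximate first derivative
## confidence intervals on deriv1 need the posterior of the model,
## via predict(..., type = "lpmatrix"); see the linked posts</code></pre>
</figure>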
<p>
The linked posts explain this process and illustrate it using generalised additive models. The key point to remember, though, is that the model fitted to the data, even a LOESS one, is <em>uncertain</em>; it contains a degree of uncertainty because we've estimated things from the sample of data we happened to collect. As a result, it is inappropriate to simply interpret a fitted trend as is, without also considering the uncertainty in the estimation of the trend.
</p>
<p>
The other important thing that often gets overlooked is the <em>bias-variance trade-off</em>. If you fit a wiggly trend as compared to a smooth trend, all things being equal, the wiggly one will have lower bias and higher variance, and the smooth one higher bias and lower variance. Here, by <em>variance</em> we mean <em>uncertainty</em>; change the data a bit and high-variance fits will change a lot, hence the high uncertainty. With LOESS smoothers, low span values fit potentially high-variance, low-bias models. Invariably these will be over-fitted, and highly uncertain, unless there is a lot of data from which to estimate such a wiggly trend and you've properly accounted for the stochastic properties of the data, such as any autocorrelation.
</p>
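<p>
The trade-off is easy to see with two LOESS fits to the same toy data from above; the small span chases the noise while the large one smooths through it
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">loWiggly <- loess(y ~ x, span = 0.25) # low bias, high variance
loSmooth <- loess(y ~ x, span = 0.75) # higher bias, low variance
plot(x, y)
lines(x, fitted(loWiggly), col = "red")
lines(x, fitted(loSmooth), col = "blue")</code></pre>
</figure>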
<p>
The other problem with the consensus reconstruction in the above figure is the failure to account for the between-method variance and the correlation between fitted values derived using the same calibration method. Such problems are commonly handled with a mixed effects model, but as we only have three "subjects", that isn't an option here. Ideally then, we'd fit three separate trends, one per method, plus a separate mean for each method. Then we could compare this model with one that had a separate mean per method but just a single common trend. The key point to remember here is that the residuals should not contain much or any trace of a trend, nor of which method was used. In the figure above this is clearly not the case, and as a result it is difficult to do formal inference on the fitted smoother.
</p>
<p>
A more recent example of the latter point is <span class="citation" data-cites="Hobbs2016-mr">Hobbs et al. (2016)</span>. Below, I reproduce figures 6 and 9 from the paper <span class="citation" data-cites="Hobbs2016-mr">(Hobbs et al., 2016)</span>
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/hobbs-et-al-figure-6.png" alt="DCA axis 1 scores for all 19 lakes with diatom paleoecological records. LOESS smooth curve for each park area shows the general trend of diatom community turnover through time. Shaded bars represent the timing of significant shifts in the diatom assemblages (details in the supporting information). Reproduced from Figure 6 from Hobbs et al. (2016)." />
<figcaption>
DCA axis 1 scores for all 19 lakes with diatom paleoecological records. LOESS smooth curve for each park area shows the general trend of diatom community turnover through time. Shaded bars represent the timing of significant shifts in the diatom assemblages (details in the supporting information). Reproduced from Figure 6 from <span class="citation" data-cites="Hobbs2016-mr">Hobbs et al. (2016)</span>.
</figcaption>
</figure>
<p>
Here the problems of failing to account for core-specific trends are worse than in my earlier example. In Figure 6, <span class="citation" data-cites="Hobbs2016-mr">Hobbs et al. (2016)</span> show <acronym title="Detrended Correspondence Analysis">DCA</acronym> axis 1 scores for cores from parks around the Great Lakes, grouped at the park level such that each panel includes data from at least three different cores. The first problem here is that throughout the authors use LOESS but never state how they determined the span used in the figures. The second issue is that the reader can't unpack the site-specific trends because the data for each site aren't differentiated by plotting symbols or colour. Most importantly, however, we see clear evidence that the LOESS trend is different to some or all of the trends, or even the data, especially in the VOYA and SLBE panels. This is not so much showing a consensus as rewriting history entirely – the fitted trend in some places doesn't even go anywhere near the data! This is an ever-present problem with this kind of analysis.
</p>
<p>
Worse still is Figure 9 from the same paper <span class="citation" data-cites="Hobbs2016-mr">(Hobbs et al., 2016)</span>, shown below
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/hobbs-et-al-figure-9.png" alt="Sediment Ī“15N from all cores standardized as z scores. Loess smooth curve in red. Figure 9 from Hobbs et al. (2016)." />
<figcaption>
Sediment Ī“<sup>15</sup>N from all cores standardized as z scores. Loess smooth curve in red. Figure 9 from <span class="citation" data-cites="Hobbs2016-mr">Hobbs et al.Ā (2016)</span>.
</figcaption>
</figure>
<p>
This figure shows an impressive number of δ<sup>15</sup>N values of bulk organic matter from many cores across the study region. Whilst it is clear, if you look at the detail, that there are many more lower δ<sup>15</sup>N values around the turn of the 20<sup>th</sup> century than before, the individual trends in δ<sup>15</sup>N are all obfuscated by the presentation. It is not clear what the LOESS smoother is showing at all; as it is a scatterplot smoother, it is showing pattern in the data points irrespective of grouping at the core level. As such we can't expect it to be representative of a common trend at all, which is what the authors surely hoped it would be!
</p>
<p>
The z-score standardisation (centring and standardising each core to have zero mean and unit variance) used here also complicates the interpretation; the axis is no longer in δ<sup>15</sup>N values (‰) but in standard deviation units from each core's mean. By giving each core the same variance we actually gloss over differences in variance which might have ecological or environmental significance. It would be better to model these features explicitly.
</p>
<h2 id="a-solution">
A solution?
</h2>
<p>
It's all well and good being critical of my own work or that of others, but unless that critique comes with suggestions for ways to do better in the future, as a field we can't progress. So, what could be done to provide a better analysis in both these cases? Two things in particular spring to mind
</p>
<ol type="1">
<li>
fit an explicit model that includes terms mapped to the features of the data, and
</li>
<li>
properly estimate the degree of smoothness in the data/trend
</li>
</ol>
<p>
Here on this blog I've discussed ways to handle point 2<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>, and I have some additional thoughts based on new types of smoothers and ideas from spatial statistics that happen to fit in with the GAM approach and spline bases. These ideas form the basis of a paper I'm writing at the moment.
</p>
<p>
Point 1 could be handled in a variety of ways:
</p>
<ul>
<li>
<p>
Fit a stochastic time series model using either maximum likelihood methods or Bayesian estimation. Such models include state-space formulations of the classic ARIMA-type models, and they can account for site-specific effects, underlying latent trends that we have noisy observations of, and the irregular sampling and change of support<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a> inherent to most sediment core records.
</p>
</li>
<li>
<p>
For the consensus reconstruction example, a GAM with three separate trends, or a common trend plus three separate departure trends, would allow the explicit modelling of the features of interest. This is most easily achieved using <code>by</code> variable smooths in the <strong>mgcv</strong> package using <code>gam()</code>. If fitting a common trend and site-specific departures from this common trend, the site-specific departures need to be modelled using penalties on the first derivative (usually penalties are on the second derivative) to penalise departure from a flat function, which represents no departure from the common trend (see the sketch after this list).
</p>
</li>
<li>
<p>
For the Figure 6 example from <span class="citation" data-cites="Hobbs2016-mr">Hobbs et al. (2016)</span>, there are enough cores to potentially model them as random effects, again as site-specific trends or as a common trend plus site-specific departures. The random effect splines are an efficient way of fitting many trends, and can be fitted using the factor-smooth interaction basis (<code>s(time, fac, bs = "fs")</code> using <code>gam()</code> in <strong>mgcv</strong> for smooths of <code>time</code> for each level of factor <code>fac</code>; also sketched below) or via tensor product smooths combining a marginal smooth for <code>time</code> and a marginal random effect spline for each level of <code>fac</code>.
</p>
</li>
<li>
<p>
For the Figure 9 example, there are certainly enough cores to warrant a random effect spline approach as mentioned above.
</p>
</li>
</ul>
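<p>
To give a flavour of the <strong>mgcv</strong> formulations mentioned in the list above, here is a hedged sketch; <code>dat</code> is a hypothetical data frame with columns <code>value</code>, <code>time</code>, and a factor <code>core</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("mgcv")
## (i) common trend plus per-core difference smooths; m = 1 puts the
## penalty on the first derivative of the departures, as described above
m1 <- gam(value ~ core + s(time) + s(time, by = core, m = 1),
          data = dat, method = "REML")
## (ii) common trend plus random-effect-like core-specific smooths
## via the factor-smooth interaction basis
m2 <- gam(value ~ s(time) + s(time, core, bs = "fs"),
          data = dat, method = "REML")</code></pre>
</figure>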
<p>
This blog post is already long enough and I don't have time to go into specific details of fitting random effect splines, by-variable splines, or splines based on ideas from kriging here. In the next few months I'll write up posts on these methods as both areas are being developed into manuscripts; the random effect spline method is a collaboration with Eric Pedersen, David Miller, and Noam Ross.
</p>
<p>
In the consensus reconstruction and both of the <span class="citation" data-cites="Hobbs2016-mr">Hobbs et al. (2016)</span> examples, a strong argument can be made for modelling a common trend plus site-specific departures, because in both cases interest lies in trying to identify the common trend, and detailed site- or method-specific trends are of secondary concern.
</p>
<h2 id="whither-loess">
Whither LOESS?
</h2>
<p>
Where does this leave LOESS? I think it is clear that LOESS is perfectly acceptable as an <em>exploratory</em> method only. It makes few assumptions about the data, and because the user needs to specify a span/bandwidth parameter it allows for interactive investigation of a range of potential temporal trends of varying smoothness. As a more formal method for fitting models with which one can actually answer scientific questions, LOESS is far less useful. This isn't the fault of LOESS; it was designed as a scatterplot smoother, not for fitting multivariate time series models. The issue is rather our reliance on LOESS without understanding or acknowledging its deficiencies for actual model fitting.
</p>
<p>
The problem of arbitrary choices of span parameter in LOESS can be worked around with a cross-validation procedure suited to handling temporally autocorrelated data. But the multivariate time series issues I've discussed in detail here are less easily solved. It's not that they can't be solved; the original GAM software used LOESS smooths as part of the formal GAM procedure. But this software doesn't make it easy to fit common trend plus site-specific difference trends, as would be required for both the examples discussed above. The <code>gam()</code> function from <strong>mgcv</strong> does allow this to be done with relative ease, hence this approach is something I've been exploring. The Bayesian approaches are probably our best long-term solution to modelling palaeoecological data because of their flexibility. But that flexibility comes at a price: complexity.
</p>
<p>
And that brings me to my final point. As a field, palaeolimnology really needs to take training in quantitative methods more seriously, in particular modern methods such as the GAMs that I've found most useful, and Bayesian techniques in general. Where young palaeolimnologists get any training, it is most often in the traditional methods that were adopted from a time before we had real computing power available to us and before Statistics, the science, had developed methods to really handle the sorts of data we were generating. We are currently going through a revolution in the development of methods for use with multivariate ecological data and complex time series data. Palaeolimnologists risk being left behind here, and this worries me. A lot. I mainly worry because at best we are paying lip service to the deficiencies in the field in terms of our quantitative prowess. And it is beginning to show in the quality of science we do and the ways we try to answer important ecological and environmental questions.
</p>
<p>
I find this troubling indeed…
</p>
<h2 id="references" class="unnumbered">
References
</h2>
<div id="refs" class="references">
<div id="ref-Hobbs2016-mr">
<p>
Hobbs, W. O., Lafrancois, B. M., Stottlemyer, R., Toczydlowski, D., Engstrom, D. R., Edlund, M. B., et al. (2016). Nitrogen deposition to lakes in national parks of the western Great Lakes region: Isotopic signatures, watershed retention, and algal shifts. <em>Global Biogeochemical Cycles</em>, 2015GB005228. doi:<a href="https://doi.org/10.1002/2015GB005228">10.1002/2015GB005228</a>.
</p>
</div>
<div id="ref-Simpson2012-zi">
<p>
Simpson, G. L., and Hall, R. I. (2012). "Human impacts: Applications of numerical methods to evaluate surface-water acidification and eutrophication," in <em>Tracking environmental change using lake sediments</em>, Developments in Paleoenvironmental Research. (Springer Netherlands), 579–614. doi:<a href="https://doi.org/10.1007/978-94-007-2745-8_19">10.1007/978-94-007-2745-8_19</a>.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
an entire hive is perhaps more apt!<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
<li id="fn2">
<p>
<a href="/2011/06/12/additive-modelling-and-the-hadcrut3v-global-mean-temperature-series/">here</a>, <a href="/2011/07/21/smoothing-temporally-correlated-data/">here</a>, and <a href="/2016/03/25/additive-modeling-global-temperature-series-revisited/">here</a> for example<a href="#fnref2" class="footnote-back">ā©</a>
</p>
</li>
<li id="fn3">
<p>
If you think about what we record in our sediment samples, it is clear that this sequence is a highly modified version of the real per-unit-time sedimentation that occurred in the lake. Hence we wish to make inference on something we haven't actually observed directly. We can model the unobserved sequence as a latent trend polluted by noise. Because of compaction and bioturbation etc., each sediment slice represents a different amount of time. In other words, each observation is supported by contributions from one or more unit-time observations from the unobserved latent trend. Samples from 100 years ago might be supported by 4 years of observations from the latent process, but near the top of the core a single year from the latent process might be represented in each of the observations. This problem/feature is known as change of support.<a href="#fnref3" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
Soap-film smoothers & lake bathymetries
Gavin L. Simpson
2016-03-27T00:00:00-06:00
2016-03-27T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/03/27/soap-film-smoothers/
<p>
A number of years ago, whilst I was still working at <a href="http://www.ensis.org.uk/">ENSIS</a>, the consultancy arm of the <a href="http://www.ensis.org.uk/">ECRC</a> at <a href="http://www.ucl.ac.uk">UCL</a>, I worked on a project for the (then) Countryside Council for Wales (CCW; now part of <a href="http://naturalresources.wales">Natural Resources Wales</a>). I don't recall why they were doing this project, but we were tasked with producing a standardised set of bathymetric maps for Welsh lakes. The brief called for the bathymetries to be provided in standard GIS formats. Either CCW's project manager or the project lead at ENSIS had proposed to use <a href="https://en.wikipedia.org/wiki/Inverse_distance_weighting">inverse distance weighting</a> (IDW) to smooth the point bathymetric measurements. This probably stemmed from the person who initiated our bathymetric programme at ENSIS being a GIS wizard, schooled in the ways of ArcGIS. My involvement was mainly data processing of the IDW results. I was, however, at the time also somewhat familiar with the problem of <em>finite area smoothing</em><a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> and had read a paper of Simon Wood's on his then new soap-film smoother <span class="citation" data-cites="Wood2008-gy">(Wood et al., 2008)</span>. So, as well as writing scripts to process and present the IDW-based bathymetry data in the report, I snuck a task into the work programme that allowed me to investigate using soap-film smoothers for modelling lake bathymetric data. The timing was never great to write up this method (two children and a move to Canada have occurred since the end of this project), so I've not done anything with the idea. Until now…
</p>
<div id="refs" class="references">
<div id="ref-Wood2008-gy">
<p>
Wood, S. N., Bravington, M. V., and Hedley, S. L. (2008). Soap film smoothing. <em>Journal of the Royal Statistical Society. Series B, Statistical methodology</em> 70, 931ā955. doi:<a href="https://doi.org/10.1111/j.1467-9868.2008.00665.x">10.1111/j.1467-9868.2008.00665.x</a>.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
smoothing over a domain with known boundaries, like a lake<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
<p>
In this post, I want to introduce the concept of finite area smoothing and illustrate the use of soap-film smoothers in modelling lake bathymetric data.
</p>
<h2 id="finite-area-smoothing">
Finite area smoothing
</h2>
<p>
Often, we seek to model a response over a well-defined region with a known boundary. This problem is known as <em>finite area smoothing</em>, or, as Ramsay put it, <em>smoothing over difficult regions</em> <span class="citation" data-cites="Ramsay2002-mv">(2002)</span>. Why this problem is more difficult than it sounds is well illustrated by the test function introduced by <span class="citation" data-cites="Ramsay2002-mv">Ramsay (2002)</span>, a version of which is shown below<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span><span class="w">
</span><span class="n">fsb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fs.boundary</span><span class="p">()</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">300</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">150</span><span class="w">
</span><span class="n">xm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m</span><span class="p">)</span><span class="w">
</span><span class="n">yn</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">xx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">xm</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">yy</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">yn</span><span class="p">,</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">tru</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">fs.test</span><span class="p">(</span><span class="n">xx</span><span class="p">,</span><span class="w"> </span><span class="n">yy</span><span class="p">),</span><span class="w"> </span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="c1">## truth</span><span class="w">
</span><span class="n">truth</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xx</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">yy</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.vector</span><span class="p">(</span><span class="n">tru</span><span class="p">))</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s2">"viridis"</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">truth</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_raster</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_contour</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">z</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">),</span><span class="w"> </span><span class="n">binwidth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"white"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.data.frame</span><span class="p">(</span><span class="n">fsb</span><span class="p">),</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_viridis</span><span class="p">(</span><span class="n">na.value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NA</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"top"</span><span class="p">,</span><span class="w"> </span><span class="n">legend.key.width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unit</span><span class="p">(</span><span class="m">2.5</span><span class="p">,</span><span class="w"> </span><span class="s2">"cm"</span><span class="p">))</span><span class="w">
</span><span class="n">p</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/soap-film-smoothers-fs-boundary-figure-1.png" alt="Ramsay's test function" />
<figcaption>
Ramsay's test function
</figcaption>
</figure>
<p>
The domain of the test function is a rotated U shape. Each stem of the U has quite different values of the response, achieved by smoothly varying the response along the U itself. Between the two stems is a barrier in the spatial domain. Smoothing across this barrier would bleed information from one side to the other, which would lead to poorly predicted values. One solution to the problem of smoothing inside domains such as the one shown is to consider only distances between points within the domain, not distances over some bounding box of the problem. In other words, we shouldn't assume points on either side of the barrier in the test function are similar just because they are close in the <em>y</em> coordinate.
</p>
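<p>
To see why Euclidean proximity is misleading here, we can evaluate the test function at two points that face each other across the barrier. This is a minimal sketch using mgcv's built-in <code>fs.test()</code>; the coordinates are simply ones I have assumed fall inside each arm of the U.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("mgcv")
## two points a short Euclidean distance apart, but on opposite arms of the U
f_upper <- fs.test(2.5,  0.6) # inside the upper arm
f_lower <- fs.test(2.5, -0.6) # inside the lower arm
## the straight-line distance between the points is small...
sqrt((2.5 - 2.5)^2 + (0.6 - (-0.6))^2)
## ...yet the true values of the test function differ markedly
c(upper = f_upper, lower = f_lower)</code></pre>
</figure>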
<h2 id="soap-film-smoothers">
Soap-film smoothers
</h2>
<p>
Bubble artists can do some amazing things with a few props and copious amounts of soapy solution. If you've ever seen a bubble artist perform, you'll never look at the little bottles of bubbles that kids use to blow simple round bubbles in the same way again. Whilst soap-film smoothers aren't quite as amazing as the soapy wonders produced by bubble artists, how they work is directly related to one form of bubble art<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a>.
</p>
<p>
If we start from the simple kids-toy version for blowing round bubbles, then you'll know that there is a small loop within which an exceedingly thin film of soapy liquid is contained. Blowing through the loop deforms the soapy film, and if you blow gently, eventually you can deform the film enough that it detaches from the loop and forms a perfect, iridescent, soapy ball of fun<a href="#fn4" class="footnote-ref" id="fnref4"><sup>4</sup></a>. Bubble artists employ more complex loops, but the principle remains the same and, in essence, this is <em>exactly</em> how soap-film smoothers work.
</p>
<p>
Return for a moment to Ramsay's test function shown above. The loop is formed by the boundary of the domain. Imagine a soapy film suspended within this loop, and further imagine that we can somehow blow over the region to deform the film in such a way as to move the film towards the data. In the test function above, we'd need to "blow" on the film so that it deformed towards us in the upper stem of the U, and away from us in the lower stem (assuming that we're mapping the data values to the z coordinate). Quite a lot of complexity underlies exactly <em>how</em> the soap-film smoother achieves this, but the general principle is exceedingly simple.
</p>
<p>
Soap-film smoothers comprise two separate types of smoother: one for the boundary and one for the film itself. The boundary smoother is often a cyclic spline, in order to have the ends of the spline join nicely at the "end points" of the boundary. If the value of the response at the boundary is known, such as lake depth being zero at the margin of the lake, then the boundary can be fixed at these values without needing a spline to model values on the boundary. If the response is not known at the boundary, it can be estimated using the boundary spline.
</p>
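<p>
To make the two boundary options concrete, here is a minimal sketch of how a soap-film boundary is specified in <strong>mgcv</strong>; the square loop and the names <code>bx</code>, <code>by</code>, <code>bnd_fixed</code>, and <code>bnd_free</code> are purely illustrative.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("mgcv")
## an illustrative closed boundary loop (first and last points coincide)
bx <- c(0, 1, 1, 0, 0)
by <- c(0, 0, 1, 1, 0)
## known boundary: the response is fixed at 0 on the boundary via `f`
bnd_fixed <- list(list(x = bx, y = by, f = rep(0, length(bx))))
## unknown boundary: omit `f` and the boundary values are estimated instead
bnd_free <- list(list(x = bx, y = by))</code></pre>
</figure>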
<h2 id="lake-bathymetric-data">
Lake bathymetric data
</h2>
<p>
What do soap films have to do with lake bathymetric data? Basically, the problem of modelling depth soundings is exactly the same as the one illustrated by Ramsay's test function. We have a well-defined boundary<a href="#fn5" class="footnote-ref" id="fnref5"><sup>5</sup></a>, and all but the most simple lakes have shoreline features that we don't want to smooth across, such as peninsulas<a href="#fn6" class="footnote-ref" id="fnref6"><sup>6</sup></a>.
</p>
<p>
The figure below shows lake depth soundings from the <a href="https://en.wikipedia.org/wiki/Cosmeston_Lakes_Country_Park">Cosmeston Lakes</a>, two now-flooded former quarries joined by a narrow channel.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"rgdal"</span><span class="p">)</span><span class="w">
</span><span class="c1">## Update this if I can post the Cosmeston data</span><span class="w">
</span><span class="n">dataDIR</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"/home/gavin/work/projects/ccw/data/CCW_Final_Data/42721_Cosmeston_Park/."</span><span class="w">
</span><span class="n">outline</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readOGR</span><span class="p">(</span><span class="n">dataDIR</span><span class="p">,</span><span class="w"> </span><span class="s2">"42721_Cosmeston_Lake_lake_polyline"</span><span class="p">)</span><span class="w">
</span><span class="n">depth</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readOGR</span><span class="p">(</span><span class="n">dataDIR</span><span class="p">,</span><span class="w"> </span><span class="s2">"d17_42721_xyz"</span><span class="p">)</span><span class="w">
</span><span class="n">foutline</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fortify</span><span class="p">(</span><span class="n">outline</span><span class="p">)</span><span class="w">
</span><span class="n">fdepth</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">depth</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">foutline</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lat</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fdepth</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_y</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">depth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_fixed</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Northing"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Easting"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_color_viridis</span><span class="p">()</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/soap-film-smoothers-show-comeston-lakes-baty-data-1.png" alt="Cosmeston Lakes depth sounding data" />
<figcaption>
Cosmeston Lakes depth sounding data
</figcaption>
</figure>
<p>
This example is very similar to that of Ramsay's test function. We don't want to smooth across the narrow peninsula because there is no reason to presume the bed topography is the same on either side.
</p>
<h2 id="additive-models-for-lake-bathymetry-data">
Additive models for lake bathymetry data
</h2>
<p>
If we weren't worried about the boundary, we could use a thin plate regression spline (TPRS) smoother to model how depth varies spatially. The TPRS basis is perfect for this as the <code>x</code> and <code>y</code> data are in the same units. Hence a simple GAM would seem OK, if we weren't worried about those pesky boundaries.
</p>
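<p>
As an aside (a sketch only, not something these data need), if the coordinates were in different units the isotropic TPRS would be a poor choice; a scale-invariant tensor product smooth is the usual alternative. The object name <code>m_te</code> is hypothetical.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## only relevant if os_x and os_y were measured in different units
m_te <- gam(-depth ~ te(os_x, os_y, k = c(10, 10)),
            data = depth, method = "REML")</code></pre>
</figure>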
<p>
The wrong thing to do, then, would be the following, which ignores the lake boundary information of zero depths.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span><span class="w">
</span><span class="n">crds</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coordinates</span><span class="p">(</span><span class="n">outline</span><span class="p">)[[</span><span class="m">1</span><span class="p">]][[</span><span class="m">1</span><span class="p">]]</span><span class="w">
</span><span class="n">tprs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="o">-</span><span class="n">depth</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">os_y</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">60</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">depth</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">)</span><span class="w">
</span><span class="n">summary</span><span class="p">(</span><span class="n">tprs</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
-depth ~ s(os_x, os_y, k = 60)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.35075 0.07813 -68.49 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(os_x,os_y) 41.45 51.06 19.08 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.787 Deviance explained = 82%
-REML = 491.88 Scale est. = 1.6175 n = 265</code></pre>
</figure>
<p>
The fitted smoother uses about 40 degrees of freedom and explains about 80% of the variance in the observed depths. To visualise the fitted surface, I create a data set of x and y coordinates over the bounding box of the spatial data. At this stage I'm not going to remove any of the prediction points that lie outside the lake, as I want to show what the TPRS smoother is doing. The code basically
</p>
<ul>
<li>
sets up a 2.5 meter-resolution grid in the x and y directions
</li>
<li>
predicts from the model at each location
</li>
<li>
creates a temporary version of the predictions, setting all depths > 0 to <code>NA</code>, which removes some distracting behaviour far from the support of the observations.
</li>
</ul>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">grid.x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">tprs</span><span class="o">$</span><span class="n">var.summary</span><span class="p">,</span><span class="w">
</span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">crds</span><span class="p">[,</span><span class="m">1</span><span class="p">])),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">crds</span><span class="p">[,</span><span class="m">1</span><span class="p">])),</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2.5</span><span class="p">))</span><span class="w">
</span><span class="n">grid.y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">tprs</span><span class="o">$</span><span class="n">var.summary</span><span class="p">,</span><span class="w">
</span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">os_y</span><span class="p">,</span><span class="w"> </span><span class="n">crds</span><span class="p">[,</span><span class="m">2</span><span class="p">])),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">os_y</span><span class="p">,</span><span class="w"> </span><span class="n">crds</span><span class="p">[,</span><span class="m">2</span><span class="p">])),</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2.5</span><span class="p">))</span><span class="w">
</span><span class="n">pdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">tprs</span><span class="o">$</span><span class="n">var.summary</span><span class="p">,</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">os_x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">grid.x</span><span class="p">,</span><span class="w"> </span><span class="n">os_y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">grid.y</span><span class="p">))</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">pdata</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"os_x"</span><span class="p">,</span><span class="s2">"os_y"</span><span class="p">)</span><span class="w">
</span><span class="c1">##predictions</span><span class="w">
</span><span class="n">pdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdata</span><span class="p">,</span><span class="w"> </span><span class="n">Depth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">tprs</span><span class="p">,</span><span class="w"> </span><span class="n">pdata</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"response"</span><span class="p">))</span><span class="w">
</span><span class="n">tmp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pdata</span><span class="w"> </span><span class="c1"># temporary version...</span><span class="w">
</span><span class="n">take</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">tmp</span><span class="p">,</span><span class="w"> </span><span class="n">Depth</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w"> </span><span class="c1"># getting rid of > 0 depth points</span><span class="w">
</span><span class="n">tmp</span><span class="o">$</span><span class="n">Depth</span><span class="p">[</span><span class="n">take</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="kc">NA</span></code></pre>
</figure>
<p>
The TPRS fitted surface is plotted with the observed data using
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">foutline</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lat</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_raster</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tmp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_y</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Depth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fdepth</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_y</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_fixed</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Northing"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Easting"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_viridis</span><span class="p">(</span><span class="n">na.value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NA</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/soap-film-smoothers-plot-tprs-model-fit-1.png" alt="Predicted depths over the bounding box of the observations from the TPRS smoother GAM." />
<figcaption>
Predicted depths over the bounding box of the observations from the TPRS smoother GAM.
</figcaption>
</figure>
<p>
I've purposely done a poor visualisation job<a href="#fn7" class="footnote-ref" id="fnref7"><sup>7</sup></a> in the above figure as I wanted to show how the TPRS smoother bleeds information across the peninsula. Ignore the predictions off into the top left & bottom right: concentrate on the peninsula. The TPRS spline is smoothing <code>depth</code> across this region, exactly what we don't want. It's almost as if the peninsula isn't there.
</p>
<p>
Next we'll fit the soap-film smoother version. I'll take this one a bit slower as we have some work to do to set up the boundary and knot locations that the smoother needs.
</p>
<p>
For lake bathymetries we have two set-up jobs to complete
</p>
<ol type="1">
<li>
create a boundary object, with known value of <code>0</code>
</li>
<li>
choose the number of knots and their locations over the domain of interest
</li>
</ol>
<p>
The second is, in my experience, most easily achieved by using the <em>list</em> form of the allowed boundary specifications<a href="#fn8" class="footnote-ref" id="fnref8"><sup>8</sup></a>. The list form for the boundary is a list within a list. Each sublist has at least <strong>two</strong> elements containing the x and y coordinates of the boundary polygon. A component <code>f</code> may also be included, which sets the boundary condition at each location; here we set this to <code>0</code> to indicate that the depth tends to <code>0</code> at the lake shore. In the code below I create this from the <code>coordinates()</code> object created earlier.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">bound</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="nf">list</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">crds</span><span class="p">[,</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">crds</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">f</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">crds</span><span class="p">))))</span></code></pre>
</figure>
<p>
Choosing the number and location of knots is trickier, especially if you are trying to automate this for a large number of lakes. The key requirement is that any knots are contained entirely <em>within</em> the lake boundary. <strong>mgcv</strong> provides the <code>inSide()</code> function to facilitate this. Unfortunately, <code>inSide()</code> doesn't provide <em>exactly</em> the same check for being inside the boundary as the one used by the soap-film smooth constructor called when you fit the model. The procedure I outline below is the one I've found most useful to date, but I make no guarantee that it is optimal nor that it will work for your data problem<a href="#fn9" class="footnote-ref" id="fnref9"><sup>9</sup></a>.
</p>
<p>
Here I choose to create a 10 by 10 regular grid of locations over the bounding box of the coordinates. From this grid I retain those points that are contained within the lake boundary.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10</span><span class="w">
</span><span class="n">gx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">crds</span><span class="p">[,</span><span class="m">1</span><span class="p">]),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">crds</span><span class="p">[,</span><span class="m">1</span><span class="p">]),</span><span class="w"> </span><span class="n">len</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">N</span><span class="p">)</span><span class="w">
</span><span class="n">gy</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">crds</span><span class="p">[,</span><span class="m">2</span><span class="p">]),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">crds</span><span class="p">[,</span><span class="m">2</span><span class="p">]),</span><span class="w"> </span><span class="n">len</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">N</span><span class="p">)</span><span class="w">
</span><span class="n">gp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">gx</span><span class="p">,</span><span class="w"> </span><span class="n">gy</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">gp</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"x"</span><span class="p">,</span><span class="s2">"y"</span><span class="p">)</span><span class="w">
</span><span class="n">knots</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gp</span><span class="p">[</span><span class="n">with</span><span class="p">(</span><span class="n">gp</span><span class="p">,</span><span class="w"> </span><span class="n">inSide</span><span class="p">(</span><span class="n">bound</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)),</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">knots</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"os_x"</span><span class="p">,</span><span class="w"> </span><span class="s2">"os_y"</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">bound</span><span class="p">[[</span><span class="m">1</span><span class="p">]])</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"os_x"</span><span class="p">,</span><span class="w"> </span><span class="s2">"os_y"</span><span class="p">,</span><span class="w"> </span><span class="s2">"f"</span><span class="p">)</span></code></pre>
</figure>
<p>
The last two lines set the boundary and knot names to match the variable names in the depth data used to fit the model.
</p>
<p>
The choice of 10 for the sides of the grid is useful here as that puts enough points within the lake for the knots of the smoother, but doesn't require any nudging of the grid to get the selected points to fall nicely within the boundary. In other examples, I've needed to tailor the number of points in the grid and shift it by a few meters to get as many of the regular grid points as possible to fall inside the boundary. You may even find that you need to locate the knots individually. Using <code>locator()</code> after plotting the lake outline is an expedient, but entirely manual, way to do this if you have to; see the sketch below.
</p>
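<p>
For the fully manual route, something like the following works in an interactive session. This is a sketch only: <code>locator()</code> needs a graphics device and a patient human, and the name <code>knots_manual</code> is hypothetical (you would pass it to the <code>knots</code> argument of <code>gam()</code> in place of the grid-based knots).
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## draw the outline, then click once per knot; right-click (or Esc) to finish
plot(crds[, 1], crds[, 2], type = "l", asp = 1,
     xlab = "Easting", ylab = "Northing")
picked <- locator(type = "p")
knots_manual <- data.frame(os_x = picked$x, os_y = picked$y)
## sanity check: every hand-picked knot must lie inside the boundary
stopifnot(all(with(knots_manual, inSide(bound, os_x, os_y))))</code></pre>
</figure>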
<p>
What the grid-based selection process looks like is shown in the figure below
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/soap-film-smoothers-plot-soap-film-set-up-1.png" alt="Illustration of the knot selection procedure. The large circles are the locations of the sparse regular grid of points over the bounding box of the data. The filled red circles are those grid points that are found inside the lake boundary and thus chosen as knots for the soap-film smoother. The small black dots are the locations of the observed depth data." />
<figcaption>
Illustration of the knot selection procedure. The large circles are the locations of the sparse regular grid of points over the bounding box of the data. The filled red circles are those grid points that are found inside the lake boundary and thus chosen as knots for the soap-film smoother. The small black dots are the locations of the observed depth data.
</figcaption>
</figure>
<p>
Fitting the soap-film model is quite similar to any other GAM you may have fitted with <strong>mgcv</strong>. The main exception is that you have to pass something to the <code>xt</code> argument of <code>s()</code>. If you delve into some of the more complex smoothers that have become available in <strong>mgcv</strong> in recent releases, you'll find yourself using <code>xt</code> a lot, as it is the way to pass extra information to the basis constructor functions.
</p>
<p>
For soap-film smoothers you must pass <code>xt</code> a list with component <code>bnd</code> set to an appropriate boundary object (here <code>bound</code>, as created earlier). The knots that were created earlier need to be passed to the <code>knots</code> argument. The full call to <code>gam()</code> is shown below; the soap-film basis is specified using <code>bs = "so"</code>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="o">-</span><span class="n">depth</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">os_y</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"so"</span><span class="p">,</span><span class="w"> </span><span class="n">xt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">bnd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">bound</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">depth</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span></code></pre>
</figure>
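<p>
If the soap constructor errors because a knot falls on or outside the boundary (the <code>inSide()</code> mismatch mentioned earlier), one pragmatic fallback, a heuristic of my own rather than anything built into <strong>mgcv</strong>, is to pull the knots in slightly towards their centroid and refit.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## shrink the knot grid towards its centroid by a small factor
shrink_knots <- function(kn, by = 0.98) {
    ctr <- colMeans(kn)
    data.frame(os_x = ctr[1] + by * (kn$os_x - ctr[1]),
               os_y = ctr[2] + by * (kn$os_y - ctr[2]))
}
## try the fit as-is; on failure, retry with the shrunken knots
m2 <- tryCatch(
    gam(-depth ~ s(os_x, os_y, bs = "so", xt = list(bnd = bound)),
        data = depth, method = "REML", knots = knots),
    error = function(e)
        gam(-depth ~ s(os_x, os_y, bs = "so", xt = list(bnd = bound)),
            data = depth, method = "REML", knots = shrink_knots(knots)))</code></pre>
</figure>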
<p>
The soap-film smoother explains just over 75% of the variance in the data, using just under 30 degrees of freedom. It doesn't explain quite as much variance as the TPRS model I looked at earlier, but it is substantially simpler in terms of degrees of freedom (~30 vs ~40 respectively).
</p>
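<p>
A quick programmatic check of that complexity/fit trade-off, as a sketch using the two model objects from above:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## total estimated degrees of freedom and deviance explained, side by side
data.frame(model    = c("TPRS", "Soap-film"),
           edf      = c(sum(tprs$edf), sum(m2$edf)),
           dev.expl = c(summary(tprs)$dev.expl, summary(m2)$dev.expl))</code></pre>
</figure>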
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">summary</span><span class="p">(</span><span class="n">m2</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Family: gaussian
Link function: identity
Formula:
-depth ~ s(os_x, os_y, bs = "so", xt = list(bnd = bound))
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.275 0.204 -16.05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(os_x,os_y) 27.27 38 19.95 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.742 Deviance explained = 76.8%
-REML = 501.32 Scale est. = 1.9607 n = 265</code></pre>
</figure>
<p>
Soap-film GAMs come with their own <code>plot()</code> method
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">lims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">crds</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">range</span><span class="p">)</span><span class="w">
</span><span class="n">ylim</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lims</span><span class="p">[,</span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">xlim</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lims</span><span class="p">[,</span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">m2</span><span class="p">,</span><span class="w"> </span><span class="n">asp</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ylim</span><span class="p">,</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xlim</span><span class="p">,</span><span class="w"> </span><span class="n">se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">scheme</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/soap-film-smoothers-soap-film-plot-method-1.png" alt="Contour plot of the fitted soap-film spline produced using plot.gam() with scheme = 2." />
<figcaption>
Contour plot of the fitted soap-film spline produced using <code>plot.gam()</code> with <code>scheme = 2</code>.
</figcaption>
</figure>
<p>
Notice how the contours of the fitted soap-film surface run parallel to the peninsular shoreline, just as we'd expect from studying lakes. We'll return to this momentarily.
</p>
<p>
As we aren't in the business of drawing pictures<a href="#fn10" class="footnote-ref" id="fnref10"><sup>10</sup></a>
</p>
<blockquote class="twitter-tweet" data-lang="en" align="center">
<p lang="en" dir="ltr">
If you want to draw pictures, base graphics is better than ggplot2. But most people don't want to draw pictures with <a
href="https://twitter.com/hashtag/rstats?src=hash">#rstats</a>
</p>
– Hadley Wickham (@hadleywickham) <a href="https://twitter.com/hadleywickham/status/712336453317963776">March 22, 2016</a>
</blockquote>
<p>
I should plot the fitted model using <code>ggplot()</code>. Note that the <code>predict()</code> step here is slow; I could probably speed it up a lot by removing all the points that lie outside the lake boundary (see below), because we already know those points will be <code>NA</code> and hence dropped from any plotting or subsequent analysis.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pdata2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdata</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">Depth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m2</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdata</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">foutline</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lat</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_raster</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdata2</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_y</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Depth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fdepth</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_y</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_fixed</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Northing"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Easting"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_viridis</span><span class="p">(</span><span class="n">na.value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NA</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/soap-film-smoothers-ggplot-soap-film-1.png" alt="The fitted surface achieved using a soap-film smoother" />
<figcaption>
The fitted surface achieved using a soap-film smoother
</figcaption>
</figure>
<p>
The first thing to notice is that the <code>predict()</code> method automatically sets points outside the boundary to <code>NA</code> for soap-film smoother models: you have to do this manually with the other types of smoother.
</p>
<p>
The main improvement in the soap-film model is the performance of the fitted depth surface around the peninsula. Notice how, on the right of the peninsula, the depth lessens towards the shoreline, and on the left the depth increases from 0 away from the peninsula. Importantly, however, the deeper points on the right are not leaking information across the peninsula.
</p>
<p>
We could have achieved a better fit with the TPRS model by including the boundary coordinates in the <code>depth</code> data, with depths of <code>0</code>. This would have improved the performance around the edge of the lake, but it wouldn't have had the same effect as the soap-film smoother around the peninsula. Why so? Well, in the soap-film model we set the values of the boundary to be zero, and the soap film smooths from the data points to those known values but won't smooth across the boundary of the domain. The TPRS model, however, would treat the 0 depth values differently: in simple terms, it will smooth through the values, not to them. Hence the spline will get pulled towards zero somewhat, but it will still be "averaging" the depth data from a local region around the peninsula, information which includes the deeper data we don't want to leak.
</p>
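<p>
To make that concrete, here is roughly what the augmented TPRS fit would look like; a sketch only, assuming <code>fdepth</code> contains columns <code>os_x</code>, <code>os_y</code>, and <code>depth</code> as used in the plots above, and with the names <code>shore</code>, <code>aug</code>, and <code>tprs_bnd</code> invented for illustration.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## append the shoreline as zero-depth pseudo-observations and refit
shore <- data.frame(os_x = crds[, 1], os_y = crds[, 2], depth = 0)
aug <- rbind(fdepth[, c("os_x", "os_y", "depth")], shore)
tprs_bnd <- gam(-depth ~ s(os_x, os_y, k = 60), data = aug, method = "REML")</code></pre>
</figure>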
<p>
To help compare the two surfaces, I do a little more data munging to remove TPRS points outside the lake boundary and combine them with the soap-film data.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">inlake</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pdata</span><span class="p">,</span><span class="w"> </span><span class="n">inSide</span><span class="p">(</span><span class="n">bound</span><span class="p">,</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">os_y</span><span class="p">))</span><span class="w">
</span><span class="n">pdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pdata</span><span class="p">[</span><span class="n">inlake</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="n">pdata2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">rbind</span><span class="p">(</span><span class="n">pdata</span><span class="p">,</span><span class="w"> </span><span class="n">pdata2</span><span class="p">),</span><span class="w">
</span><span class="n">Model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"TPRS"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Soap-film"</span><span class="p">),</span><span class="w">
</span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">pdata</span><span class="p">),</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">pdata2</span><span class="p">))))</span><span class="w">
</span><span class="c1">## let's drop the NAs from the Soap-film too...</span><span class="w">
</span><span class="n">take</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pdata2</span><span class="p">,</span><span class="w"> </span><span class="o">!</span><span class="nf">is.na</span><span class="p">(</span><span class="n">Depth</span><span class="p">))</span><span class="w">
</span><span class="n">pdata2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pdata2</span><span class="p">[</span><span class="n">take</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="n">poutline</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">rbind</span><span class="p">(</span><span class="n">foutline</span><span class="p">,</span><span class="w"> </span><span class="n">foutline</span><span class="p">),</span><span class="w">
</span><span class="n">Model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"TPRS"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Soap-film"</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">foutline</span><span class="p">)))</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">poutline</span><span class="p">)[</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"os_x"</span><span class="p">,</span><span class="w"> </span><span class="s2">"os_y"</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">poutline</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_raster</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdata2</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_y</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Depth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_path</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fdepth</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">os_y</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_fixed</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Northing"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Easting"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_viridis</span><span class="p">(</span><span class="n">na.value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NA</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Model</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"top"</span><span class="p">,</span><span class="w"> </span><span class="n">legend.key.width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unit</span><span class="p">(</span><span class="m">2.5</span><span class="p">,</span><span class="w"> </span><span class="s2">"cm"</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/soap-film-smoothers-combined-plot-1.png" alt="Comparison of fitted depth surfaces for the soap-film and TPRS smoother models" />
<figcaption>
Comparison of fitted depth surfaces for the soap-film and TPRS smoother models
</figcaption>
</figure>
<p>
The effect is subtle in these plots, but the differences between the two models are clear. Most importantly, the leakage of information across the peninsula, clearly visible in the TPRS model, is removed in the soap-film version.
</p>
<p>
Soap-film smoothers are not the only way to approach finite area smoothing. David Miller did his PhD with Simon Wood and developed the generalised distance spline approach to the finite area smoothing problem <span class="citation" data-cites="Miller2014-kb">(Miller and Wood, 2014)</span>, and Ramsay introduced his FELSPLINE method <span class="citation" data-cites="Ramsay2002-mv">(Ramsay, 2002)</span>. I've not had a chance to investigate David's generalised distance spline method yet, but if I do, I'll no doubt write a post comparing the results with the soap-film method.
</p>
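<p>
If you want to experiment with finite area smoothers yourself, Ramsay's horseshoe test function ships with <strong>mgcv</strong> (see <code>?fs.test</code>), which gives you a domain with a known true surface to play with. Below is a minimal sketch of fitting a soap-film smooth to that test function; the knot grid, <code>k</code>, and object names are illustrative choices of mine, not tuned values.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("mgcv")

## Ramsay's horseshoe domain as a single boundary loop
bnd <- list(fs.boundary())

## simulate noisy observations of the test function inside the domain
set.seed(1)
n <- 600
x <- runif(n) * 5 - 1
y <- runif(n) * 2 - 1
inside <- inSide(bnd, x = x, y = y)    # drop points outside the boundary
x <- x[inside]; y <- y[inside]
z <- fs.test(x, y, b = 1) + rnorm(length(x)) * 0.3

## interior knots on a coarse grid (illustrative, not tuned)
knots <- data.frame(x = rep(seq(-0.5, 3, by = 0.5), 4),
                    y = rep(c(-0.6, -0.3, 0.3, 0.6), rep(8, 4)))

## soap-film smooth: boundary goes in 'xt', interior knots in 'knots'
m_soap <- gam(z ~ s(x, y, k = 30, bs = "so", xt = list(bnd = bnd)),
              knots = knots)
plot(m_soap)</code></pre>
</figure>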
<h2 id="references" class="unnumbered">
References
</h2>
<div id="refs" class="references">
<div id="ref-Miller2014-kb">
<p>
Miller, D. L., and Wood, S. N. (2014). Finite area smoothing with generalized distance splines. <em>Environmental and Ecological Statistics</em> 21, 715–731. doi:<a href="https://doi.org/10.1007/s10651-014-0277-4">10.1007/s10651-014-0277-4</a>.
</p>
</div>
<div id="ref-Ramsay2002-mv">
<p>
Ramsay, T. (2002). Spline smoothing over difficult regions. <em>Journal of the Royal Statistical Society. Series B, Statistical Methodology</em> 64, 307–319. doi:<a href="https://doi.org/10.1111/1467-9868.00339">10.1111/1467-9868.00339</a>.
</p>
</div>
<div id="ref-Wood2008-gy">
<p>
Wood, S. N., Bravington, M. V., and Hedley, S. L. (2008). Soap film smoothing. <em>Journal of the Royal Statistical Society. Series B, Statistical Methodology</em> 70, 931–955. doi:<a href="https://doi.org/10.1111/j.1467-9868.2008.00665.x">10.1111/j.1467-9868.2008.00665.x</a>.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
smoothing over a domain with known boundaries, like a lake<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
<li id="fn2">
<p>
See the example in <code>?fs.test</code> after loading package <strong>mgcv</strong><a href="#fnref2" class="footnote-back">↩</a>
</p>
</li>
<li id="fn3">
<p>
and soap-film smooths are pretty damn cool all the same!<a href="#fnref3" class="footnote-back">↩</a>
</p>
</li>
<li id="fn4">
<p>
or a soapy, sticky mess depending upon your point of view…<a href="#fnref4" class="footnote-back">↩</a>
</p>
</li>
<li id="fn5">
<p>
ignoring the fact that lake levels often rise and fall through the year or over years.<a href="#fnref5" class="footnote-back">↩</a>
</p>
</li>
<li id="fn6">
<p>
because topography<a href="#fnref6" class="footnote-back">↩</a>
</p>
</li>
<li id="fn7">
<p>
I should have removed all prediction points <em>outside</em> the lake as these are very far from the support of the data.<a href="#fnref7" class="footnote-back">↩</a>
</p>
</li>
<li id="fn8">
<p>
The other form is a list of data frames, each data frame being a separate loop.<a href="#fnref8" class="footnote-back">↩</a>
</p>
</li>
<li id="fn9">
<p>
It is probably worth trying a range of knots and varying their locations if you are taking this very seriously.<a href="#fnref9" class="footnote-back">↩</a>
</p>
</li>
<li id="fn10">
<p>
Sorry Hadley, I couldn't resist.<a href="#fnref10" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
Additive modelling global temperature time series: revisited
Gavin L. Simpson
2016-03-25T00:00:00-06:00
2016-03-25T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2016/03/25/additive-modeling-global-temperature-series-revisited/
<p>
Quite some time ago, back in 2011, I wrote a <a href="/2011/06/12/additive-modelling-and-the-hadcrut3v-global-mean-temperature-series/">post</a> that used an additive model to fit a smooth trend to the then-current Hadley Centre/CRU global temperature time series data set. Since then the media and scientific papers have been full of reports of record warm temperatures in the past couple of years, of (imagined) controversies regarding data changes made to suit the hypothesis of human-induced global warming, and the brouhaha over whether global warming had stalled; the great <a href="https://en.wikipedia.org/wiki/Global_warming_hiatus">global warming hiatus or pause</a>. So it seemed like a good time to revisit that analysis and update it using the latest HadCRUT data.
</p>
<p>
A further motivation was my reading <span class="citation" data-cites="Cahill2015-tt">Cahill et al. (2015)</span>, in which the authors use a Bayesian change point model for global temperatures. This model is essentially piecewise linear but with smooth transitions between the piecewise linear components. I don't immediately see where in their Bayesian model the smooth transitions come from, but that's what they show. My gut reaction was: why piecewise linear with smooth transitions? Why not smooth everywhere? And that's what the additive model I show here assumes.
</p>
<p>
First, I grab the data <span class="citation" data-cites="Morice2012-wk">(Morice et al., 2012)</span> from the Hadley Centre's website and load it into R
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"curl"</span><span class="p">)</span><span class="w">
</span><span class="n">tmpf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="n">curl_download</span><span class="p">(</span><span class="s2">"http://www.metoffice.gov.uk/hadobs/hadcrut4/data/current/time_series/HadCRUT.4.4.0.0.annual_ns_avg.txt"</span><span class="p">,</span><span class="w"> </span><span class="n">tmpf</span><span class="p">)</span><span class="w">
</span><span class="n">gtemp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.table</span><span class="p">(</span><span class="n">tmpf</span><span class="p">,</span><span class="w"> </span><span class="n">colClasses</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="s2">"numeric"</span><span class="p">,</span><span class="w"> </span><span class="m">12</span><span class="p">))[,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="c1"># only want some of the variables</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">gtemp</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Year"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Temperature"</span><span class="p">)</span></code></pre>
</figure>
<p>
The values in <code>Temperature</code> are anomalies relative to 1961–1990, in degrees C.
</p>
<p>
The model I fitted in the last post was
</p>
<p>
<span class="math display">\[ y = \beta_0 + f(\text{Year}) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 \boldsymbol{\Lambda}) \]</span>
</p>
<p>
where we have a smooth function of <code>Year</code> as the trend, and allow for possibly correlated residuals via the correlation matrix <span class="math inline">\(\boldsymbol{\Lambda}\)</span>.
</p>
<p>
The data set contains a partial set of observations for 2016, but seeing as that year is (at the time of writing) incomplete, I delete that observation.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">gtemp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">gtemp</span><span class="p">,</span><span class="w"> </span><span class="m">-1</span><span class="p">)</span><span class="w"> </span><span class="c1"># -1 drops the last row</span></code></pre>
</figure>
<p>
The data are shown below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span><span class="w">
</span><span class="n">theme_set</span><span class="p">(</span><span class="n">theme_bw</span><span class="p">())</span><span class="w">
</span><span class="n">p1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">gtemp</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Temperature</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w">
</span><span class="n">p1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_line</span><span class="p">()</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-plot-temperature-data-1.png" alt="HadCRUT4 global mean temperature anomaly" />
<figcaption>
HadCRUT4 global mean temperature anomaly
</figcaption>
</figure>
<p>
The model described above can be fitted using the <code>gamm()</code> function in the <strong>mgcv</strong> package. There are other options that allow one to use <code>gam()</code>, or even <code>bam()</code> in the same package, which are simpler, but I want to keep this post consistent with the one from a few years ago, so <code>gamm()</code> it is. Recall that <code>gamm()</code> represents the additive model as a mixed effects model via the well-known equivalence between random effects and splines, and fits the model using <code>lme()</code>. This allows for correlation structures in the residuals. Previously we saw that an AR(1) process in the residuals was the best fitting of the models tried, so we start with that and then try a model with AR(2) errors.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: nlme</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">This is mgcv 1.8-12. For overview type 'help("mgcv-package")'.</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Year</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gtemp</span><span class="p">,</span><span class="w"> </span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">m2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Year</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gtemp</span><span class="p">,</span><span class="w"> </span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">))</span></code></pre>
</figure>
<p>
A generalised likelihood ratio test suggests little support for the more complex AR(2) errors model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">anova</span><span class="p">(</span><span class="n">m1</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m2</span><span class="o">$</span><span class="n">lme</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> Model df AIC BIC logLik Test L.Ratio p-value
m1$lme 1 5 -277.7465 -262.1866 143.8733
m2$lme 2 6 -278.2519 -259.5799 145.1259 1 vs 2 2.50538 0.1135</code></pre>
</figure>
<p>
The AR(1) has successfully modelled most of the residual correlation
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ACF</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">acf</span><span class="p">(</span><span class="n">resid</span><span class="p">(</span><span class="n">m1</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"normalized"</span><span class="p">),</span><span class="w"> </span><span class="n">plot</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">ACF</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="nf">unclass</span><span class="p">(</span><span class="n">ACF</span><span class="p">)[</span><span class="nf">c</span><span class="p">(</span><span class="s2">"acf"</span><span class="p">,</span><span class="w"> </span><span class="s2">"lag"</span><span class="p">)]),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"ACF"</span><span class="p">,</span><span class="s2">"Lag"</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">ACF</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Lag</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ACF</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_hline</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">yintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_segment</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">xend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Lag</span><span class="p">,</span><span class="w"> </span><span class="n">yend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-plot-acf-1.png" alt="Autocorrelation function of residuals from the additive model with AR(1) errors" />
<figcaption>
Autocorrelation function of residuals from the additive model with AR(1) errors
</figcaption>
</figure>
<p>
Before drawing the fitted trend, I want to put a simultaneous confidence interval around the estimate. <strong>mgcv</strong> makes this very easy to do via <em>posterior simulation</em>. To simulate from the fitted model, I have written a <code>simulate.gamm()</code> method for the <code>simulate()</code> generic that ships with R. The code below downloads the Gist containing the <code>simulate.gamm()</code> code and then uses it to simulate from the model at 200 locations over the time period of the observations. I've written about posterior simulation from GAMs before, so if the code below or the general idea isn't clear, I suggest you check out the <a href="/2011/06/12/additive-modelling-and-the-hadcrut3v-global-mean-temperature-series/">earlier post</a>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">tmpf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="n">curl_download</span><span class="p">(</span><span class="s2">"https://gist.githubusercontent.com/gavinsimpson/d23ae67e653d5bfff652/raw/25fd719c3ab699e48927e286934045622d33b3bf/simulate.gamm.R"</span><span class="p">,</span><span class="w"> </span><span class="n">tmpf</span><span class="p">)</span><span class="w">
</span><span class="n">source</span><span class="p">(</span><span class="n">tmpf</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">gtemp</span><span class="p">,</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">Year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="nf">min</span><span class="p">(</span><span class="n">Year</span><span class="p">),</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Year</span><span class="p">),</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)))</span><span class="w">
</span><span class="n">sims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">nsim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">newd</span><span class="p">)</span><span class="w">
</span><span class="n">ci</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">sims</span><span class="p">,</span><span class="w"> </span><span class="m">1L</span><span class="p">,</span><span class="w"> </span><span class="n">quantile</span><span class="p">,</span><span class="w"> </span><span class="n">probs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">))</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">newd</span><span class="p">,</span><span class="w">
</span><span class="n">fitted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m1</span><span class="o">$</span><span class="n">gam</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">newd</span><span class="p">),</span><span class="w">
</span><span class="n">lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ci</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w">
</span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ci</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="p">])</span></code></pre>
</figure>
<p>
Having arranged the fitted values and the upper and lower simultaneous confidence intervals tidily, they can be added easily to the existing plot of the data
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lower</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upper</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="p">),</span><span class="w">
</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"grey"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-plot-fitted-trend-1.png" alt="Estimated trend in global mean temperature plus 95% simultaneous confidence interval" />
<figcaption>
Estimated trend in global mean temperature plus 95% simultaneous confidence interval
</figcaption>
</figure>
<p>
Whilst the simultaneous confidence interval shows the uncertainty in the fitted trend, it isn't as clear about what form this uncertainty takes; for example, periods where there is little change or large uncertainty are often characterised by a wide range of functional forms, not just flat, smooth functions. To get a sense of the uncertainty in the <em>shapes</em> of the simulated trends we can plot some of the draws from the posterior distribution of the model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">50</span><span class="w">
</span><span class="n">sims2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">sims</span><span class="p">[,</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="m">10000</span><span class="p">,</span><span class="w"> </span><span class="n">S</span><span class="p">)]),</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="s2">"sim"</span><span class="p">,</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">S</span><span class="p">)))</span><span class="w">
</span><span class="n">sims2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">stack</span><span class="p">(</span><span class="n">sims2</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Temperature"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Simulation"</span><span class="p">))</span><span class="w">
</span><span class="n">sims2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">sims2</span><span class="p">,</span><span class="w"> </span><span class="n">Year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">newd</span><span class="o">$</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">S</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">sims2</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Temperature</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Simulation</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-plot-simulations-1.png" alt="50 random simulated trends drawn from the posterior distribution of the fitted model" />
<figcaption>
50 random simulated trends drawn from the posterior distribution of the fitted model
</figcaption>
</figure>
<p>
If you look closely at the period 1850–1900, you'll notice a wide range of trends through this period, each of which is consistent with the fitted model but illustrates the uncertainty in the estimates of the spline coefficients. An additional factor is that these splines have a global amount of smoothness; once the smoothness parameter(s) are estimated, the smoothness allowance this affords is spread evenly over the fitted function. <em>Adaptive</em> splines would solve this problem as they in effect allow you to spread the smoothness allowance unevenly, using it sparingly where there is no smooth variation in the data and applying it liberally where there is.
</p>
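<p>
As an aside, <strong>mgcv</strong> does provide an adaptive basis via <code>bs = "ad"</code>. Below is a minimal sketch of refitting the trend with it; note that this uses <code>gam()</code> and hence ignores the AR(1) structure in the residuals, so it is only a rough illustration, and the basis dimension <code>k</code> and the object name are arbitrary choices of mine.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## adaptive smooth: the wiggliness penalty is itself allowed to vary
## smoothly along Year; no residual correlation structure here
m_ad <- gam(Temperature ~ s(Year, k = 30, bs = "ad"),
            data = gtemp, method = "REML")
plot(m_ad, shade = TRUE)</code></pre>
</figure>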
<p>
An instructive visualisation for the period of the purported pause or hiatus in global warming is to look at the shapes of the posterior simulations and the slopes of the trends for each year. I first look at the posterior simulations:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">sims2</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Temperature</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Simulation</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xlim</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1995</span><span class="p">,</span><span class="w"> </span><span class="m">2015</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylim</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0.2</span><span class="p">,</span><span class="w"> </span><span class="m">0.75</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning: Removed 8750 rows containing missing values (geom_path).</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-plot-simulations-pause-period-1.png" alt="50 random simulated trends drawn from the posterior distribution of the fitted model: 1995–2015" />
<figcaption>
50 random simulated trends drawn from the posterior distribution of the fitted model: 1995–2015
</figcaption>
</figcaption>
</figure>
<p>
Whilst the plot only shows 50 of the 10,000 posterior draws, it's pretty clear that, in these data at least, there is little or no support for the pause hypothesis; most of the posterior simulations are linearly increasing over the period of interest. Only one or two show a marked shallowing of the slope of the simulated trend through the period.
</p>
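<p>
A rough way to put a number on that impression, using the <code>sims2</code> object created above, is to compute the least-squares slope of each plotted draw over 1995–2015 and count how many are negative. This is only a crude summary of the posterior, of course.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## average slope of each simulated trend over the "pause" period
pause <- subset(sims2, Year >= 1995 & Year <= 2015)
slopes <- sapply(split(pause, pause$Simulation),
                 function(d) coef(lm(Temperature ~ Year, data = d))[2])
sum(slopes < 0)   # number of draws with a negative average slope</code></pre>
</figure>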
<p>
The first derivatives of the fitted trend can be used to determine where temperatures are increasing or decreasing. Using the standard error of the derivative or posterior simulation we can also say where the confidence interval on the derivative doesn't include 0, suggesting statistically significant change in temperature.
</p>
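<p>
The essential idea can be sketched in a few lines (a minimal sketch, assuming <code>m1</code> and <code>newd</code> from above; the object names are mine): approximate the derivative of the trend by finite differences of the linear predictor matrix, then push posterior draws of the model coefficients through that matrix.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("MASS") # for mvrnorm()
## evaluate the spline at Year and Year + eps, then difference
eps <- 1e-5
X0 <- predict(m1$gam, newdata = newd, type = "lpmatrix")
X1 <- predict(m1$gam, newdata = transform(newd, Year = Year + eps),
              type = "lpmatrix")
Xp <- (X1 - X0) / eps                  # maps coefficients to f'(Year)
## posterior simulation: draw coefficient vectors, map each to a derivative
set.seed(1)
betas <- mvrnorm(1000, coef(m1$gam), vcov(m1$gam))
deriv_sims <- Xp %*% t(betas)          # 200 x 1000 simulated derivatives</code></pre>
</figure>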
<p>
The code below uses some functions I wrote to compute the first derivatives of GAM(M) model terms via posterior simulation. I've <a href="/2014/06/16/simultaneous-confidence-intervals-for-derivatives/">written about</a> this method before, so I suggest you check out that post if any of this isn't clear.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">tmpf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="n">curl_download</span><span class="p">(</span><span class="s2">"https://gist.githubusercontent.com/gavinsimpson/ca18c9c789ef5237dbc6/raw/295fc5cf7366c831ab166efaee42093a80622fa8/derivSimulCI.R"</span><span class="p">,</span><span class="w"> </span><span class="n">tmpf</span><span class="p">)</span><span class="w">
</span><span class="n">source</span><span class="p">(</span><span class="n">tmpf</span><span class="p">)</span><span class="w">
</span><span class="n">fd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">derivSimulCI</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: MASS</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">CI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">fd</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="o">$</span><span class="n">simulations</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">quantile</span><span class="p">,</span><span class="w"> </span><span class="n">probs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">))</span><span class="w">
</span><span class="n">sigD</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">signifD</span><span class="p">(</span><span class="n">fd</span><span class="p">[[</span><span class="s2">"Year"</span><span class="p">]]</span><span class="o">$</span><span class="n">deriv</span><span class="p">,</span><span class="w"> </span><span class="n">fd</span><span class="p">[[</span><span class="s2">"Year"</span><span class="p">]]</span><span class="o">$</span><span class="n">deriv</span><span class="p">,</span><span class="w"> </span><span class="n">CI</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="n">CI</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w">
</span><span class="n">eval</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">newd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">newd</span><span class="p">,</span><span class="w">
</span><span class="n">derivative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fd</span><span class="p">[[</span><span class="s2">"Year"</span><span class="p">]]</span><span class="o">$</span><span class="n">deriv</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="c1"># computed first derivative</span><span class="w">
</span><span class="n">fdUpper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">CI</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="c1"># upper CI on first deriv</span><span class="w">
</span><span class="n">fdLower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">CI</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="c1"># lower CI on first deriv</span><span class="w">
</span><span class="n">increasing</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sigD</span><span class="o">$</span><span class="n">incr</span><span class="p">,</span><span class="w"> </span><span class="c1"># where is curve increasing?</span><span class="w">
</span><span class="n">decreasing</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sigD</span><span class="o">$</span><span class="n">decr</span><span class="p">)</span><span class="w"> </span><span class="c1"># ... or decreasing?</span></code></pre>
</figure>
<p>
A <strong>ggplot2</strong> version of the derivatives is produced using the code below. The two additional <code>geom_line()</code> calls add thick lines over sections of the derivative plot to illustrate those points where zero is <em>not</em> contained within the confidence interval of the first derivative.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">newd</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">derivative</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fdUpper</span><span class="p">,</span><span class="w"> </span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fdLower</span><span class="p">),</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"grey"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">increasing</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">decreasing</span><span class="p">),</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ylab</span><span class="p">(</span><span class="nf">expression</span><span class="p">(</span><span class="n">italic</span><span class="p">(</span><span class="n">hat</span><span class="p">(</span><span class="n">f</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="s2">"'"</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="n">Year</span><span class="p">)))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s2">"Year"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning: Removed 74 rows containing missing values (geom_path).</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Warning: Removed 190 rows containing missing values (geom_path).</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-plot-derivatives-1.png" alt="First derivative of the fitted trend plus 95% simultaneous confidence interval" />
<figcaption>
First derivative of the fitted trend plus 95% simultaneous confidence interval
</figcaption>
</figure>
<p>
Looking at this plot, despite the large (and expected) uncertainty in the derivative of the fitted trend towards the end of the observation period, the first derivatives of at least 95% of the 10,000 posterior simulations are all bounded well above zero. I'll take a closer look at this now, plotting kernel density estimates of the posterior distribution of first derivatives evaluated at each year for the period of interest.
</p>
<p>
First I generate another 10,000 simulations from the posterior of the fitted model, this time for each year in the interval 1998–2015. Then I do a little processing to get the derivatives into a format suitable for plotting with <strong>ggplot</strong>, and finally create kernel density estimate plots faceted by <code>Year</code>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">123</span><span class="p">)</span><span class="w">
</span><span class="n">nsim</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10000</span><span class="w">
</span><span class="n">pauseD</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">derivSimulCI</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nsim</span><span class="p">,</span><span class="w">
</span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">Year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1998</span><span class="p">,</span><span class="w"> </span><span class="m">2015</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)))</span><span class="w">
</span><span class="n">annSlopes</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">setNames</span><span class="p">(</span><span class="n">stack</span><span class="p">(</span><span class="n">setNames</span><span class="p">(</span><span class="n">data.frame</span><span class="p">(</span><span class="n">pauseD</span><span class="o">$</span><span class="n">Year</span><span class="o">$</span><span class="n">simulations</span><span class="p">),</span><span class="w">
</span><span class="n">paste0</span><span class="p">(</span><span class="s2">"sim"</span><span class="p">,</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">nsim</span><span class="p">)))),</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="s2">"Derivative"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Simulations"</span><span class="p">))</span><span class="w">
</span><span class="n">annSlopes</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">annSlopes</span><span class="p">,</span><span class="w"> </span><span class="n">Year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="m">1998</span><span class="p">,</span><span class="w"> </span><span class="m">2015</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nsim</span><span class="p">))</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">annSlopes</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Derivative</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"density"</span><span class="p">,</span><span class="w"> </span><span class="n">trim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">Year</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-derivatives-per-year-1.png" alt="Kernel density estimates of the first derivative of posterior simulations from the fitted trend model for selected years" />
<figcaption>
Kernel density estimates of the first derivative of posterior simulations from the fitted trend model for selected years
</figcaption>
</figure>
<p>
We can also look at the smallest derivative for each year over all of the 10,000 posterior simulations
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">minD</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">aggregate</span><span class="p">(</span><span class="n">Derivative</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">annSlopes</span><span class="p">,</span><span class="w"> </span><span class="n">FUN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">min</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">minD</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Derivative</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-min-sim-derivative-1.png" alt="Dotplot showing the minimum first derivative over 10,000 posterior simulations from the fitted additive model" />
<figcaption>
Dotplot showing the minimum first derivative over 10,000 posterior simulations from the fitted additive model
</figcaption>
</figure>
<p>
Only 4 of the 18 years have even a single simulation with a derivative less than 0. We can also plot all the kernel density estimates on the same plot to see if there is much variation between years (there doesn't appear to be much going on, judging from the previous figures).
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"viridis"</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">annSlopes</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Derivative</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">stat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"density"</span><span class="p">,</span><span class="w"> </span><span class="n">trim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">scale_color_viridis</span><span class="p">(</span><span class="n">option</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"magma"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"top"</span><span class="p">,</span><span class="w"> </span><span class="n">legend.key.width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">unit</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="s2">"cm"</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/additive-modeling-global-temperature-series-revisited-derivatives-single-panel-1.png" alt="Kernel density estimates of the first derivative of posterior simulations from the fitted trend model for selected years. The colour of each density estimate differentiates individual years" />
<figcaption>
Kernel density estimates of the first derivative of posterior simulations from the fitted trend model for selected years. The colour of each density estimate differentiates individual years
</figcaption>
</figure>
<p>
As anticipated, there's very little between-year shift in the slopes of the trends simulated from the posterior distribution of the model.
</p>
<p>
Returning to <span class="citation" data-cites="Cahill2015-tt">Cahill et al. (2015)</span> for a moment: the fitted trend from their Bayesian change point model is very similar to the fitted spline. There are some differences in the early part of the series; where their model has a single piecewise linear function through 1850–1900, the additive model suggests a small decrease in global temperatures leading up to 1900. Thereafter the models are very similar, with the exception that the smooth transitions between periods of increase are somewhat longer with the additive model than in the model of <span class="citation" data-cites="Cahill2015-tt">Cahill et al. (2015)</span>.
</p>
<h2 id="references" class="unnumbered">
References
</h2>
<div id="refs" class="references">
<div id="ref-Cahill2015-tt">
<p>
Cahill, N., Rahmstorf, S., and Parnell, A. C. (2015). Change points of global temperature. <em>Environmental research letters: ERL [Web site]</em> 10, 084002. doi:<a href="https://doi.org/10.1088/1748-9326/10/8/084002">10.1088/1748-9326/10/8/084002</a>.
</p>
</div>
<div id="ref-Morice2012-wk">
<p>
Morice, C. P., Kennedy, J. J., Rayner, N. A., and Jones, P. D. (2012). Quantifying uncertainties in global and regional temperature change using an ensemble of observational estimates: The HadCRUT4 data set. <em>J. Geophys. Res.</em> 117, D08101.
</p>
</div>
</div>
Better use of transfer functions?
Gavin L. Simpson
2015-12-16T00:00:00-06:00
2015-12-16T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/12/16/agu-transfer-functions/
<p>
Transfer functions have had a bit of a hard time of late following Steve Juggins' <span class="citation" data-cites="Juggins2013-dc">(2013)</span> convincing demonstration that 1) secondary gradients can influence your model, and 2) that variation down-core in a secondary variable can induce a signal in the thing being reconstructed. This was followed up by further comment on diatom-TP reconstructions <span class="citation" data-cites="Juggins2013-gf">(Juggins et al., 2013)</span>, and, not to be left out, chironomid transfer functions have come in for some heat, if the last (that I went to) IPS meeting was any indication. In a session at the 2015 Fall Meeting of the AGU, my interest was piqued by <a href="http://www.earth.northwestern.edu/~yarrow/">Yarrow Axford</a>'s talk using chironomid temperature reconstructions, but not for the reasons you might be thinking.
</p>
<p>
<a href="https://agu.confex.com/agu/fm15/meetingapp.cgi/Paper/70321">Yarrowās talk</a> covered her work on temperature reconstructions from lakes around Greenland. For some reasons that she didnāt go into the ice core records arenāt the ultimate decider of temperature trends in Greenland over the Holocene. Other temperature records are needed to better characterise variations in temperature over the last 10,000 years. Which is where the chironomids come inā¦
</p>
<p>
For those of you now expecting a rant about the abuse and misuse of transfer functions, well, sorry to disappoint. What interested me about Yarrow's talk was that she addressed upfront the potential for issues with transfer function reconstructions. This acceptance of the problems from people using transfer functions is something new to me<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> and it is a welcome development indeed.
</p>
<p>
Yarrow decided that she would only trust chironomid temperature reconstructions if they met three criteria:
</p>
<ol type="1">
<li>
that the core contains several species sensitive to temperature change, at warm and cold temperatures, and that the record wasn't dominated by aggregated taxa like <em>Tanytarsus</em> with consequently broad or no temperature sensitivity,
</li>
<li>
that, using independent training sets, the temperature reconstructions yielded the same general trend, and
</li>
<li>
that the potential for change in secondary gradients to be contained in the sediment record was minimal.
</li>
</ol>
<p>
Let's be clear: any additional thought that people put into assessing the quality of reconstructions is to be applauded. I'm not convinced by the merit or utility of all of Yarrow's rules, but this sort of thinking is refreshing.
</p>
<p>
Rule 1 seems odd to me. Perhaps this is because I don't know much about chironomids? It seems self-evident that reconstructions from assemblages dominated by non-sensitive taxa aren't to be trusted or might be subject to lots of noise or influence from secondary sources. It is also difficult to operationalise this rule; how high a proportion of the assemblage do we allow for these non-sensitive taxa before we worry about the reconstruction? I suspect this could be informed by some simulations similar to those used in Steve's Sick Science paper if someone wanted to do it.
</p>
<p>
Rule 2, for me, is the weakest. Yarrow used a NE North American chironomid-temperature data set as the main training set because of the geographical location of her lake sites, but used a separate training set of <a href="http://www.southampton.ac.uk/geography/about/staff/pgl.page">Pete Langdon</a>'s from Iceland as the independent data set. This Iceland data set used different taxonomic decisions and groupings, the idea being that if similar reconstructions were produced using it, we can have more confidence in the reconstruction. The problem with all this, however, is that reconstructions generated by independent training sets aren't independent, because they obviously use the same core assemblage data.
</p>
<p>
Transfer functions are largely just fancy filters of assemblage data. To generalise broadly: if the species composition changes we'll see a change in the reconstructed values, and the magnitude of this change in the reconstruction is determined by whether or not the species that are changing in abundance are important indicators, in the training set, for the variable of interest. This is where the real elephant in the transfer function room lives; no matter how carefully you build your training set, you are always at the mercy of whatever signals your lake recorded in the sediments. I'm getting ahead of myself, however.
</p>
<p>
As far as all this pertains to Yarrow's Rule 2, we must be careful not to think of these different reconstructions as being <strong>independent</strong>. We have only one record of compositional change, so we can't generate radically different reconstructions unless, that is, the training sets contain radically different species-environment relationships. I find it hard to believe that any training set from comparable environments will embed radically different species-environment relationships; organisms like chironomids just don't seem built that way.
</p>
<p>
So where does that leave Rule 2? I would say that if the reconstructions produced are qualitatively different (different trends, implications, …), that should set the alarm bells ringing. There's clearly something in the reconstruction that is sensitive to the sorts of taxonomic aggregations that differentiate the training sets.
</p>
<p>
But what if the reconstructions are qualitatively similar? I'm far from convinced that this should give any assurance that the reconstruction is any more reliable than before. It could just as easily be that any secondary gradients induce trends in the reconstructions in the same way in both training sets.
</p>
<p>
Which brings me to Yarrow's Rule 3. Just as we minimise, to the best of our ability, the secondary gradients in training sets, minimising the potential for secondary influences in the core record is just as important. Yarrow did this in her research by choosing lakes in catchments with no vegetation or soil to speak of; from the photo she showed of one of her sites, she nailed this one!
</p>
<p>
Development of soils and vegetation in catchments has profound effects on the lake ecosystem, especially on the forms and sources of nutrients and other compounds entering the lake. Such effects have logical consequences for the lake biota. Now, while the initial development of soil processes and vegetation in the Arctic at the end of the last glacial and the start of the Holocene was clearly temperature driven, if you are interested in temperature variation throughout the Holocene there are lots of things that might affect nutrient inputs from catchments, or modify the in-lake environment, that are <em>not</em> driven by temperature. In those circumstances, if your interest is in neoglacial cooling, the Medieval Warm Period, etc., interference from these secondary gradients can be a real problem.
</p>
<p>
What really impressed me about Yarrow's use of the transfer functions was that clearly a lot of thought had gone into site selection and how best to guard against the inherent problems in the methods. Perhaps I've been away from jobbing palaeolimnologists for too long<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a> but, quibbles about Rules 1 and 2 aside, this is welcome and long overdue attention that we need more of.
</p>
<h3 id="references" class="unnumbered">
References
</h3>
<div id="refs" class="references">
<div id="ref-Juggins2013-dc">
<p>
Juggins, S. (2013). Quantitative reconstructions in palaeolimnology: New paradigm or sick science? <em>Quaternary science reviews</em> 64, 20ā32. doi:<a href="https://doi.org/10.1016/j.quascirev.2012.12.014">10.1016/j.quascirev.2012.12.014</a>.
</p>
</div>
<div id="ref-Juggins2013-gf">
<p>
Juggins, S., John Anderson, N., Ramstack Hobbs, J. M., and Heathcote, A. J. (2013). Reconstructing epilimnetic total phosphorus using diatoms: Statistical and ecological constraints. <em>Journal of paleolimnology</em> 49, 373ā390. doi:<a href="https://doi.org/10.1007/s10933-013-9678-x">10.1007/s10933-013-9678-x</a>.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
Having not been at the recent IPS meeting in Lanzhou, I'm even further removed from the application of transfer functions these days. I was aware that there had been some movement on both sides to identify ways forward for people wanting to implement or create reconstructions, however.<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
<li id="fn2">
<p>
I've only been away from the ECRC for coming on three years!<a href="#fnref2" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
AGU Fall Meeting 2015
Gavin L. Simpson
2015-12-14T00:00:00-06:00
2015-12-14T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/12/14/agu-2015-poster/
<p>
My poster, <em>Rapid ecological change in lake ecosystems</em> (GC13G-1236), in the session <em>Sedimentary records of threshold change</em> (GC13G, Moscone South Poster Hall, 1340–1800, Monday 14th December), describes some of my recent research into methods to analyse palaeoenvironmental time series from sediment cores. Using data from a varved lake, Baldeggersee, Switzerland, I use location-scale generalised additive models to simultaneously model the mean (trend) and the variance of a time series of diatom counts. Wavelets were used to investigate further variation in species dynamics during the well-documented history of eutrophication at the lake.
</p>
<p>
Both of these techniques may be applied to data from less ideal situations, where observations are irregularly sampled in time and subject to varying sample intervals and degrees of time averaging.
</p>
<p>
A PDF of my poster can be downloaded from <a href="https://doi.org/10.6084/m9.figshare.2008245">Figshare</a>.
</p>
<div style="margin-left: auto; margin-right: auto; width: 700px; height: 601px;">
<p>
<iframe src="https://widgets.figshare.com/articles/2008245/embed?show_title=1" frameborder="0" width="700" height="601">
</iframe>
</p>
</div>
Are some seasons warming more than others?
Gavin L. Simpson
2015-11-23T00:00:00-06:00
2015-11-23T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/11/23/are-some-seasons-warming-more-than-others/
<p>
I ended the <a href="/2015/11/21/climate-change-and-spline-interactions/">last post</a> with some pretty plots of air temperature change within and between years in the <a href="http://www.metoffice.gov.uk/hadobs/hadcet/">Central England Temperature series</a>. The elephant in the room<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> at the end of that post was <em>is the change in the within-year (seasonal) effect over time statistically significant?</em> This is the question I'll try to answer, or at least show how to answer, now.
</p>
<p>
The model I fitted in the last post was
</p>
<p>
\[ y = \beta_0 + f(x_1, x_2) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2) \]
</p>
<p>
and allowed, as we saw, for the within-year spline/effect to vary smoothly with the trend or between-year effect. Answering our scientific question requires that we determine whether the spline interaction model (above) fits the data significantly better than the additive model
</p>
<p>
\[ y = \beta_0 + f_{\mathrm{seasonal}}(x_1) + f_{\mathrm{trend}}(x_2) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2) \]
</p>
<p>
which has a fixed seasonal effect.
</p>
<p>
The model we ended up with was the spline interaction with an AR(7) in the residuals. To catch you up, the chunk below loads the CET data and fits the model we were left with at the end of the <a href="/2015/11/21/climate-change-and-spline-interactions/">previous post</a>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: nlme
This is mgcv 1.8-9. For overview type 'help("mgcv-package")'.</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: methods</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">source</span><span class="p">(</span><span class="n">con</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">url</span><span class="p">(</span><span class="s2">"http://bit.ly/loadCET"</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"libcurl"</span><span class="p">))</span><span class="w">
</span><span class="n">close</span><span class="p">(</span><span class="n">con</span><span class="p">)</span><span class="w">
</span><span class="n">cet</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">loadCET</span><span class="p">()</span><span class="w">
</span><span class="c1">## need a list with gamm default for verbose output</span><span class="w">
</span><span class="n">ctrl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">niterEM</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">optimMethod</span><span class="o">=</span><span class="s2">"L-BFGS-B"</span><span class="p">,</span><span class="w"> </span><span class="n">maxIter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">msMaxIter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="c1">## knots - see previous post</span><span class="w">
</span><span class="n">knots</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">nMonth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">),</span><span class="w"> </span><span class="m">12.5</span><span class="p">))</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">te</span><span class="p">(</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"cr"</span><span class="p">,</span><span class="s2">"cc"</span><span class="p">),</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="m">12</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">,</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">,</span><span class="w">
</span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">7</span><span class="p">))</span></code></pre>
</figure>
<p>
To answer our question, we want to fit the following two pseudo-code models and compare them using a likelihood ratio test
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">x2</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">foo</span><span class="p">)</span><span class="w">
</span><span class="n">m0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x2</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">foo</span><span class="p">)</span><span class="w">
</span><span class="n">anova</span><span class="p">(</span><span class="n">m1</span><span class="p">,</span><span class="w"> </span><span class="n">m0</span><span class="p">)</span></code></pre>
</figure>
<p>
As is often the case in the real world, things aren't quite so simple; there are several issues we need to take care of if we are going to really be testing nested models and the smooth terms that we're interested in. Specifically, we need to
</p>
<ol type="1">
<li>
ensure that the models really are nested models,
</li>
<li>
fit using maximum likelihood (<code>method = "ML"</code>), not residual maximum likelihood (<code>method = "REML"</code>), because the two models have different <em>fixed</em> effects, and
</li>
<li>
fit the same AR(7) process in the residuals in both models.
</li>
</ol>
<p>
To compare additive models we really want to ensure that the fixed effects parts are properly nested and appropriate for an ANOVA-like decomposition into <em>main</em> effects and <em>interactions</em>. <strong>mgcv</strong> provides a very simple way to achieve this via a tensor product interaction smooth and the <code>ti()</code> function. <code>ti()</code> smooths are created in the same way as the <code>te()</code> smooth we encountered in the last post, but unlike <code>te()</code>, <code>ti()</code> smooths do <em>not</em> incorporate the main effects of the terms involved in the smooth. It is therefore assumed that you have included the main effect smooths in the model formula yourself.
</p>
<p>
Hence we can now fit models like
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x2</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">x2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ti</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">x2</span><span class="p">)</span></code></pre>
</figure>
<p>
and be certain that the <code>s(x1)</code> and <code>s(x2)</code> terms in each model are equivalent. Note that you can use <code>s()</code> or <code>ti()</code> for these main effects components; if you have a single variable involved in a <code>ti()</code> term you get the main effect, as the sketch below shows. I'm going to use <code>s()</code> in the code below, because I had better experience fitting the <code>gamm()</code> models we're using with <code>s()</code> rather than <code>ti()</code> main effects.
</p>
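<p>
To make that concrete, here is a minimal sketch of the two equivalent ways of writing the decomposed model, using hypothetical covariates <code>x1</code> and <code>x2</code>:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## a ti() smooth of a single variable is just that variable's main-effect
## smooth, so these two decompositions are equivalent
y ~ ti(x1) + ti(x2) + ti(x1, x2)
y ~ s(x1) + s(x2) + ti(x1, x2)</code></pre>
</figure>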
<p>
Fitting with maximum likelihood instead of residual maximum likelihood is just a simple matter of using <code>method = "ML"</code> in the <code>gamm()</code> call.
</p>
<p>
The last thing we need to fix before we proceed is making sure that the main effects model and the main effects plus interaction model both incorporate the same AR(7) process that we fitted originally and which we refitted here earlier as <code>m</code>. To achieve this, we need to supply the AR coefficients to <code>corARMA()</code> when fitting our decomposed models, and indicate that <code>gamm()</code> (well, the underlying <code>lme()</code> code) shouldn't try to estimate any of the parameters for the AR(7) process.
</p>
<p>
We can access the AR coefficients of <code>m</code> through the <code>intervals()</code> extractor function and a little bit of digging. In the chunk below I store the AR(7) coefficients in the object <code>phi</code>. Now, when fitting the <code>gamm()</code> models, we have to pass <code>value = phi, fixed = TRUE</code> to the <code>corARMA()</code> part of the model call to have it use the supplied coefficients instead of estimating a new set.
</p>
<p>
We are now ready to fit our two models to test whether the interaction smooth is required
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">phi</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unname</span><span class="p">(</span><span class="n">intervals</span><span class="p">(</span><span class="n">m</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">which</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"var-cov"</span><span class="p">)</span><span class="o">$</span><span class="n">corStruct</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="n">m1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cr"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cc"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ti</span><span class="p">(</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"cr"</span><span class="p">,</span><span class="s2">"cc"</span><span class="p">),</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">12</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ML"</span><span class="p">,</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">,</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">,</span><span class="w">
</span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">fixed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">7</span><span class="p">))</span><span class="w">
</span><span class="n">m0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cr"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cc"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"ML"</span><span class="p">,</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">,</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">,</span><span class="w">
</span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">fixed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">7</span><span class="p">))</span></code></pre>
</figure>
<p>
The <code>anova()</code> method is used to compare the fitted models
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">anova</span><span class="p">(</span><span class="n">m0</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m1</span><span class="o">$</span><span class="n">lme</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> Model df AIC BIC logLik Test L.Ratio p-value
m0$lme 1 5 14750.9 14782.70 -7370.449
m1$lme 2 7 14706.0 14750.52 -7346.001 1 vs 2 48.89479 <.0001</code></pre>
</figure>
<p>
There is clear support for <code>m1</code>, the model that allows the seasonal smooth to vary as a smooth function of the trend, over the model with purely additive effects.
</p>
<p>
What does our model say about the change in monthly temperature over the past century? Below I simply predict the temperature for each month in 1914 and 2014 and then compute the difference between years.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">cet</span><span class="p">,</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">Year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1914</span><span class="p">,</span><span class="w"> </span><span class="m">2014</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">),</span><span class="w">
</span><span class="n">nMonth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)))</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="o">$</span><span class="n">gam</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdat</span><span class="p">)</span><span class="w">
</span><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">fitted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">fYear</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.factor</span><span class="p">(</span><span class="n">Year</span><span class="p">))</span><span class="w">
</span><span class="n">dif</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">Month</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w">
</span><span class="n">Difference</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="p">[</span><span class="n">Year</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">2014</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">fitted</span><span class="p">[</span><span class="n">Year</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1914</span><span class="p">]))</span></code></pre>
</figure>
<p>
A plot of the temperature differences<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a> is shown below, produced by the following code
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ggplot</span><span class="p">(</span><span class="n">dif</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Difference</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Month</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">expression</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">difference</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">degree</span><span class="o">*</span><span class="n">C</span><span class="p">),</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Month"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># minimal theme</span><span class="w">
</span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="c1"># tweak where the x-axis ticks are</span><span class="w">
</span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">month.abb</span><span class="p">,</span><span class="w"> </span><span class="c1"># & with what labels</span><span class="w">
</span><span class="n">minor_breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1.2</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.1</span><span class="p">),</span><span class="w">
</span><span class="n">minor_breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/are-some-seasons-warming-more-than-others-plot-1-1.png" alt="Difference in monthly temperature predictions between 1914 and 2014" />
<figcaption>
Difference in monthly temperature predictions between 1914 and 2014
</figcaption>
</figure>
<p>
Most months have seen at least a ~0.5°C increase in mean temperature between 1914 and 2014, with October and November both experiencing over a degree of warming over the period.
</p>
<p>
Before I finish, it is instructive to look at what the <code>ti()</code> term in the decomposed model looks like and represents
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">layout</span><span class="p">(</span><span class="n">matrix</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w">
</span><span class="n">op</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">par</span><span class="p">(</span><span class="n">mar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">m1</span><span class="o">$</span><span class="n">gam</span><span class="p">,</span><span class="w"> </span><span class="n">pers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">op</span><span class="p">)</span><span class="w">
</span><span class="n">layout</span><span class="p">(</span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/are-some-seasons-warming-more-than-others-plot-smooths-1.png" alt="Smooths for the spline interaction model including a tensor product interaction smooth" />
<figcaption>
Smooths for the spline interaction model including a tensor product interaction smooth
</figcaption>
</figure>
<p>
The first two terms are the overall trend and seasonal cycle respectively. The third term, shown as a perspective plot, is the tensor product interaction term. This term reflects the amount by which the fitted temperature is adjusted from the overall trend and seasonal cycle for any combination of month and year.
</p>
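<p>
If you want those adjustments as numbers rather than as a picture, the contribution of each smooth to the fitted values can be extracted via <code>predict()</code> with <code>type = "terms"</code>. A minimal sketch, reusing the <code>pdat</code> prediction data from earlier; the exact column label for the interaction term is an assumption, so check <code>colnames(trms)</code> first:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## per-term contributions to the fitted values; one column per model term
trms <- predict(m1$gam, newdata = pdat, type = "terms")
colnames(trms)                    # check the exact term labels
## the ti() column is the adjustment over and above trend + seasonal cycle
adj <- trms[, "ti(Year,nMonth)"]</code></pre>
</figure>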
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
well, one of the elephants; I also wasn't happy with the AR(7) for the residuals<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
<li id="fn2">
<p>
If I was being more thorough, I could use the prediction matrix feature of <code>gam()</code> models to put approximate confidence intervals on these differences, as sketched below.<a href="#fnref2" class="footnote-back">↩</a>
</p>
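<p>
A minimal sketch of that approach, assuming the <code>m</code> and <code>pdat</code> objects from the main text:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Xp maps the model coefficients to predictions at the rows of pdat
Xp <- predict(m$gam, newdata = pdat, type = "lpmatrix")
## differencing the rows for 2014 and 1914 gives differences of fitted values
Xdif <- Xp[pdat$Year == 2014, ] - Xp[pdat$Year == 1914, ]
diffs <- drop(Xdif %*% coef(m$gam))                  # the monthly differences
se <- sqrt(rowSums((Xdif %*% vcov(m$gam)) * Xdif))   # their standard errors
upr <- diffs + qnorm(0.975) * se                     # approximate 95% interval
lwr <- diffs - qnorm(0.975) * se</code></pre>
</figure>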
</li>
</ol>
</section>
Climate change and spline interactions
Gavin L. Simpson
2015-11-21T00:00:00-06:00
2015-11-21T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/11/21/climate-change-and-spline-interactions/
<p>
In a series of irregular posts<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> I've looked at how additive models can be used to fit non-linear models to time series. Up to now I've looked at models that included a single non-linear trend, as well as a model that included a within-year (or seasonal) part and a trend part. In this trend <em>plus</em> season model it is important to note that the two terms are purely additive; no matter which January you are predicting for in a long time series, the seasonal effect for that month will always be the same. The trend part might shift this seasonal contribution up or down a bit, but all Januarys are the same. In this post I want to introduce a different type of spline interaction model that will allow us to relax this additivity assumption and fit a model that allows the seasonal part of the model to change in time along with the trend.
</p>
<p>
As with previous posts, I'll be using the Central England Temperature time series as an example. The data require a bit of processing to get them into a format useful for modelling, so I've written a <a href="https://gist.github.com/gavinsimpson/526ae3e1b02d333d85e4">little function</a>, <code>loadCET()</code>, that downloads the data and processes it for you. To load the function into R, run the following
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">source</span><span class="p">(</span><span class="n">con</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">url</span><span class="p">(</span><span class="s2">"http://bit.ly/loadCET"</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"libcurl"</span><span class="p">))</span><span class="w">
</span><span class="n">close</span><span class="p">(</span><span class="n">con</span><span class="p">)</span><span class="w">
</span><span class="n">cet</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">loadCET</span><span class="p">()</span></code></pre>
</figure>
<p>
We also need a couple of packages for model fitting and plotting
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: nlme
This is mgcv 1.8-9. For overview type 'help("mgcv-package")'.</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"ggplot2"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: methods</code></pre>
</figure>
<p>
OK, let's begin…
</p>
<p>
As previously, if we think about a time series where observations were made on a number of occasions within any given year over a number of years, we may want to model the following features of the data
</p>
<ol type="1">
<li>
any trend or long term change in the level of the time series, and
</li>
<li>
any seasonal or within-year variation, and
</li>
<li>
any variation in, or interaction between, the trend and seasonal features of the data.
</li>
</ol>
<p>
In a <a href="/2014/05/09/modelling-seasonal-data-with-gam/">previous post</a> I tackled features <em>1</em> and <em>2</em>, but it is feature <em>3</em> that is of interest now. Our model for features <em>1</em> and <em>2</em> was
</p>
<p>
\[ y = \beta_0 + f_{\mathrm{seasonal}}(x_1) + f_{\mathrm{trend}}(x_2) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2) \]
</p>
<p>
where \(\beta_0\) is the intercept, \(f_{\mathrm{seasonal}}\) and \(f_{\mathrm{trend}}\) are smooth functions for the seasonal and trend features we're interested in, and \(x_1\) and \(x_2\) are two covariates providing some form of time indicator for the within-year and between-year times.
</p>
<p>
To allow for an interaction between \(f_{\mathrm{seasonal}}\) and \(f_{\mathrm{trend}}\) we will need to fit the following model instead
</p>
<p>
\[ y = \beta_0 + f(x_1, x_2) + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2) \]
</p>
<p>
Notice now that \(f(x_1, x_2)\) is a smooth function of our two time variables, and for simplicity's sake let's say that the within-year variable will just be the numeric month indicator (1, 2, …, 12) and the between-year variable will be the calendar year of the observation. In previous posts I've used a derived time variable instead of calendar year for the trend, but doing that here is largely redundant; the data seem well modelled even if we don't allow for a trend within-year, and doing some useful or interesting things with the model once fitted is much simplified if we just use observation year for the trend.
</p>
<p>
In pseudo <strong>mgcv</strong> code we are going to fit the following model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">mod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gam</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">te</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span><span class="w"> </span><span class="n">x2</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">foo</span><span class="p">)</span></code></pre>
</figure>
<p>
The <code>te()</code> represents a tensor product smooth of the indicated variables. We won't be using <code>s()</code> because our two time variables are unrelated, and we want to allow for more variation in one of the variables than the other; multivariate <code>s()</code> smooths are isotropic, so they're good for things like spatial coordinates but not for things measured in different units or having more variation in one variable than the other. I'm not going to go into the detail of tensor product smooths; that's covered in Simon Wood's <a href="https://www.crcpress.com/Generalized-Additive-Models-An-Introduction-with-R/Wood/9781584884743">rather excellent book</a>.
</p>
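<p>
To make that distinction concrete, here is a minimal sketch, using the same hypothetical <code>y</code>, <code>x1</code>, <code>x2</code>, and data frame <code>foo</code> as in the pseudo-code above, contrasting the isotropic and tensor product forms; only the second is appropriate when the covariates are on unrelated scales
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("mgcv")
## isotropic smooth: one wiggliness penalty shared by both covariates;
## only sensible if x1 and x2 are on the same scale (e.g. spatial coordinates)
m_iso <- gam(y ~ s(x1, x2), data = foo)
## tensor product smooth: separate marginal bases and penalties,
## so each covariate can have its own amount of wiggliness
m_te  <- gam(y ~ te(x1, x2), data = foo)</code></pre>
</figure>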
<p>
Another detail that we need to consider is knot placement. Previously I used a cyclic spline for the within-year term and allowed <code>gam()</code> to select the knots for the spline from the data. This meant that the boundary knots were at months 1 and 12. That worked OK when I was modelling daily data, with the within-year term in Julian day: the boundary knots were at days 1 and 366 and it didn't matter much whether December 31<sup>st</sup> was <em>exactly</em> the same as January 1<sup>st</sup>. But with monthly data like this it is a bit of a problem; we don't expect December and January to be <em>exactly</em> the same. This problem was <a href="http://www.fromthebottomoftheheap.net/2014/05/09/modelling-seasonal-data-with-gam/#comment-1964880067">anticipated</a> in the comments of the previous post by a reader, and I sort of dismissed it. Well, I was wrong, and it took me until I set about interrogating the model that I'll fit shortly to realise it.
</p>
<p>
What we need to do is place boundary knots just beyond the data, such that the distance between December and January is the same as the distance between any other month. Placing boundary knots at (0.5, 12.5) achieves this. We then have 10 more interior knots to play with (assuming 12 knots overall, which is what I specify for <code>k</code> below), so I just place those, spread evenly between 1 and 12 (the inner <code>seq()</code> call).
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">knots</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">nMonth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">),</span><span class="w"> </span><span class="m">12.5</span><span class="p">))</span></code></pre>
</figure>
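<p>
As a quick sanity check (nothing below depends on this), we can confirm that the knot vector has the 12 values needed to match <code>k = 12</code> and that the boundary knots sit half a month beyond the data
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">length(knots$nMonth)  # 12, matching k = 12 for the cyclic smooth
range(knots$nMonth)   # 0.5 12.5, the boundary knots</code></pre>
</figure>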
<p>
Having dealt with those details, we can fit some models; here I fit models with the same fixed effects parts (the spline interaction) but with differing stochastic trend models in the residuals.
</p>
<p>
To assist our selection of the stochastic model in the residuals, we fit a naive model that assumes independence of observations
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">te</span><span class="p">(</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"cr"</span><span class="p">,</span><span class="s2">"cc"</span><span class="p">),</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="m">12</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">)</span></code></pre>
</figure>
<p>
Plotting the autocorrelation function (ACF) of the normalized residuals from the <code>$lme</code> part of this model fit, we can start to think about plausible models for the residuals. Remember though that we are going to nest this within year, so we're only going to be able to do anything about the first 12 lags, even though I'll still show the default number of lags.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">plot(acf(resid(m0$lme, type = "normalized")))</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/climate-change-and-spline-interactions-plot-naive-acf-1.png" alt="ACF for model m0 a naive additive model assuming conditional independence of observations fitted to the CET time series" />
<figcaption>
ACF for model <code>m0</code>, a naive additive model assuming conditional independence of observations, fitted to the CET time series
</figcaption>
</figure>
<p>
In the ACF we see lingering correlations out to lags 7 or 8, and then longer-range correlations at lags beyond a year. These latter correlations are the between-year temporal signal that we aren't capturing perfectly with the temporal trend component of the model fit. We're going to ignore these, for now at least; I may return to look at them in a future post.
</p>
<p>
From the ACF (and a bit of fiddling, err… EDA) it looks like AR terms are needed to model this residual autocorrelation. Hence the stochastic trend models are AR(<em>p</em>), for <em>p</em> in {1, 2, …, 8}. The ARMA is nested within year, as previously; with the switch to using calendar year for the trend term, I would anticipate stronger within-year autocorrelation in the residuals, or possibly a more complex structure, than observed in earlier fits<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>.
</p>
<p>
If you want to fit all the models, great; I'll get to you in a moment, just don't look at the value of <code>p</code> in the chunk below! If you just want to skip ahead, fit the following model and then move right along to the <a href="#nextsection">next section</a>, thus saving yourself in the region of 10 minutes (on a fast-as-hell Xeon workstation) of thumb twiddling
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ctrl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">niterEM</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">optimMethod</span><span class="o">=</span><span class="s2">"L-BFGS-B"</span><span class="p">,</span><span class="w"> </span><span class="n">maxIter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">msMaxIter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">te</span><span class="p">(</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"cr"</span><span class="p">,</span><span class="s2">"cc"</span><span class="p">),</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="m">12</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">,</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">,</span><span class="w">
</span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">7</span><span class="p">))</span></code></pre>
</figure>
<p>
For those of you in for the long haul, hereās a loop<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a> that will fit the models with varying AR terms for us
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ctrl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">niterEM</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">optimMethod</span><span class="o">=</span><span class="s2">"L-BFGS-B"</span><span class="p">,</span><span class="w"> </span><span class="n">maxIter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">msMaxIter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">8</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">te</span><span class="p">(</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"cr"</span><span class="p">,</span><span class="s2">"cc"</span><span class="p">),</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="m">12</span><span class="p">)),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"REML"</span><span class="p">,</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">,</span><span class="w"> </span><span class="n">knots</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">knots</span><span class="p">,</span><span class="w">
</span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">))</span><span class="w">
</span><span class="n">assign</span><span class="p">(</span><span class="n">paste0</span><span class="p">(</span><span class="s2">"m"</span><span class="p">,</span><span class="w"> </span><span class="n">i</span><span class="p">),</span><span class="w"> </span><span class="n">m</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
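<p>
If you would rather not litter the workspace with <code>assign()</code>, here is an equivalent sketch that collects the fits in a list instead (the list name <code>mods</code> is mine); it fits exactly the same eight models
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## store the AR(p) fits in a named list, p = 1, ..., 8
mods <- lapply(1:8, function(p) {
    gamm(Temperature ~ te(Year, nMonth, bs = c("cr","cc"), k = c(10,12)),
         data = cet, method = "REML", control = ctrl, knots = knots,
         correlation = corARMA(form = ~ 1 | Year, p = p))
})
names(mods) <- paste0("m", 1:8)</code></pre>
</figure>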
<p>
A generalised likelihood ratio test can be used to assess which correlation structure fits best
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">anova</span><span class="p">(</span><span class="n">m1</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m2</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m3</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m4</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m5</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m6</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m7</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">m8</span><span class="o">$</span><span class="n">lme</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> Model df AIC BIC logLik Test L.Ratio p-value
m1$lme 1 6 14849.98 14888.13 -7418.988
m2$lme 2 7 14836.78 14881.29 -7411.389 1 vs 2 15.197206 0.0001
m3$lme 3 8 14810.73 14861.60 -7397.365 2 vs 3 28.047345 <.0001
m4$lme 4 9 14784.63 14841.86 -7383.314 3 vs 4 28.101617 <.0001
m5$lme 5 10 14778.35 14841.95 -7379.177 4 vs 5 8.275739 0.0040
m6$lme 6 11 14776.49 14846.44 -7377.244 5 vs 6 3.865917 0.0493
m7$lme 7 12 14762.45 14838.77 -7369.227 6 vs 7 16.032363 0.0001
m8$lme 8 13 14764.33 14847.01 -7369.167 7 vs 8 0.119909 0.7291</code></pre>
</figure>
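<p>
The <code>anova()</code> output already tabulates AIC and BIC, but if you collected the fits in a list as sketched above you could also extract an information criterion directly, for example
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## AIC for each candidate correlation structure; lower is better
sapply(mods, function(fit) AIC(fit$lme))</code></pre>
</figure>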
<p>
Lo and behold, the AR(7) turns out to have the best fit as assessed by a range of metrics. If we now look at the ACF of the normalized residuals for this model we see that all the within-year autocorrelation has been accounted for, leaving a little bit of correlation at lags just longer than a year.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">acf</span><span class="p">(</span><span class="n">resid</span><span class="p">(</span><span class="n">m7</span><span class="o">$</span><span class="n">lme</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"normalized"</span><span class="p">)))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/climate-change-and-spline-interactions-plot-best-acf-1.png" alt="ACF for model m7 an additive model with an AR(7) process in the residuals fitted to the CET time series" />
<figcaption>
ACF for model <code>m7</code>, an additive model with an AR(7) process in the residuals, fitted to the CET time series
</figcaption>
</figure>
<p>
At this stage we can probably proceed without too much worry ā although an AR(7) is quite a complex model to fit, so we should remain a little cautious.
</p>
<p>
Before we move on, to bring us up to speed with the people that jumped ahead, copy <code>m7</code> into object <code>m</code> so the code in the next section works for you too.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">m7</span></code></pre>
</figure>
<h2 id="nextsection">
Interrogating the fitted model
</h2>
<p>
Iām going to cut to the chase and look at the fitted model and use it to ask some questions about how temperature has changed both within and between years over the last 100 years. In part 2 of this post Iāll look at doing inference on the fitted model, but for now Iāll skip that.
</p>
<p>
First, letās visualise the fitted spline; this requires a 3D plot so it gets somewhat tricky to really see whatās going on, but here goes
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">m</span><span class="o">$</span><span class="n">gam</span><span class="p">,</span><span class="w"> </span><span class="n">pers</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/climate-change-and-spline-interactions-plot-gam-1.png" alt="Fitted bivariate spline" />
<figcaption>
Fitted bivariate spline
</figcaption>
</figure>
<p>
This is quite a useful visualisation as it illustrates how the model represents longer-term trends, seasonal cycles, and how these vary in relation to one another. Viewed one way, we have estimates of trends over years for each month. Alternatively, we could see the model as giving an estimate of the seasonal cycle for each year. Each year can have a different seasonal cycle and each month a different trend. If there were no interaction, there would be no change in the seasonal pattern over time; equivalently, all months would have the same trend over years. This figure also sucks; it's 3D but static, and the scale of the trend, and of any change in the seasonal cycle over time, is swamped by the magnitude of the seasonal cycle itself.
</p>
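<p>
One way around the static-3D problem, though I make no claims about how pretty the result is, is to draw the same surface as a contour plot with <strong>mgcv</strong>'s <code>vis.gam()</code>; a minimal sketch
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## contour view of the fitted Year-by-Month surface
vis.gam(m$gam, view = c("Year", "nMonth"), plot.type = "contour",
        color = "topo", too.far = 0.1)  # too.far blanks grid cells far from data</code></pre>
</figure>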
<h3 id="predict-monthly-temperature-for-the-years-1914-and-2014">
Predict monthly temperature for the years 1914 and 2014
</h3>
<p>
In the first illustrative use of the fitted model, I'll predict within-year temperatures for two years, 1914 and 2014, to look at how different the seasonal cycle is after 100 years<a href="#fn4" class="footnote-ref" id="fnref4"><sup>4</sup></a> of climate change (time). The first step is to produce the values of the covariates that we want to predict at. In the snippet below I generate 100 <code>1914</code>s followed by 100 <code>2014</code>s for <code>Year</code>, and within these years we have 100 evenly-spaced values on the interval (1,12) for <code>nMonth</code>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">cet</span><span class="p">,</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">Year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1914</span><span class="p">,</span><span class="w"> </span><span class="m">2014</span><span class="p">),</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">),</span><span class="w">
</span><span class="n">nMonth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">),</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)))</span></code></pre>
</figure>
<p>
Next, the <code>predict()</code> method generates predicted values for the new data pairs, with standard errors for each predicted value
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="o">$</span><span class="n">gam</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">crit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">qt</span><span class="p">(</span><span class="m">0.975</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df.residual</span><span class="p">(</span><span class="n">m</span><span class="o">$</span><span class="n">gam</span><span class="p">))</span><span class="w"> </span><span class="c1"># ~95% interval critical t</span><span class="w">
</span><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">fitted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="n">se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred</span><span class="o">$</span><span class="n">se.fit</span><span class="p">,</span><span class="w"> </span><span class="n">fYear</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.factor</span><span class="p">(</span><span class="n">Year</span><span class="p">))</span><span class="w">
</span><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w">
</span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w">
</span><span class="n">lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">))</span></code></pre>
</figure>
<p>
The first <code>transform()</code> adds <code>fitted</code>, <code>se</code>, and <code>fYear</code> variables to <code>pdat</code> for the predictions, their standard errors, and a factor for <code>Year</code> that Iāll use in plotting shortly. The second <code>transform()</code> call adds <code>upper</code> and <code>lower</code> variables containing the upper and lower <em>pointwise</em> confidence bounds, here for an approximate 95% interval.
</p>
<p>
A plot, using the <strong>ggplot2</strong> package, of the predicted monthly temperatures for 1914 and 2014 is created in the next chunk. It's a little involved, as I wanted to modify a few things and change the name of the legend to make it look nice; I've commented the lines to indicate what they do
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fYear</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_ribbon</span><span class="p">(</span><span class="n">mapping</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">ymin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lower</span><span class="p">,</span><span class="w"> </span><span class="n">ymax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">upper</span><span class="p">,</span><span class="w">
</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fYear</span><span class="p">),</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># confidence band</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fYear</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># predicted temperatures</span><span class="w">
</span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># minimal theme</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"top"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># push legend to the top</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">expression</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="p">(</span><span class="n">degree</span><span class="o">*</span><span class="n">C</span><span class="p">)),</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_fill_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Year"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># correct legend name</span><span class="w">
</span><span class="n">scale_colour_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Year"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_x_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="c1"># tweak where the x-axis ticks are</span><span class="w">
</span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">month.abb</span><span class="p">,</span><span class="w"> </span><span class="c1"># & with what labels</span><span class="w">
</span><span class="n">minor_breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w">
</span><span class="n">p1</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/climate-change-and-spline-interactions-predict-plot-1-1.png" alt="Predicted monthly temperature for 1914 and 2014" />
<figcaption>
Predicted monthly temperature for 1914 and 2014
</figcaption>
</figure>
<p>
Looking at the plot, most of the action appears in the autumn and winter months.
</p>
<h3 id="predict-trends-for-each-month-19142014">
Predict trends for each month, 1914ā2014
</h3>
<p>
The second use of the fitted model will be to predict trends in temperature for each month over the period 1914–2014. For this we need a different set of new values to predict at than before; here I repeat the values 1914–2014 twelve times each and the sequence 1, 2, …, 12 a total of 101 times, once per year of the period of interest.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pdat2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">cet</span><span class="p">,</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">Year</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1914</span><span class="o">:</span><span class="m">2014</span><span class="p">,</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">),</span><span class="w">
</span><span class="n">nMonth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">101</span><span class="p">)))</span></code></pre>
</figure>
<p>
Next we repeat the earlier steps to predict from the model and set up an object for plotting with <code>ggplot()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pred2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="o">$</span><span class="n">gam</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdat2</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1">## add predictions & SEs to the new data ready for plotting</span><span class="w">
</span><span class="n">pdat2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdat2</span><span class="p">,</span><span class="w">
</span><span class="n">fitted</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred2</span><span class="o">$</span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="c1"># predicted values</span><span class="w">
</span><span class="n">se</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pred2</span><span class="o">$</span><span class="n">se.fit</span><span class="p">,</span><span class="w"> </span><span class="c1"># standard errors</span><span class="w">
</span><span class="n">fMonth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="nb">month.abb</span><span class="p">[</span><span class="n">nMonth</span><span class="p">],</span><span class="w"> </span><span class="c1"># month as a factor</span><span class="w">
</span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">month.abb</span><span class="p">))</span><span class="w">
</span><span class="n">pdat2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdat2</span><span class="p">,</span><span class="w">
</span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">),</span><span class="w"> </span><span class="c1"># upper and...</span><span class="w">
</span><span class="n">lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">crit</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se</span><span class="p">))</span><span class="w"> </span><span class="c1"># lower confidence bounds</span></code></pre>
</figure>
<p>
The first plot weāll produce using these data is a plot of the trends faceted by <code>fMonth</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">pdat2</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fMonth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fMonth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># draw trend lines</span><span class="w">
</span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># minimal theme</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"none"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># no legend</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">expression</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="p">(</span><span class="n">degree</span><span class="o">*</span><span class="n">C</span><span class="p">)),</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="w"> </span><span class="n">fMonth</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># facet on month</span><span class="w">
</span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">17</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">minor_breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="c1"># nicer ticks</span><span class="w">
</span><span class="n">p2</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/climate-change-and-spline-interactions-predict-plot-2-1.png" alt="Predicted trends in monthly temperature, 1914ā2014." />
<figcaption>
Predicted trends in monthly temperature, 1914ā2014.
</figcaption>
</figure>
<p>
The impression that most of the action is in the autumn and winter is again very apparent.
</p>
<h3 id="predict-trends-for-each-month-19142014-by-quarter">
Predict trends for each month, 1914ā2014, by quarter
</h3>
<p>
Another visualisation of the same predictions is to group the data by quarter/season. For that we set up a variable <code>Quarter</code> in the <code>pdat2</code> data frame and assign particular months to the seasons.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">pdat2</span><span class="o">$</span><span class="n">Quarter</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="kc">NA</span><span class="w">
</span><span class="n">pdat2</span><span class="o">$</span><span class="n">Quarter</span><span class="p">[</span><span class="n">pdat2</span><span class="o">$</span><span class="n">nMonth</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">12</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">)]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"Winter"</span><span class="w">
</span><span class="n">pdat2</span><span class="o">$</span><span class="n">Quarter</span><span class="p">[</span><span class="n">pdat2</span><span class="o">$</span><span class="n">nMonth</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="m">3</span><span class="o">:</span><span class="m">5</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"Spring"</span><span class="w">
</span><span class="n">pdat2</span><span class="o">$</span><span class="n">Quarter</span><span class="p">[</span><span class="n">pdat2</span><span class="o">$</span><span class="n">nMonth</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="m">6</span><span class="o">:</span><span class="m">8</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"Summer"</span><span class="w">
</span><span class="n">pdat2</span><span class="o">$</span><span class="n">Quarter</span><span class="p">[</span><span class="n">pdat2</span><span class="o">$</span><span class="n">nMonth</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="m">9</span><span class="o">:</span><span class="m">11</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"Autumn"</span><span class="w">
</span><span class="n">pdat2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdat2</span><span class="p">,</span><span class="w">
</span><span class="n">Quarter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">Quarter</span><span class="p">,</span><span class="w">
</span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Spring"</span><span class="p">,</span><span class="s2">"Summer"</span><span class="p">,</span><span class="s2">"Autumn"</span><span class="p">,</span><span class="s2">"Winter"</span><span class="p">)))</span></code></pre>
</figure>
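<p>
A more compact way to achieve the same mapping, assuming the month-to-season assignment above, is to index a lookup vector (which I've called <code>seasons</code>) with <code>nMonth</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## season for months 1 through 12; indexing by nMonth does the assignment
seasons <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
             "Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter")
pdat2 <- transform(pdat2,
                   Quarter = factor(seasons[nMonth],
                                    levels = c("Spring", "Summer", "Autumn", "Winter")))</code></pre>
</figure>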
<p>
Then we facet on <code>Quarter</code>; as we now need a legend to help identify the months, we also do a little fiddling to get a nice name for it
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">pdat2</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fitted</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fMonth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fMonth</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># draw trend lines</span><span class="w">
</span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># minimal theme</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">legend.position</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"top"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># legend on top</span><span class="w">
</span><span class="n">scale_fill_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Month"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="c1"># nicer legend title</span><span class="w">
</span><span class="n">scale_colour_discrete</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Month"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">expression</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="p">(</span><span class="n">degree</span><span class="o">*</span><span class="n">C</span><span class="p">)),</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">NULL</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_grid</span><span class="p">(</span><span class="n">Quarter</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="n">scales</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"free_y"</span><span class="p">)</span><span class="w"> </span><span class="c1"># facet by Quarter</span><span class="w">
</span><span class="n">p3</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/climate-change-and-spline-interactions-predict-plot-3-1.png" alt="Predicted trends in monthly temperature, 1914ā2014, by quarter." />
<figcaption>
Predicted trends in monthly temperature, 1914ā2014, by quarter.
</figcaption>
</figure>
<h2 id="summary">
Summary
</h2>
<p>
In this post I've looked at how we can fit smooth models with smooth interactions between two variables. This allows the smooth effect of one variable to vary as a smooth function of the second variable. This approach can be extended to additional variables as needed.
</p>
<p>
One of the things I'm not very happy with is the rather complex AR process in the model residuals. The AR(7) mopped up all the within-year residual autocorrelation, but it appears that there is a trade-off here between fitting a more complex seasonal smooth and fitting a more complex within-year AR process.
</p>
<p>
An important aspect that I havenāt covered in this post is whether the interaction model is an improvement in fit over a purely additive model of a trend in temperature with the same seasonal cycle superimposed. Iāll look at how we can do that in part 2.
</p>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
<a href="/2011/06/12/additive-modelling-and-the-hadcrut3v-global-mean-temperature-series/">here</a>, <a href="/2011/07/21/smoothing-temporally-correlated-data/">here</a>, and <a href="/2014/05/09/modelling-seasonal-data-with-gam/">here</a><a href="#fnref1" class="footnote-back">ā©</a>
</p>
</li>
<li id="fn2">
<p>
Note that this code assumes that samples are provided in the data in their time order <em>within</em> year. This is the case here, but if it isnāt, you could do <code>form = ~ nMonth | Year</code> to tell <code>gamm()</code> about the correct ordering.<a href="#fnref2" class="footnote-back">ā©</a>
</p>
</li>
<li id="fn3">
<p>
Iām just being lazy; I could fit these models in parallel with the <strong>parallel</strong> package, but Iām caching this code chunk so, mehā¦<a href="#fnref3" class="footnote-back">ā©</a>
</p>
</li>
<li id="fn4">
<p>
Yes, yes, yes, I know itās 101 yearsā¦<a href="#fnref4" class="footnote-back">ā©</a>
</p>
</li>
</ol>
</section>
User-friendly scaling
Gavin L. Simpson
2015-10-08T00:00:00-06:00
2015-10-08T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/10/08/user-friendly-scaling/
<p>
Back in the mists of time, whilst programming early versions of Canoco, Cajo ter Braak decided to allow users to specify how species and site ordination scores were scaled relative to one another via a simple numeric coding system. This was fine for the DOS-based software that Canoco was at the time; you entered <code>2</code> when prompted and you got <em>species</em> scaling, <code>-1</code> got you <em>site</em> or <em>sample</em> scaling <strong>and</strong> Hillās scaling or correlation-based scores depending on whether your ordination was a linear or unimodal method. This system persisted; even in the Windows era of Canoco these numeric codes can be found lurking in the <code>.con</code> files that describe the analysis performed. This use of numeric codes for scaling types was so pervasive that it was logical for Jari Oksanen to include the same system when the first <code>cca()</code> and <code>rda()</code> functions were written and in doing so Jari perpetuated one of the most frustrating things Iāve ever had to deal with as a user and teacher of ordination methods. But, as of last week, my frustration is no moreā¦
</p>
<p>
ā¦because we released a patch update to the CRAN version of <strong>vegan</strong>. Normally we donāt introduce new functionality in patch releases but the change I made to the way users can request ordination scores was pretty trivial and maintained backwards compatibility.
</p>
<p>
Previously, different scalings could be requested using the <code>scaling</code> argument. <code>scaling</code> is an argument of the <code>scores()</code> function; any function using <code>scores()</code> would either have <code>scaling</code> as a formal argument too, or would pass <code>scaling</code> on to <code>scores()</code> internally. Until now, the different scalings were specified, as per DOS-era Canoco, as numeric values. Now, <code>scores()</code> accepts either those same old numeric values or a character string for <code>scaling</code> coupled with a second logical argument. <strong>Vegan</strong> accepts the following character values to select the type of scaling:
</p>
<ul>
<li>
<p>
<code>"sites"</code>, which gives site-focussed scaling, equivalent to numeric value <code>1</code>
</p>
</li>
<li>
<p>
<code>"species"</code> (the default), which gives species- (variable-) focused scaling, equivalent to numeric value <code>2</code>
</p>
</li>
<li>
<p>
<code>"symmetric"</code>, which gives a so-called symmetric scaling, and is equivalent to numeric value <code>3</code>.
</p>
</li>
</ul>
<p>
To get negative versions of these values, the <code>correlation</code> or <code>hill</code> argument should be set to <code>TRUE</code> as follows
</p>
<ul>
<li>
<p>
<code>correlation</code> (default <code>FALSE</code>) for correlation-like scores for PCA/RDA/CAPSCALE models, or
</p>
</li>
<li>
<p>
<code>hill</code> (default <code>FALSE</code>) for Hillās scaling for CA/CCA models
</p>
</li>
</ul>
<p>
Whilst this requires the setting of two different arguments, itās certainly a lot easier to remember these two arguments than what the numerical codes mean.
</p>
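<p>
For example, the following pairs of calls are equivalent ways of requesting the same scores; here <code>ord</code> stands for a fitted <code>rda()</code> ordination, like the PCA in the example below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">scores(ord, scaling = 2)                              # old numeric code
scores(ord, scaling = "species")                      # new character equivalent
scores(ord, scaling = -2)                             # old negative code
scores(ord, scaling = "species", correlation = TRUE)  # new equivalent</code></pre>
</figure>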
<h3 id="obligatory-dutch-dune-meadows-example">
Obligatory Dutch dune meadows example
</h3>
<p>
Hereās a quick example of the new usage showing a PCA of the classic Dutch dune meadow data set.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"vegan"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: permute
Loading required package: lattice
This is vegan 2.3-1</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">data</span><span class="p">(</span><span class="n">dune</span><span class="p">)</span><span class="w">
</span><span class="n">ord</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rda</span><span class="p">(</span><span class="n">dune</span><span class="p">)</span><span class="w"> </span><span class="c1"># fit the PCA</span><span class="w">
</span><span class="n">layout</span><span class="p">(</span><span class="n">matrix</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">ord</span><span class="p">,</span><span class="w"> </span><span class="n">scaling</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"species"</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">ord</span><span class="p">,</span><span class="w"> </span><span class="n">scaling</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"species"</span><span class="p">,</span><span class="w"> </span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">layout</span><span class="p">(</span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/user-friendly-scaling-example-1.png" alt="PCA of the Dutch dune meadow data set. Both biplots are drawn using species scaling, but the one on the right standardizes the species scores." />
<figcaption>
PCA of the Dutch dune meadow data set. Both biplots are drawn using species scaling, but the one on the right standardizes the species scores.
</figcaption>
</figure>
<p>
The two biplots are based on the same underlying ordination and both focus the scaling on best representing the relationships between species (<code>scaling = "species"</code>), but the biplot on the right uses correlation-like scores. This has the effect of making the species have equal representation on the plot without doing the PCA with standardized species data (all species having unit variance).
</p>
ESA's publishing deal with Wiley
Gavin L. Simpson
2015-08-11T00:00:00-06:00
2015-08-11T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/08/11/notes-from-esa-council-on-wiley-deal/
<p>
One of the big announcements about the society made by ESA in the run up to the annual meeting in Baltimore this week was the news that ESA has chosen to partner with John Wiley & Sons as publisher of the society journals. At the time of the announcement few details about the deal or the process by which this decision was made were available. I was attending the ESA Council as the incoming Chair of the Paleoecology Section where some further details were provided and members of Council were able to ask questions about the deal. These are my notes from that meeting.
</p>
<h2 id="headlines">
Headlines
</h2>
<ul>
<li>
Deal brings financial stability for the society
</li>
<li>
Wiley will pay ESA a guaranteed royalty payment, annually
</li>
<li>
The deal allows for profit-sharing should Wiley increase subscriptions/income beyond some point (the specifics were not given)
</li>
<li>
All ESA members will receive electronic access to the society journals, and membership dues will <em>not</em> increase (beyond the usual annual 2% increase)
</li>
<li>
ESA members will have a print-on-demand option for those wanting hard copies
</li>
<li>
<em>Frontiers</em> will continue to be produced as a hard copy for all members
</li>
<li>
No hybrid Open Access option for papers (for now, Wiley & ESA will look at this going forward)
</li>
<li>
Page charges remain
</li>
<li>
<em>Ecosphere</em> charges are expected to remain the same, and it remains fully open access
</li>
<li>
Ecological Archives is likely to move to a new home; ESA and Wiley are currently discussing options, though moving the existing archives to Figshare is the main option being explored
</li>
<li>
Somewhere in Ithaca there is a single computer running DOS(!) that performs a critical part of the current journal publishing platform used by ESAā¦
</li>
<li>
In contrast, Wileyās publishing platform, whatever you might think of Scholar One or ReadCube, is light years better than EcoTrackā¦
</li>
</ul>
<h2 id="some-detail">
Some detail
</h2>
<p>
Council were expected to vote to approve, or not, the budget as presented by the VP Finance. The slides presented to Council to facilitate this included financial details of the Wiley deal. I asked whether these were for public consumption and had it confirmed that the numbers were public. The payment to ESA from Wiley in the 2015ā16 budget is <strong>$1,350,357</strong>. This number includes
</p>
<ul>
<li>
The royalty payment
</li>
<li>
An amount to cover some of ESAās costs with the journals (details of what this involved & the amount were either lacking or I didnāt catch them)
</li>
</ul>
<p>
This number is only half what the income will be each year as ESAās financial year runs JulyāJune and hence the 2015ā16 budget includes half a year of ESA self-publishing and half a year with Wiley publishing. I confirmed that in 2016ā17 the payment from Wiley will be <strong>$2,700,714</strong>, and that income from subscriptions and page charges will drop to zero at the same time.
</p>
<p>
I didnāt fully get down in my notes how the expenses/costs due to publications would change in 2016ā17; in the current year the picture is complicated because there are significant costs associated with migrating the journals to Wileyās platform. Therefore, I donāt know exactly what the anticipated āprofitā will be going forward. What is, I think, indicative is that the senior ESA staff and academics were clearly anticipating significant improvements in the āprofitā generated by the Societyās journals that can be directed towards activities the Society does on behalf of its members and its support for ecology.
</p>
<p>
There are still many details about the deal and the process that are not clear or not covered in the Council meeting. What was abundantly clear was that the people present that were involved in the Publications Transition Committee, the senior ESA staff, the President, were <strong>all</strong> clearly acting in the best interests of the Society when they set out to investigate options for the Societyās journals <em>and</em> when they made the decision to choose Wiley as publisher. From what I saw, the Committee has certainly secured the good financial stability of the Society for the immediate future.
</p>
The new Tri-agency open access policy
Gavin L. Simpson
2015-07-10T00:00:00-06:00
2015-07-10T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/07/10/tri-agency-open-access-policy/
<p>
Earlier this year the triumvirate of Canadian science funding bodies, the Natural Sciences and Engineering Research Council (NSERC), the Canadian Institutes of Health Research (CIHR) and the Social Sciences and Humanities Research Council of Canada (SSHRC) (collectively referred to as the Tri-Agencies), announced their new policy of open access to research publications. This followed a period of consultation, begun in the fall of 2013, with the science communities funded by the Tri-Agencies. The policy came into effect on May 1st this year (2015) and applies to all Tri-Agency-funded grants awarded from May 1st 2015 onward. As part of its awareness programme for the policy, the Tri-Agencies have been holding webinars to explain the new policy and allow for questions from researchers. In the main the Tri-Agency policy is pretty clear, judging by the questions from academics during the webinar session that I attended recently, but we can conclude one or both of two things: i) academics donāt read things unless they absolutely must, and ii) academics have some interesting views about open access, what it means for them, and what they consider as being good practice or complying with the new rules. I was asked after tweeting about this to summarise my notes from the webinar and on the Tri-Agency policy on open access in general.
</p>
<p>
In the main, the Tri-Agency <a href="http://www.science.gc.ca/default.asp?lang=En&n=F6765465-1">policy on open access</a> is pretty simple and, to be fair to the Tri-Agencies, they have done a good job of providing information and <a href="http://www.science.gc.ca/default.asp?lang=En&n=A30EBB24-1">FAQs</a> that would probably answer most questions a general academic might have regarding the policy. That is if people would just read what the Tri-Agencies have written and stop trying to nit-pick or special-case their particular question. That said, there are some areas where the policy is somewhat ambiguous and others where it is just plain missing.
</p>
<p>
But first, a summary of what the policy actually requires:
</p>
<ul>
<li>
Within 12 months of publication the peer-reviewed, final author-version of the manuscript <em>must</em> be freely available from the journal website or an approved institutional or disciplinary repository.
</li>
<li>
The policy applies to peer-reviewed research publications arising from Tri-Agency-funded grants awarded May 1st 2015 or later.
</li>
</ul>
<p>
The overriding issue here is ensuring that Canadians, be they members of the public, employees in industry, government officials, or academics, have access to the research outputs that the Canadian public have funded through their taxes. This is morally right; knowledge shouldnāt be locked away for the privileged few. But more than that it is the right thing to do economically and educationally for Canada. Industry canāt capitalise fully on the research paid for by Canadians if it is locked away behind exorbitant paywalls for example.
</p>
<p>
When the Tri-Agency speaks of research publications they exclusively mean peer-reviewed journal articles. So the policy doesnāt apply to research reports, monographs, book chapters, teaching materials etc. Just peer-reviewed papers. To be compliant, if you are in receipt of <em>new</em> research grant funding from one of the Tri-Agencies awarded May 1st 2015 onward, you are required to make freely available any peer-reviewed journal publications within 12 months of publication.
</p>
<p>
You can be compliant by following one of two routes
</p>
<ol type="1">
<li>
Deposit, within 12 months of publication, the final peer-reviewed (but not typeset) version of your manuscript in an approved repository. Iāll come back to what an approved repository is later, but ideally youād deposit the paper in your institutional repository, often run by your institutionās library, or a discipline-specific repository. This is the Green Open Access route. Or
</li>
<li>
Publish in an open access journal or a so-called hybrid journal that allows open access. This is the Gold Open Access route and provides immediate open and free access from the date of publication, but may require the payment of an article processing charge (APC). Note that not all journals charge APCs and that not all journals that do charge levy similarly-high APCs.
</li>
</ol>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/books-patrick-gothe-unsplash.jpg" alt="Institutional or disciplinary repositories are ideal places to deposit Green OA (Route 1) papers. (Source: Patrick Gƶthe, CC-0, Unsplash)" />
<figcaption>
Institutional or disciplinary repositories are ideal places to deposit Green OA (Route 1) papers. <br /> (Source: Patrick Gƶthe, CC-0, <a href="https://unsplash.com/photos/xiTFENI0dMY/download">Unsplash</a>)
</figcaption>
</figure>
<p>
Route 1 allows you to continue to publish in your traditional journals and doesnāt incur any additional cost to the researcher, but you do need to ensure that the journal you want to publish in doesnāt have an embargo longer than 12 months and that your institutionās repository meets the requirements of the publisher. Here Iām thinking of Elsevier and its regressive policies requiring various licensing terms or separate agreements. Your library staff can help you with this so go talk to them <em>before</em> you start writing your next paper as you may have to send it to a different journal to be compliant with the new policy. (You shouldnāt be publishing with Elsevier anyway, but that is a different story.)
</p>
<p>
Route 2 is my preferred option but despite what I and other OA activists will tell you about the majority of open access journals not charging APCs, in practice, the Gold route is going to cost you some money. How much money depends on where you want to publish; it will be upwards of US$3000 if you want to publish in traditional subscription journals that offer an open access option for example. Far cheaper options exist such as PLOS One, Scientific Reports, The PeerJ, to name but three. Critical points to note here though are
</p>
<ul>
<li>
You remain in charge of where you publish; journal choice is yours and yours alone (except for the embargo period!),
</li>
<li>
You donāt have to pay for Open Access; the Tri-Agencies <strong>do not</strong> require this of you, and
</li>
<li>
There is no <strong>additional</strong> funding from the Tri-Agencies to support the new policy
</li>
</ul>
<p>
As such, you remain largely in charge of where you can submit your papers for publication and you can choose route 1 (self-deposit) if you donāt want to or canāt afford to pay the APCs from your grant. You <em>are</em> allowed to pay APCs from your grant as an allowable expense, but youāll need to decide whether to pay them or use the money for something else.
</p>
<p>
What does <em>within 12 months of publication</em> mean? For the Route 1 option, the policy requires you to deposit the paper within 12 months of the actual in-print publication date. The clock doesnāt start running, according to the Tri-Agencies, until your paper has been included in an issue. That means that online early publication doesnāt count towards the 12 month period.
</p>
<p>
What constitutes an approved repository? This is one area where the policy is necessarily unclear if you start to wander off piste. Your best bet is to use your Institutional Repository ā this supports your institution as well as in general being compliant with the policy ā or a discipline-based repository (PubMed Canada for example). Speak to the people, often the library staff, that run your institutionās repository for more information or visit the <a href="http://carl-abrc.ca/en/scholarly-communications/carl-institutional-repository-program.html">Canadian Association of Research Libraries Institutional Repository Project: Online Resource Portal</a> for further guidance. If your institution doesnāt have a repository, donāt worry as you can use one of <a href="http://www.carl-abrc.ca/en/scholarly-communications/canadian-ir-repositories/adoptive-repositories.html">eight adoptive repositories</a> run by academic institutions that will accept submissions from academics beyond their host institution.
</p>
<p>
You may also be able to use your favourite online repository, but note that the repository canāt require anyone to sign up for an account just to access papers (so Research Gate is out), and it is important to understand the longevity of the repository and what its disaster plan is for submissions should it cease to operate. As there are so many of these online repositories, the Tri-Agencies cannot possibly provide an exhaustive list of approved ones, so they ask that people get in touch with the relevant Tri-Agency representative if they have questions about a specific repository.
</p>
<p>
Some common sense should help here; if you need to log in to read papers then the repository is not compatible with the policy. Putting your papers on your own website doesnāt constitute compliance either; Google may be pretty amazing at ferreting out resources on the web, but the Tri-Agencies take <em>discoverability</em> seriously, so if you do put papers on your website, as I do here, you also need to deposit them in an approved repository. If the repository is a commercial entity like Academia.edu, Research Gate, FigShare, etc., then you do need to check closely what the publishers allow you to do in this regard <em>if you have signed over copyright to them</em>. Elsevier, for example, requires a non-commercial licence and hosting on non-commercial terms, which could rule out many of these repositories as potential venues for your paper even if the Tri-Agencies considered them compliant.
</p>
<p>
Another area that the Tri-Agency policy is not clear on, is the licence terms that Gold Open Access (Route 2) papers should be made available under. The Agencies talk only about free availability but not freedom in terms of usage. Many publishers push researchers towards more restrictive licences (particularly with non-commercial clauses) rather than the accepted standard of the Creative Commons By Attribution (CC-BY) licence (or equivalent), which requires only that the original source and author be acknowledged. Academics should be wise to this and use the most permissive licence allowed (usually CC-BY) because other clauses, especially non-commercial ones, are a source of confusion (what does ācommercialā actually mean?!) and could exclude the wider benefits to Canadians, its industry and economy that the Tri-Agency Policy was developed to promote.
</p>
<p>
Some smaller points that cropped up during the webinar were:
</p>
<ul>
<li>
<strong>Compliance checking</strong>; at the moment you are required to comply with the policy upon your acceptance of funding. The Tri-Agencies are currently not doing any compliance checking but they are starting to think about what form such checks might take. I suspect it will be some years before compliance checking is commonplace, not least because the Tri-Agencies indicated theyād be consulting with the community on what form this should take also.
</li>
<li>
<strong>Data</strong>; apart from CIHR, the policy does not apply to data. Given that the policy only applies to peer-reviewed journal publications, that alone should have been enough to cover this point, but CIHR has long had a policy on Open Data so I guess NSERC and SSHRC needed to be explicit in this instance that they do not require Open Data. That said, Open Data is coming and I suspect it wonāt be that many years before NSERC and SSHRC start consulting on that too.
</li>
<li>
<strong>Grant applications and OA fees</strong>; you can include funds for APCs on your research grant proposals. The Tri-Agencies will be working with the review panels to make it clear that these are allowed. Your budget should be reasonable of course and reviewers will be asked to comment on that, but requesting funds for APCs will not count against your grant in the review stage. āReasonableā was the word that kept being used here.
</li>
<li>
<strong>NSERC Discovery grants</strong> announced but awarded prior to May 1st (in other words if you just received a Discovery Grant for the first time, or renewed one in the round just announced) are <strong>not</strong> covered by the new policy as they were awarded prior to May 1st 2015. I suppose this is because recipients applied for grants before details of the policy were announced so it is only fair to the recipients to not require compliance yet. Note also that the annual instalment of your Discovery Grant <strong>does not</strong> count as a new grant. So if you already have a Discovery Grant, that grant is not subject to the terms of the new policy. Only when you renew the Discovery Grant will you need to meet the requirements of the policy from that point on.
</li>
<li>
<strong>The policy is <em>not</em> retroactive</strong>; only new grants awarded May 1st 2015 or later will be subject to the policy.
</li>
<li>
<strong>The policy applies to <em>all</em> funded works</strong>, even if you are working with colleagues not covered by the policy. If your contribution is from a grant covered by it, you must comply with the policy.
</li>
</ul>
<p>
Finally, the <a href="http://www.carl-abrc.ca/en/scholarly-communications/resources-for-authors.html#addendum">SPARC Canadian Author Addendum</a> was mentioned several times. I donāt know why Canadian academics have their own specific version of the standard SPARC Author Addendum, but regardless, this is something you should use to retain your rights to your work. What normally happens when you agree with a publisher to publish your paper in a journal is that you sign over your copyright in the work to the publisher. This is starting to change with some publishers, whereby you donāt transfer copyright; you retain it and instead provide the publisher with an exclusive licence to publish the work. Both of these place restrictions on what you can and canāt do with your own research papers, but signing over copyright limits you the most.
</p>
<p>
What the SPARC Author Addendum does is provide you with a standard form that you return alongside the copyright transfer or other agreement with the publisher, which indicates that the publisher, in agreeing to publish your paper, also agrees to your retaining some key rights, including but not restricted to
</p>
<ol type="1">
<li>
the right to reproduce the article for non-commercial purposes,
</li>
<li>
the right to prepare derivative works (i.e.Ā use figures in other works),
</li>
<li>
the right to allow others to reproduce the work under non-commercial terms.
</li>
</ol>
<p>
In returning the SPARC addendum, you also require that the publisher provide you with a PDF of the publisher version of record which is not encumbered by security or other DRM. Thereās even a clause in the addendum that says that if the publisher doesnāt respond to the addendum and publishes your paper, they have agreed to the terms presented in the Addendum.
</p>
<p>
Regardless of whether you now come under the Tri-Agency Open Access Policy or not, retaining rights using the SPARC Canadian Author Addendum should be something we do anyway.
</p>
<p>
Whatever you do now, if you are unsure about something regarding the Tri-Agency policy on open access, <a href="http://www.science.gc.ca/default.asp?lang=En&n=F6765465-1">go and read it</a> (it is short) and read through the <a href="http://www.science.gc.ca/default.asp?lang=En&n=A30EBB24-1">FAQs</a>. Speak to people at your institutionās library, and if you still have questions get in touch with the relevant member of the Tri-Agencies.
</p>
<p>
Even if you arenāt yet required to make your papers available under the terms of the policy, it would be a good use of time to start getting yourself and the other members of your lab into the habit of depositing research publications and familiarising yourselves with open access licences and what restrictions are in place for journals you traditionally publish in etc. <em>before</em> you are required to deposit your works.
</p>
<p>
If Iāve messed something up here or if anything remains unclear, Iāll do my best to answer any questions in the comments or correct the information stated above.
</p>
My aversion to pipes
Gavin L. Simpson
2015-06-03T00:00:00-06:00
2015-06-03T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/06/03/my-aversion-to-pipes/
<p>
At the risk of coming across as even more of a curmudgeonly old fart than people already think I am, I really do dislike the current vogue in R that is the pipe family of binary operators; e.g.Ā <code>%>%</code>. Introduced by Hadley Wickham and popularised and advanced via the <a href="http://cran.r-project.org/web/packages/magrittr/index.html"><strong>magrittr</strong> package</a> by Stefan Milton Bache, the basic idea brings the forward pipe of the F# language to R. At first, I was intrigued by the prospect and initial examples suggested this might be something I would find useful. But as time has progressed and Iāve seen the use of these pipes spread, Iāve grown to dislike the idea altogether. Here I outline why.
</p>
<p>
The forward pipe operator is designed, in R at least (Iām not familiar with F#), to avoid the sort of nested/inline R code of the type shown below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">the_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">transform</span><span class="p">(</span><span class="n">subset</span><span class="p">(</span><span class="n">read.csv</span><span class="p">(</span><span class="s1">'/path/to/data/file.csv'</span><span class="p">),</span><span class="w">
</span><span class="n">variable_a</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="n">x</span><span class="p">),</span><span class="w">
</span><span class="n">variable_c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable_a</span><span class="o">/</span><span class="n">variable_b</span><span class="p">),</span><span class="w">
</span><span class="m">100</span><span class="p">)</span></code></pre>
</figure>
<p>
replacing that awful mess with
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">the_data</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">read.csv</span><span class="p">(</span><span class="s1">'/path/to/data/file.csv'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">subset</span><span class="p">(</span><span class="n">variable_a</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">transform</span><span class="p">(</span><span class="n">variable_c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable_a</span><span class="o">/</span><span class="n">variable_b</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="m">100</span><span class="p">)</span></code></pre>
</figure>
<p>
And when compared against one another like that, who wouldnāt rejoice at the prospect of a pipe to banish such awful R code to distant memory? The problem with this comparison though is, <em>who writes code like that in the first code block</em>? I donāt think Iāve <em>ever</em> written code like that, even when I was a very green useR around the turn of the century.
</p>
<p>
When you compare the pipe version with how Iād lay out the R code
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">the_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="s1">'/path/to/data/file.csv'</span><span class="p">)</span><span class="w">
</span><span class="n">the_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">subset</span><span class="p">(</span><span class="n">the_data</span><span class="p">,</span><span class="w"> </span><span class="n">variable_a</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="n">the_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">the_data</span><span class="p">,</span><span class="w"> </span><span class="n">variable_c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable_a</span><span class="o">/</span><span class="n">variable_b</span><span class="p">)</span><span class="w">
</span><span class="n">the_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">the_data</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w"> </span><span class="c1"># I'm perplexed as to why this would be a good thing to do?</span></code></pre>
</figure>
<p>
the benefits of the pipe remain but they arenāt, at least in my opinion, as compelling. My version is verbose; I repeatedly overwrite the <code>the_data</code> object with subsequent operations. Rather than writing <code>the_data</code> once, as in the pipe version, Iād write it 7 times! But that said, I could pass my version to a relative novice useR and theyād have a reasonable grasp of what the code did. I donāt think the same could be said for the pipe version.
</p>
<p>
But all that really doesnāt matter, does it? Itās personal preference as to how you choose to write your data analysis and manipulation R script code. If you find it easier to write code and then read it back using the pipe operator, all power to you.
</p>
<p>
Where I think it does make a difference is where you are
</p>
<ol type="1">
<li>
writing code to go into an R package for general consumption on say CRAN, or
</li>
<li>
writing example material for your package in a vignette or similar document.
</li>
</ol>
<p>
I donāt claim that these are the only problem areas nor that these are universally accepted. I wager Iām in the majority position at the moment, but that is probably down to the relatively recent arrival of the pipe on the R scene.
</p>
<p>
Why is the pipe a problem if you are writing code to go into a general purpose R package that you expect users to abuse with their own data in their own code? Two reasons. The pipe operator involves the <a href="http://adv-r.had.co.nz/Computing-on-the-language.html">standard non-standard evaluation</a> (NSE) paradigm. The pipe captures expressions on each side of the <code>%>%</code> operator and then arranges for the thing on the left of <code>%>%</code> to be injected into the expression on the right of <code>%>%</code>, usually as the first argument but not always. This all involves capturing the expressions and evaluating them within the <code>%>%()</code> function.
</p>
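<p>
As a concrete illustration (mine, not from the <strong>magrittr</strong> documentation), the left-hand side is injected as the first argument of the call on the right-hand side unless a <code>.</code> placeholder redirects it:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("magrittr") # provides %>%

mtcars %>% head(2) # equivalent to head(mtcars, 2)
10 %>% seq_len()   # equivalent to seq_len(10)

## a `.` placeholder redirects the injection away from the first argument
letters[1:3] %>% paste0("prefix_", .) # equivalent to paste0("prefix_", letters[1:3])</code></pre>
</figure>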
<p>
OK, isnāt that what all functions using a formula do, or what <code>transform()</code>, <code>subset()</code>, <em>et al.</em> do? Well yes, and this is where my spider sense starts tingling. Who among us hasnāt had those things fail on us when we dropped them into an <code>lapply()</code> inside an anonymous function? Or wrapped those functions as part of a package function, only for some user to execute your function in a way you didnāt envisage? Now Hadley assures us that there is a correct way to do NSE and he even has a package for that, <a href="http://cran.r-project.org/web/packages/lazyeval/"><strong>lazyeval</strong></a>. But still I have my reservations, despite Stefanās attempts to allay my fears
</p>
<blockquote class="twitter-tweet" lang="en">
<p lang="en" dir="ltr">
<a href="https://twitter.com/ucfagls">@ucfagls</a> <a href="https://twitter.com/kevin_ushey">@kevin_ushey</a> <a href="https://twitter.com/JennyBryan">@JennyBryan</a> <a href="https://twitter.com/noamross">@noamross</a> <a href="https://twitter.com/Voovarb">@Voovarb</a> so far none have. Youāre welcome to reopen the github issue if you have examples.
</p>
ā Stefan Milton Bache (<a href="https://twitter.com/stefanbache">@stefanbache</a>) <a href="https://twitter.com/stefanbache/status/603924900135510016">May 28, 2015</a>
</blockquote>
<p>
OK, letās assume Stefan and Hadley know what they are doing (and I invariably do) and the NSE used here really is safe. That still leaves the major problem I have with writing R code like this in package functions; how do you read it, parse it, and understand what it does? How do you track down a bug in the code and where it occurs if several steps are conflated into a single pipe chain? Iām not a pipe smoker so Iāll have to guess; you undo the chain and see where things break (see the sketch after the tweet below). Wouldnāt it have been easier to just write out the steps in the first place? That way the debugger can just step through the statements line by line as youāve written them. Iām not alone in having concerns in this general area
</p>
<blockquote class="twitter-tweet" lang="en">
<p lang="en" dir="ltr">
<a href="https://twitter.com/daattali">@daattali</a> <a href="https://twitter.com/emhrt_">@emhrt_</a> <a href="https://twitter.com/ucfagls">@ucfagls</a> <a href="https://twitter.com/noamross">@noamross</a> <a href="https://twitter.com/recology_">@recology_</a> <a href="https://twitter.com/JennyBryan">@JennyBryan</a> <a href="https://twitter.com/Voovarb">@Voovarb</a> my main worry is that it makes errors harder to understand
</p>
ā Hadley Wickham (<a href="https://twitter.com/hadleywickham">@hadleywickham</a>) <a href="https://twitter.com/hadleywickham/status/603883121197514752">May 28, 2015</a>
</blockquote>
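<p>
To make that concrete, here is a minimal sketch of my own (with made-up data and column names) of what undoing a chain looks like in practice:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">library("magrittr")
df <- data.frame(a = c(-1, 2, 3), b = c(1, 4, 2))

## the chained version: if transform() misbehaves, which step is at fault?
res <- df %>% subset(a > 0) %>% transform(ratio = a / b) %>% head(10)

## the un-chained version: each step is a statement the debugger can stop on
step1 <- subset(df, a > 0)
step2 <- transform(step1, ratio = a / b)
res   <- head(step2, 10)</code></pre>
</figure>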
<p>
I suppose a lot of this will come down to how well you grok pipes and how well you understand your actual code.
</p>
<p>
OK, enough of that; on to problem area number 2. I was recently helping a <a href="http://stackoverflow.com/q/30489799/429846">StackOverflow user massage some output</a> from a <strong>vegan</strong> function into a format suitable for plotting with <strong>ggplot2</strong>. There, the aim was to go from this:
</p>
<pre><code>Group.1 S.obs se.obs S.chao1 se.chao1
Cliona celata complex 499.7143 59.32867 850.6860 65.16366
Cliona viridis 285.5000 51.68736 462.5465 45.57289
Dysidea fragilis 358.6667 61.03096 701.7499 73.82693
Phorbas fictitius 525.9167 24.66763 853.3261 57.73494</code></pre>
<p>
to this:
</p>
<pre><code> Group.1 var S se
1 Cliona celata complex chao1 850.6860 65.16366
2 Cliona celata complex obs 499.7143 59.32867
3 Cliona viridis chao1 462.5465 45.57289
4 Cliona viridis obs 285.5000 51.68736
5 Dysidea fragilis chao1 701.7499 73.82693
6 Dysidea fragilis obs 358.6667 61.03096
7 Phorbas fictitius chao1 853.3261 57.73494
8 Phorbas fictitius obs 525.9167 24.66763</code></pre>
<p>
(or at least something pretty close to it) so that the required <em>dynamite plot</em> (yes, yes, I know!) could be produced.
</p>
<p>
A little fiddling with <strong>reshape2</strong> suggested this wasnāt something that it would handle gracefully (I may well be wrong here; Iām not familiar with that particular package), and having recalled some details of Hadleyās <strong>tidyr</strong> package I felt that it would be more suited to the problem at hand. Not having used <strong>tidyr</strong>, I proceeded to CRAN to grab the manual and look at any vignettes that might help me understand how to solve this particular problem. Thankfully, Hadley is a conscientious R package maintainer and there was a rather nice <a href="http://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html">HTML-rendered version of the vignette</a> right there on CRAN for me to peruse. The only downside to this was that all the example code used pipes.
</p>
<p>
The very first usage example is (or was, depending on when you are reading this)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">tidyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">preg2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">preg</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">gather</span><span class="p">(</span><span class="n">treatment</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">treatmenta</span><span class="o">:</span><span class="n">treatmentb</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">treatment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s2">"treatment"</span><span class="p">,</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span><span class="w"> </span><span class="n">treatment</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">arrange</span><span class="p">(</span><span class="n">name</span><span class="p">,</span><span class="w"> </span><span class="n">treatment</span><span class="p">)</span><span class="w">
</span><span class="n">preg2</span></code></pre>
</figure>
<p>
Innocuous enough I guess, until you realise that Iām also reading the manual, which has usage that doesnāt involve pipes, and that Hadley isnāt naming the arguments in the calls here. Now I am having to grok what is being passed, and where, by the pipes, whilst trying to match the usage shown in the example snippet with the arguments in the manual. I might be old-school but yes, I do read the manual.
</p>
<p>
The point Iām trying to make here with my little anecdote is this; what point did the use of the pipe serve here? How am I as a user new to the package helped by Hadley also using the pipe? In my case I wasnāt; in fact it made it somewhat trickier to understand what went where, what the actual <strong>tidyr</strong> calls were etc. Now I fully understand that Hadley finds the pipe operator to be very expressive for data analysis, and who am I to argue with that? Where I would raise an issue is that if you are writing introductory example code, donāt force your users to have to grapple with two new concepts at once, at least not in the first few examples.
</p>
<p>
I donāt want to beat on Hadley over this; itās just that this was a prime example of where the use of the pipe was obfuscatory not revelatory, for me at least.
</p>
<p>
So yes, I am a curmudgeonly old fart, but this old dog can learn new tricks. Convince me Iām wrong here, because I really do want to like the pipe; my Granddad smoked one and I have fond memories of the smell and, well, all the cool kids are using the pipe so it must be good, right?
</p>
Something is rotten in the state of Denmark
Gavin L. Simpson
2015-06-02T00:00:00-06:00
2015-06-02T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/06/02/something-rotten/
<p>
On Twitter and elsewhere there has been much wailing and gnashing of teeth for some time over one particular aspect of the R ecosphere: <a href="http://cran.r-project.org/">CRAN</a>. Iām not here to argue that everything is peachy ā far from it in fact ā but I am going to argue that the problems we face <em>do not</em> begin and end with CRAN or one or more of its maintainers.
</p>
<p>
Before I let rip, in writing this I am not attempting to gloss over or otherwise dismiss the real complaints from those that feel that they have been harassed by responses from a CRAN maintainer. Itās not my place to address those issues, but rather something that the R Foundation should be handling. If true, and I have no reason to doubt the claims, there is no place for such treatment of individuals, no matter their transgression. Did you hear me? Ok, with that said, here goes.
</p>
<p>
For all the good that there is in the R community, one part of the rot that exists is with package authors. Not all package authors, mind. Just a few package authors. Some of those same people seem very vocal on Twitter and elsewhere about the perceived problems and question why CRAN has the temerity to uphold them to some quality standards. The rot, or at least a not-insignificant part of it, is those package authors that donāt give a crap about the quality of their submissions or those that donāt think the rules apply to them.
</p>
<p>
There is nothing mystical or random about getting a package on to CRAN. You create your package following the guidelines & advice in <a href="http://cran.r-project.org/doc/manuals/r-patched/R-exts.html">Writing R Extensions</a> (WRE), which, whilst verbose in places it doesnāt need to be, at least includes most of the relevant information if people would just bother to read it. I hear complaints about it being some hundred and odd pages and that people donāt have time to read it. Wait, you donāt have time to read the documentation that is provided, but then get all bent out of shape when a <em>volunteer</em> CRAN maintainer calls you on your lack of effort?
</p>
<p>
Big chunks of WRE donāt apply to most packages; not including C, C++, or FORTRAN code in your package? Great, ignore the 60% of the manual that doesnāt apply to you. By my reckoning there are on the order of 70 sparse pages that cover all you need to know about writing an R package, conveniently listed in the first 2 chapters of WRE. Add 2 more pages if you want to write new generic functions and methods. How many of those complaining read the 100-odd pages of Hadley Wickhamās <a href="http://shop.oreilly.com/product/0636920034421.do">R Packages book</a> (or the equivalent <a href="http://r-pkgs.had.co.nz/">web/HTML version</a>)?
</p>
<p>
That information, those 70 pages, is what most package authors need. Yes, OK, some people will be proficient programmers writing interfaces to compiled code whoāll need to read the other 60%, but I sure as hell hope they do read it because Iād really appreciate it if their compiled code didnāt segfault my R session just because I had the nerve to use their package.
</p>
<p>
If youāve gotten your code this far, you should have a reasonably functioning package. Next step is to do what WRE tells you and run <code>R CMD check --as-cran</code> and <code>R CMD INSTALL</code> on the <strong>tarball</strong> (i.e.Ā on the thing produced by <code>R CMD build</code>, <strong>not</strong> your source tree). If there are <strong>any</strong> issues here, fix them or make a note to tell CRAN about the issue and why it is either a false positive or nothing to worry about. This is important! A lot of what CRAN does is manual; help them by telling them why they shouldnāt worry that your package generated 3 NOTEs. You probably want to check this on at least two OSes (Linux and Windows would be ideal) and under the current R release and a recent build of R-devel. The latter may be a bit of a pain but you only really need to do this when you are doing pre-flight checks for CRAN, not at all stages of development. Using the <a href="http://win-builder.r-project.org/">Win-builder service</a> run by Uwe Ligges will cover Windows and R-devel on that OS to boot. Using a continuous integration service like Travis CI or Appveyor can help with testing on Linux/OS X and Windows respectively. Using these fancy new tools isnāt <em>that</em> technical, difficult, or insurmountable; if you are building a package in the first place you already have access to one test system, and Win-builder gives you another, for free, and you get the R-devel ribbon on top!
</p>
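<p>
In concrete terms, for a hypothetical source directory <code>mypkg</code>, the pre-flight sequence looks something like this (the exact tarball name depends on the <code>Version</code> field of your <code>DESCRIPTION</code> file):
</p>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">R CMD build mypkg                        # produces the tarball, e.g. mypkg_0.1-0.tar.gz
R CMD check --as-cran mypkg_0.1-0.tar.gz # check the tarball, not the source tree
R CMD INSTALL mypkg_0.1-0.tar.gz         # confirm the tarball installs cleanly</code></pre>
</figure>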
<p>
Having done all of that, you need to read the CRAN policy for submissions. And re-read it. And read it <em>each</em> and <em>every</em> time you submit a package to CRAN, not just on the first occasion. It changes from time to time to reflect tightening of the policy or to accommodate changes in R and the checking system.
</p>
<p>
That done, you should be good to go. But, yes sometimes youāll have overlooked something, or your test systems werenāt configured in the same way as CRANās were. Or something else. If youāve read the policy and followed WRE, then this is the one place where some error might creep in. But you know what, if CRAN tells you to single-quote some words, or title case your <code>Title</code> tag, or put a period on the end of the <code>Description</code> tag, or something else, just fix the damn problem and get on with life. You might think this is petty, and from some points of view it probably is petty, but CRAN donāt and if you want your package on CRAN then you have to follow their rules! It really is that simple. You donāt like it? Youāre welcome to a refund and can always set up a competing repository yourself.
</p>
<p>
Some package authors complain that CRAN is sweating the insignificant details at the expense of letting through compliant-but-pointless or crap-or-broken packages. Oliver Keyes recently commented on Twitter
</p>
<blockquote class="twitter-tweet" lang="en" align="center">
<p lang="en" dir="ltr">
Let me say a big thank you to BDR for shouting at me for not using single quotes but not noticing fundamentally broken vignettes
</p>
ā Oliver Keyes (<a href="https://twitter.com/quominus">@quominus</a>) <a href="https://twitter.com/quominus/status/605025216222265344">May 31, 2015</a>
</blockquote>
<p>
complaining about being asked to single-quote something or other whilst broken vignettes seemingly languish on CRAN (itās not immediately clear whether Oliver was referring to his own vignettes or something from another package). The implication here seems to be that BDR is somehow being remiss in pointing out one transgression of the rules whilst simultaneously allowing other, more serious, transgressions. This is invariably a false argument of course. If BDR #sweat[s]theshitthatdoesntmatter (as <a href="https://twitter.com/recology_">@recology_</a> succinctly put it), he sure as hell isnāt letting an obviously ā visibly ā broken vignette through the pearly gates now is he? Of course not!
</p>
<p>
If there are broken vignettes/packages on CRAN there are two reasons
</p>
<ol type="1">
<li>
the author of the package doesnāt care about fixing the vignette/package but it is broken in a non-obvious-to-CRAN way, or
</li>
<li>
the package authorās vignette/package has broken due to recent changes in dependencies or R (OK, I guess thereās a 3rd optionā¦
</li>
<li>
the package author doesnāt know and, you know what, perhaps a friendly note to tell them would suffice)
</li>
</ol>
<p>
If the reality is 2. and the package is still on CRAN, then be thankful that CRAN is probably allowing a period of grace for the package author to fix the problem. Can you imagine the cacophony of wailing from the twitteRati if CRAN pulled their packages as soon as <code>R CMD check</code> threw an error? You might mistake such an event for the raptureā¦
</p>
<p>
If the problem is 1. then what do you want BDR or CRAN to do about it? They get lambasted for too much reliance on manual checks and then when their automated checks fail to catch a problem theyāre damned again! If the problem is 1. then the ire should be directed at the respective package author, not CRAN. The problem of broken vignettes etc is not something CRAN can do much about; that contribution to the rottenness lies squarely at the feet of R package authors.
</p>
<p>
I donāt know for sure, but I can see reasons for CRAN wanting to improve the way packages are presented and described on CRANās website. I suspect they sweat these seemingly trivial details because if package authors get those things right, theyāre probably conscientious enough to make sure there arenāt other, undetectable-to-CRAN problems with their package. We should be lauding this attention to detail if the effort involved in quoting a few words and changing the case of some title or other is what stops idiots from throwing up whatever it was they ate for lunch into a tarball destined for CRAN.
</p>
<p>
If you follow the <a href="http://dirk.eddelbuettel.com/cranberries/">CRANberries package feed</a> youāll be amazed at the number of packages that get yanked from CRAN; invariably because some problem <em>was</em> found later in their package or changes to R broke the package and the author failed to sort the problem. This all has to be handled responsibly by CRAN because they invariably have a legal obligation to continue to make the sources for those removed packages available for download. Dumping this garbage in a responsible manner, with a human-negotiated time interval within which a problem can be fixed (note the cacophony point above), is not a trivial exercise. Raising the barrier to entry for R packages shipped via CRAN is, in my not so humble opinion, a good thing if it weeds out those that canāt be arsed with the effort involved in jumping through the ever-shifting hoops of WRE and CRANās policy.
</p>
<p>
So, yes there is a problem in the R community. Itās just not the entity that you all thought was the problem, at least not entirely. If the rot has set in, if the sickness has infected the community, package authors are very much partly to blame. There is no secret sauce to getting a package on to CRAN, despite what some people might think or claim. The only cure to the sickness is to sweat the detail, read the documentation, do what CRAN says. If you donāt like that, then go play somewhere else. There are hundreds, if not thousands, of package authors that have successfully navigated the treacherous waters that lie before CRANās safe harbour. You know what each of these package authors has in common? They (eventually) read the documentation and played by the rules. What makes you so special that you should get a free pass on that?
</p>
Drawing rarefaction curves with custom colours
Gavin L. Simpson
2015-04-16T00:00:00-06:00
2015-04-16T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/04/16/drawing-rarefaction-curves-with-custom-colours/
<p>
I was sent an email this week by a <strong>vegan</strong> user who wanted to draw rarefaction curves using <code>rarecurve()</code> but with different colours for each curve. The solution to this one is quite easy as <code>rarecurve()</code> has argument <code>col</code> so the user could supply the appropriate vector of colours to use when plotting. However, they wanted to distinguish all 26 of their samples, which is certainly stretching the limits of perception if we only used colour. Instead we can vary other parameters of the plotted curves to help with identifying individual samples.
</p>
<p>
To illustrate, Iāll use the Barro Colorado Island data set <code>BCI</code> that comes with <strong>vegan</strong>. I just take the first 26 samples as this was the data set size my correspondent indicated they had available.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"vegan"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: permute
Loading required package: lattice
This is vegan 2.2-1</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">data</span><span class="p">(</span><span class="n">BCI</span><span class="p">,</span><span class="w"> </span><span class="n">package</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"vegan"</span><span class="p">)</span><span class="w">
</span><span class="n">BCI2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">BCI</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">26</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="n">raremax</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">rowSums</span><span class="p">(</span><span class="n">BCI2</span><span class="p">))</span><span class="w">
</span><span class="n">raremax</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 340</code></pre>
</figure>
<p>
<code>raremax</code> is the minimum sample count achieved over the 26 samples. We will rarefy the sample counts to this value.
</p>
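<p>
As a quick aside, if you want the rarefied richness values themselves rather than the drawn curves, <strong>vegan</strong>’s <code>rarefy()</code> computes the expected number of species in random subsamples of a given size; a minimal example:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## expected species richness in random subsamples of size raremax
Srare <- rarefy(BCI2, raremax)
head(Srare)</code></pre>
</figure>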
<p>
To set up the parameters we might use for plotting, <code>expand.grid()</code> is a useful helper function
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">col</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="s2">"darkred"</span><span class="p">,</span><span class="w"> </span><span class="s2">"forestgreen"</span><span class="p">,</span><span class="w"> </span><span class="s2">"orange"</span><span class="p">,</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="s2">"yellow"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hotpink"</span><span class="p">)</span><span class="w">
</span><span class="n">lty</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"solid"</span><span class="p">,</span><span class="w"> </span><span class="s2">"dashed"</span><span class="p">,</span><span class="w"> </span><span class="s2">"longdash"</span><span class="p">,</span><span class="w"> </span><span class="s2">"dotdash"</span><span class="p">)</span><span class="w">
</span><span class="n">pars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">col</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lty</span><span class="p">,</span><span class="w"> </span><span class="n">stringsAsFactors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">pars</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> col lty
1 black solid
2 darkred solid
3 forestgreen solid
4 orange solid
5 blue solid
6 yellow solid</code></pre>
</figure>
<p>
Then we can call <code>rarecurve()</code> as follows with the new graphical parameters
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">out</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">pars</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">26</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w">
</span><span class="n">rarecurve</span><span class="p">(</span><span class="n">BCI2</span><span class="p">,</span><span class="w"> </span><span class="n">step</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">sample</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">raremax</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">col</span><span class="p">,</span><span class="w">
</span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lty</span><span class="p">,</span><span class="w"> </span><span class="n">label</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/drawing-rarefaction-curves-with-custom-coloursrarecurve-1-1.png" alt="First attempt at rarefaction curves with custom colours." />
<figcaption>
First attempt at rarefaction curves with custom colours.
</figcaption>
</figure>
<p>
Note that I saved the output from <code>rarecurve()</code> in object <code>out</code>. This object contains everything we need to draw our own version of the plot if we wish. For example, we could use fewer colours and alter the line thickness<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> instead to make up the required number of combinations.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">col</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="s2">"darkred"</span><span class="p">,</span><span class="w"> </span><span class="s2">"forestgreen"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hotpink"</span><span class="p">,</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">)</span><span class="w">
</span><span class="n">lty</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"solid"</span><span class="p">,</span><span class="w"> </span><span class="s2">"dashed"</span><span class="p">,</span><span class="w"> </span><span class="s2">"dotdash"</span><span class="p">)</span><span class="w">
</span><span class="n">lwd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">pars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">col</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lty</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwd</span><span class="p">,</span><span class="w">
</span><span class="n">stringsAsFactors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">pars</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> col lty lwd
1 black solid 1
2 darkred solid 1
3 forestgreen solid 1
4 hotpink solid 1
5 blue solid 1
6 black dashed 1</code></pre>
</figure>
<p>
Using the information in <code>out</code> returned by <code>rarecurve()</code> we can get almost the same plot using the following code to draw the elements by hand
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">Nmax</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">out</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="nf">attr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="s2">"Subsample"</span><span class="p">)))</span><span class="w">
</span><span class="n">Smax</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">out</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Nmax</span><span class="p">)),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Smax</span><span class="p">)),</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Sample Size"</span><span class="p">,</span><span class="w">
</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Species"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="n">abline</span><span class="p">(</span><span class="n">v</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">raremax</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_along</span><span class="p">(</span><span class="n">out</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">attr</span><span class="p">(</span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]],</span><span class="w"> </span><span class="s2">"Subsample"</span><span class="p">)</span><span class="w">
</span><span class="n">with</span><span class="p">(</span><span class="n">pars</span><span class="p">,</span><span class="w"> </span><span class="n">lines</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]],</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">col</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lty</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwd</span><span class="p">[</span><span class="n">i</span><span class="p">]))</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/drawing-rarefaction-curves-with-custom-coloursplot-custom-rarecurves-1.png" alt="Second attempt at rarefaction curves with custom colours and plotting." />
<figcaption>
Second attempt at rarefaction curves with custom colours and plotting.
</figcaption>
</figure>
<p>
Having done this, I don’t believe this is a useful graphic because we’re trying to distinguish between too many samples using graphical parameters. Where I do think this sort of approach might work is if the samples in the data set come from a few different groups and we want to colour the curves by group.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">col</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"darkred"</span><span class="p">,</span><span class="w"> </span><span class="s2">"forestgreen"</span><span class="p">,</span><span class="w"> </span><span class="s2">"hotpink"</span><span class="p">,</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">grp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">sample</span><span class="p">(</span><span class="nf">seq_along</span><span class="p">(</span><span class="n">col</span><span class="p">),</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">BCI2</span><span class="p">),</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="n">cols</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">col</span><span class="p">[</span><span class="n">grp</span><span class="p">]</span></code></pre>
</figure>
<p>
The code above creates a grouping factor <code>grp</code> for illustration purposes; in real analyses you’d have this already as a factor variable somewhere in your data. We also have to expand the <code>col</code> vector, because we are plotting each line separately in a loop. The plot code, reusing elements from the previous plot, is shown below:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Nmax</span><span class="p">)),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="n">Smax</span><span class="p">)),</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Sample Size"</span><span class="p">,</span><span class="w">
</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Species"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="n">abline</span><span class="p">(</span><span class="n">v</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">raremax</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_along</span><span class="p">(</span><span class="n">out</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">attr</span><span class="p">(</span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]],</span><span class="w"> </span><span class="s2">"Subsample"</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]],</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/drawing-rarefaction-curves-with-custom-coloursplot-custom-rarecurves-2-1.png" alt="An attempt at rarefaction curves output with custom colours per groups of curves." />
<figcaption>
An attempt at rarefaction curves output with custom colours per groups of curves.
</figcaption>
</figure>
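<p>
If I were using this grouped version for real, I’d also add a key for the groups. A minimal sketch, run after the plotting loop above; the labels here are just the levels of the illustrative <code>grp</code> factor:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## add a legend mapping each group level to its line colour
legend("bottomright", legend = paste("Group", levels(grp)),
       col = col, lty = "solid", bty = "n")</code></pre>
</figure>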
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
We can’t use the approach outlined in this example to vary <code>lwd</code> because of the way <code>rarecurve()</code> draws the individual curves, in a loop. We have no way to tell <code>rarecurve()</code> to use the <em>i</em>th element of a vector of <code>lwd</code> values.<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
At the frontiers of palaeoecology
Gavin L. Simpson
2015-03-31T00:00:00-06:00
2015-03-31T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/03/31/at-the-frontiers-of-palaeoecology/
<p>
A couple of weeks ago, I had the pleasure of attending and participating in a symposium held to honour John Birks as he retires from the University of Bergen and becomes Professor Emeritus. The symposium, titled “At the Frontiers of Palaeoecology”, took place on 19–20th March in Bergen, Norway, and was a wonderful mix of colleagues old and new discussing John’s contributions to the field of palaeoecology and their collaborations with him. Alongside this reminiscing were several presentations describing new areas of research by colleagues and collaborators of John.
</p>
<p>
I gave a talk on the first day, which was of the latter type. I made the case for the wider use among palaeoecologists of modern statistical methods that allow us to handle palaeoecological data as time series. For the most part, limited consideration has been given to the temporal aspects of stratigraphic data, principally because classical time series methods assume equally spaced observations and our dating methods come with considerable errors attached.
</p>
<p>
The slides from my presentation are available via <a href="http://doi.org/10.6084/m9.figshare.1354040">Figshare</a>.
</p>
Harvesting Canadian climate data
Gavin L. Simpson
2015-01-14T00:00:00-06:00
2015-01-14T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2015/01/14/harvesting-canadian-climate-data/
<p>
In December I found myself helping one of our graduate students with a data problem; for one of their thesis chapters they needed a lot of hourly climate data for a handful of stations around Saskatchewan. All of this data was, and is, available for download from the Government of Canada’s website, but with one catch: you had to download the hourly data one month at a time, manually! There is no interface allowing a user of the website to specify the date range they want and download all the data from a single station. I figured there had to be a better way, using R to automate the downloading. Thinking the solution I came up with might save other researchers needing to grab data from the Government of Canada’s website some time in the future, I wrote this post to document how we ended up doing it.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/screenshot-gc-climate-website.jpeg" title="Screenshot of Government of Canada's climate website" alt="Screenshot of Government of Canadaās climate website" />
<figcaption>
Screenshot of Government of Canadaās climate website
</figcaption>
</figure>
<p>
The website itself is reasonably pretty, but the way the web form triggered the download of a CSV containing the data was a little tricky. You can see an example of the sort of data we were interested in <a href="http://climate.weather.gc.ca/climateData/hourlydata_e.html?timeframe=1&Prov=SK&StationID=28011&hlyRange=1996-01-30%7C2015-01-12&Year=2015&Month=1&Day=12">here</a>; interestingly, you are only shown a single day of data, but when you click the big Download button you get the entire month containing the day shown in the HTML table. The web form was setting some hidden parameters that were added to the current page’s URL once the Download button was clicked. Frustratingly, the same page that showed the HTML table also handled generating and returning the CSV download. Even more frustrating, the script needed GET variables with almost the same names as some of the existing ones, differing only in case, such as <code>StationID</code> and <code>stationID</code>, the latter of which is required only by the CSV-creating script. A further annoyance was that even though the generated CSV contained an entire month’s worth of data, the URL still needed to contain the <code>Day</code> GET variable.
</p>
<p>
I’m sure I haven’t whittled the URL down to the bare minimum required to trigger CSV generation and download, but I ended up using:
</p>
<pre><code>http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&Prov=SK&StationID=28011&hlyRange=1996-01-30%7C2014-11-30&cmdB1=Go&Year=2003&Month=5&Day=27&format=csv&stationID=28011</code></pre>
<p>
which will get you the data for May 2003 from station 28011 (Regina RCS).
</p>
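<p>
If you only need that one month, you can of course download it directly by hand; a minimal sketch using the URL above (the file name simply mirrors the naming convention I use later on):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## download a single month (May 2003, station 28011) by hand
url <- paste0("http://climate.weather.gc.ca/climateData/bulkdata_e.html?",
              "timeframe=1&Prov=SK&StationID=28011&hlyRange=1996-01-30%7C2014-11-30",
              "&cmdB1=Go&Year=2003&Month=5&Day=27&format=csv&stationID=28011")
download.file(url, destfile = "28011-2003-5-data.csv", quiet = TRUE)</code></pre>
</figure>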
<p>
Having figured that out, I needed a little function that would generate the URLs we’d need to visit to get data covering the periods we wanted. Because the student needed multiple stations, and the time periods of interest differed between stations (stations got moved and picked up new IDs, so we needed to track those movements), I wrote a little function that creates a whole load of URLs given a set of station IDs and start and end years.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">genURLS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">years</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">start</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">nyears</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">years</span><span class="p">)</span><span class="w">
</span><span class="n">years</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">years</span><span class="p">,</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w">
</span><span class="n">months</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nyears</span><span class="p">)</span><span class="w">
</span><span class="n">URLS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="s2">"http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&Prov=SK&StationID="</span><span class="p">,</span><span class="w">
</span><span class="n">id</span><span class="p">,</span><span class="w">
</span><span class="s2">"&hlyRange=1953-01-30%7C2014-12-31&cmdB1=Go&Year="</span><span class="p">,</span><span class="w">
</span><span class="n">years</span><span class="p">,</span><span class="w">
</span><span class="s2">"&Month="</span><span class="p">,</span><span class="w">
</span><span class="n">months</span><span class="p">,</span><span class="w">
</span><span class="s2">"&Day=27"</span><span class="p">,</span><span class="w">
</span><span class="s2">"&format=csv"</span><span class="p">,</span><span class="w">
</span><span class="s2">"&stationID="</span><span class="p">,</span><span class="w">
</span><span class="n">id</span><span class="p">)</span><span class="w">
</span><span class="nf">list</span><span class="p">(</span><span class="n">urls</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">URLS</span><span class="p">,</span><span class="w"> </span><span class="n">ids</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">nyears</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">12</span><span class="p">),</span><span class="w"> </span><span class="n">years</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">years</span><span class="p">,</span><span class="w"> </span><span class="n">months</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">months</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
The <code>genURLS()</code> function is pretty simple: it repeats each year in the sequence <code>start:end</code> 12 times, once per month, and repeats the months <code>1:12</code> for as many years as were requested. Then it builds up a character vector of URLs from these <code>years</code> and <code>months</code> vectors and the station ID <code>id</code>.
</p>
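<p>
To make that repetition concrete, here is the expansion in miniature for a two-year request:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## two years expand to 24 (year, month) pairs, one per monthly CSV
years  <- rep(2013:2014, each = 12) # each year repeated 12 times
months <- rep(1:12, times = 2)      # the months recycled once per year
head(cbind(years, months))</code></pre>
</figure>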
<p>
If we wanted all the data for 2014 for the Regina RCS station then we could generate the URLs we’d need to visit as follows
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">regina</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">genURLS</span><span class="p">(</span><span class="m">28011</span><span class="p">,</span><span class="w"> </span><span class="m">2014</span><span class="p">,</span><span class="w"> </span><span class="m">2014</span><span class="p">)</span><span class="w">
</span><span class="nf">length</span><span class="p">(</span><span class="n">regina</span><span class="o">$</span><span class="n">urls</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">regina</span><span class="o">$</span><span class="n">urls</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 12
[1] "http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&Prov=SK&StationID=28011&hlyRange=1953-01-30%7C2014-12-31&cmdB1=Go&Year=2014&Month=1&Day=27&format=csv&stationID=28011"
[2] "http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&Prov=SK&StationID=28011&hlyRange=1953-01-30%7C2014-12-31&cmdB1=Go&Year=2014&Month=2&Day=27&format=csv&stationID=28011"
[3] "http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&Prov=SK&StationID=28011&hlyRange=1953-01-30%7C2014-12-31&cmdB1=Go&Year=2014&Month=3&Day=27&format=csv&stationID=28011"
[4] "http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&Prov=SK&StationID=28011&hlyRange=1953-01-30%7C2014-12-31&cmdB1=Go&Year=2014&Month=4&Day=27&format=csv&stationID=28011"
[5] "http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&Prov=SK&StationID=28011&hlyRange=1953-01-30%7C2014-12-31&cmdB1=Go&Year=2014&Month=5&Day=27&format=csv&stationID=28011"
[6] "http://climate.weather.gc.ca/climateData/bulkdata_e.html?timeframe=1&Prov=SK&StationID=28011&hlyRange=1953-01-30%7C2014-12-31&cmdB1=Go&Year=2014&Month=6&Day=27&format=csv&stationID=28011"</code></pre>
</figure>
<p>
The function I used to grab all the data is a little more involved, partly because in a long-running job you don’t want a single error due to a bad download to end the entire job. Another reason for some of the complexity is that if the job did fail for some reason, as long as the files downloaded up to that point were OK and readable, I didn’t want to download them again. Therefore the function saves each CSV file to disk first and only then tries to read the data from the local file. The function is reasonably well-commented so I won’t dwell on those details
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">getData</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">stations</span><span class="p">,</span><span class="w"> </span><span class="n">folder</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">delete</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1">## form URLS</span><span class="w">
</span><span class="n">urls</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="nf">seq_len</span><span class="p">(</span><span class="n">NROW</span><span class="p">(</span><span class="n">stations</span><span class="p">)),</span><span class="w">
</span><span class="k">function</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="n">stations</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">genURLS</span><span class="p">(</span><span class="n">stations</span><span class="o">$</span><span class="n">StationID</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w">
</span><span class="n">stations</span><span class="o">$</span><span class="n">start</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w">
</span><span class="n">stations</span><span class="o">$</span><span class="n">end</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w">
</span><span class="p">},</span><span class="w"> </span><span class="n">stations</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stations</span><span class="p">)</span><span class="w">
</span><span class="c1">## check the folder exists and try to create it if not</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="n">folder</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">warning</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="s2">"Directory:"</span><span class="p">,</span><span class="w"> </span><span class="n">folder</span><span class="p">,</span><span class="w">
</span><span class="s2">"doesn't exist. Will create it"</span><span class="p">))</span><span class="w">
</span><span class="n">fc</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">dir.create</span><span class="p">(</span><span class="n">folder</span><span class="p">))</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">inherits</span><span class="p">(</span><span class="n">fc</span><span class="p">,</span><span class="w"> </span><span class="s2">"try-error"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">stop</span><span class="p">(</span><span class="s2">"Failed to create directory '"</span><span class="p">,</span><span class="w"> </span><span class="n">folder</span><span class="p">,</span><span class="w">
</span><span class="s2">"'. Check path and permissions."</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">""</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## Extract the data from the URLs generation</span><span class="w">
</span><span class="n">URLS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="s1">'[['</span><span class="p">,</span><span class="w"> </span><span class="s2">"urls"</span><span class="p">))</span><span class="w">
</span><span class="n">sites</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="s1">'[['</span><span class="p">,</span><span class="w"> </span><span class="s2">"ids"</span><span class="p">))</span><span class="w">
</span><span class="n">years</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="s1">'[['</span><span class="p">,</span><span class="w"> </span><span class="s2">"years"</span><span class="p">))</span><span class="w">
</span><span class="n">months</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unlist</span><span class="p">(</span><span class="n">lapply</span><span class="p">(</span><span class="n">urls</span><span class="p">,</span><span class="w"> </span><span class="s1">'[['</span><span class="p">,</span><span class="w"> </span><span class="s2">"months"</span><span class="p">))</span><span class="w">
</span><span class="c1">## filenames to use to save the data</span><span class="w">
</span><span class="n">fnames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste</span><span class="p">(</span><span class="n">sites</span><span class="p">,</span><span class="w"> </span><span class="n">years</span><span class="p">,</span><span class="w"> </span><span class="n">months</span><span class="p">,</span><span class="w"> </span><span class="s2">"data.csv"</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"-"</span><span class="p">)</span><span class="w">
</span><span class="n">fnames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">file.path</span><span class="p">(</span><span class="n">folder</span><span class="p">,</span><span class="w"> </span><span class="n">fnames</span><span class="p">)</span><span class="w">
</span><span class="n">nfiles</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">fnames</span><span class="p">)</span><span class="w">
</span><span class="c1">## set up a progress bar if being verbose</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">verbose</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">pb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">txtProgressBar</span><span class="p">(</span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nfiles</span><span class="p">,</span><span class="w"> </span><span class="n">style</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="nf">on.exit</span><span class="p">(</span><span class="n">close</span><span class="p">(</span><span class="n">pb</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">out</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vector</span><span class="p">(</span><span class="n">mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"list"</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nfiles</span><span class="p">)</span><span class="w">
</span><span class="n">cnames</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Date/Time"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Year"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Month"</span><span class="p">,</span><span class="s2">"Day"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Time"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Data Quality"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Temp (degC)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Temp Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Dew Point Temp (degC)"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Dew Point Temp Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Rel Hum (%)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Rel Hum Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Wind Dir (10s deg)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Wind Dir Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Wind Spd (km/h)"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Wind Spd Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Visibility (km)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Visibility Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Stn Press (kPa)"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Stn Press Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Hmdx"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Hmdx Flag"</span><span class="p">,</span><span class="w">
</span><span class="s2">"Wind Chill"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Wind Chill Flag"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Weather"</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">nfiles</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">curfile</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fnames</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="c1">## Have we downloaded the file before?</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">file.exists</span><span class="p">(</span><span class="n">curfile</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># No: download it</span><span class="w">
</span><span class="n">dload</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">download.file</span><span class="p">(</span><span class="n">URLS</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">destfile</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">curfile</span><span class="p">,</span><span class="w"> </span><span class="n">quiet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">inherits</span><span class="p">(</span><span class="n">dload</span><span class="p">,</span><span class="w"> </span><span class="s2">"try-error"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># If problem, store failed URL...</span><span class="w">
</span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">URLS</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">verbose</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">setTxtProgressBar</span><span class="p">(</span><span class="n">pb</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="c1"># update progress bar...</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">next</span><span class="w"> </span><span class="c1"># bail out of current iteration</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## Must have downloaded, try to read file</span><span class="w">
</span><span class="c1">## skip first 16 rows of header stuff</span><span class="w">
</span><span class="c1">## encoding must be latin1 or will fail - may still be problems with character set</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">read.csv</span><span class="p">(</span><span class="n">curfile</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="n">encoding</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"latin1"</span><span class="p">),</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1">## Did we have a problem reading the data?</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">inherits</span><span class="p">(</span><span class="n">cdata</span><span class="p">,</span><span class="w"> </span><span class="s2">"try-error"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># yes handle read problem</span><span class="w">
</span><span class="c1">## try to fix the problem with dodgy characters</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readLines</span><span class="p">(</span><span class="n">curfile</span><span class="p">)</span><span class="w"> </span><span class="c1"># read all lines in file</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s2">"\x87"</span><span class="p">,</span><span class="w"> </span><span class="s2">"x"</span><span class="p">,</span><span class="w"> </span><span class="n">cdata</span><span class="p">)</span><span class="w"> </span><span class="c1"># remove the dodgy symbol for partner data in Data Quality</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gsub</span><span class="p">(</span><span class="s2">"\xb0"</span><span class="p">,</span><span class="w"> </span><span class="s2">"deg"</span><span class="p">,</span><span class="w"> </span><span class="n">cdata</span><span class="p">)</span><span class="w"> </span><span class="c1"># remove the dodgy degree symbol in column names</span><span class="w">
</span><span class="n">writeLines</span><span class="p">(</span><span class="n">cdata</span><span class="p">,</span><span class="w"> </span><span class="n">curfile</span><span class="p">)</span><span class="w"> </span><span class="c1"># write the data back to the file</span><span class="w">
</span><span class="c1">## try to read the file again, if still an error, bail out</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">try</span><span class="p">(</span><span class="n">read.csv</span><span class="p">(</span><span class="n">curfile</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="n">encoding</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"latin1"</span><span class="p">),</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">inherits</span><span class="p">(</span><span class="n">cdata</span><span class="p">,</span><span class="w"> </span><span class="s2">"try-error"</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># yes, still!, handle read problem</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">delete</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">file.remove</span><span class="p">(</span><span class="n">curfile</span><span class="p">)</span><span class="w"> </span><span class="c1"># remove file if a problem & deleting</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">URLS</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="c1"># record failed URL...</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">verbose</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">setTxtProgressBar</span><span class="p">(</span><span class="n">pb</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="c1"># update progress bar...</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">next</span><span class="w"> </span><span class="c1"># bail out of current iteration</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1">## Must have (eventually) read file OK, add station data</span><span class="w">
</span><span class="n">cdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind.data.frame</span><span class="p">(</span><span class="n">StationID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">sites</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">NROW</span><span class="p">(</span><span class="n">cdata</span><span class="p">)),</span><span class="w">
</span><span class="n">cdata</span><span class="p">)</span><span class="w">
</span><span class="nf">names</span><span class="p">(</span><span class="n">cdata</span><span class="p">)[</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cnames</span><span class="w">
</span><span class="n">out</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cdata</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">isTRUE</span><span class="p">(</span><span class="n">verbose</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># Update the progress bar</span><span class="w">
</span><span class="n">setTxtProgressBar</span><span class="p">(</span><span class="n">pb</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">out</span><span class="w"> </span><span class="c1"># return</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
The main infelicity is that you have to supply <code>getData()</code> with a data frame containing the station IDs and the start and end years for the data you want to collect. This suited my needs, as we wanted to grab data from 10 stations with different start and end years as required to track station movements. It’s not as convenient if you only want to grab the data for a single station, however.
</p>
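<p>
If you do find yourself wanting a single station often, a tiny wrapper can build the one-row data frame for you. This is just a sketch; <code>getStation()</code> is a name I’m inventing here, not part of the code above:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## hypothetical convenience wrapper around getData() for one station
getStation <- function(id, start, end, folder, ...) {
    getData(data.frame(StationID = id, start = start, end = end),
            folder = folder, ...)
}</code></pre>
</figure>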
<p>
One thing you’ll note quickly if you start downloading data using this function is that the web script the Government of Canada is using on their climate website will quite happily generate a fully-formed file containing no actual data (but with all the headers, hourly time stamps, etc.) if you ask it for data outside the window of observations for a given station. There are no errors, just lots of mostly empty files, bar the header and labels.
</p>
<p>
One other thing to note is that <code>getData()</code> returns the downloaded data as a list, and no attempt is made to flatten the individual components into a single large data frame. That’s because the function allows for failed data downloads (or reads) and records the failed URL instead of the data. This gives you a chance to check those URLs manually to see what the problem might be before re-running the job, which, because we saved all the CSVs, will run very quickly from that local cache.
</p>
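<p>
Relatedly, the empty-but-well-formed files mentioned above can be flagged after the fact. A sketch, assuming <code>met</code> is the list returned by <code>getData()</code> as in the example below, and using the temperature column name set via <code>cnames</code> above:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## flag downloads that parsed fine but contain no observations at all
noData <- sapply(met, function(x) {
    is.data.frame(x) && all(is.na(x[["Temp (degC)"]]))
})</code></pre>
</figure>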
<p>
To see <code>getData()</code> in action, we’ll run a quick job, downloading the 2014 data for two stations
</p>
<ul>
<li>
Regina INTL A (51441)
</li>
<li>
Indian Head CDA (2925)
</li>
</ul>
<p>
First we create a data frame of station information
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">stations</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">StationID</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">51441</span><span class="p">,</span><span class="w"> </span><span class="m">2925</span><span class="p">),</span><span class="w">
</span><span class="n">start</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">2014</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
</span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">2014</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span></code></pre>
</figure>
<p>
Then we pass this to <code>getData()</code> with the path to the folder we wish to cache downloaded CSVs in
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">met</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">getData</span><span class="p">(</span><span class="n">stations</span><span class="p">,</span><span class="w"> </span><span class="n">folder</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"./csv"</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<p>
This will take a few minutes to run, even for just 24 files, as the site is not the quickest to respond to requests (or perhaps they are now throttling my workstation's IP?). Note I turned off the printing of the progress bar here, only because this doesn't play nicely with <strong>knitr</strong>'s capturing of the output. In real use, you'll want to leave the progress bar on (which it is by default) so you see how long you have to wait till the job is done.
</p>
<p>
Once this has finished, we can quickly determine if there were any failures
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="nf">any</span><span class="p">(</span><span class="n">failed</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">met</span><span class="p">,</span><span class="w"> </span><span class="n">is.character</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] FALSE</code></pre>
</figure>
<p>
If any had failed, the <code>failed</code> logical vector could be used to index into <code>met</code> to extract the URLs that encountered problems, e.g.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">unlist</span><span class="p">(</span><span class="n">met</span><span class="p">[</span><span class="n">failed</span><span class="p">])</span></code></pre>
</figure>
<p>
If there were no problems, then the components of <code>met</code> can be bound into a data frame using <code>rbind()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">met</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">do.call</span><span class="p">(</span><span class="s2">"rbind"</span><span class="p">,</span><span class="w"> </span><span class="n">met</span><span class="p">)</span></code></pre>
</figure>
<p>
The data now looks like this
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">head</span><span class="p">(</span><span class="n">met</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> StationID Date.Time Year Month Day Time Data.Quality Temp...C.
1 51441 2014-01-01 00:00 2014 1 1 00:00 ** -23.3
2 51441 2014-01-01 01:00 2014 1 1 01:00 ** -23.1
3 51441 2014-01-01 02:00 2014 1 1 02:00 ** -22.8
4 51441 2014-01-01 03:00 2014 1 1 03:00 ** -23.3
5 51441 2014-01-01 04:00 2014 1 1 04:00 ** -24.3
6 51441 2014-01-01 05:00 2014 1 1 05:00 ** -24.3
Temp.Flag Dew.Point.Temp...C. Dew.Point.Temp.Flag Rel.Hum....
1 -26.3 77
2 -26.1 77
3 -25.8 77
4 -26.3 77
5 -27.1 78
6 -27.0 79
Rel.Hum.Flag Wind.Dir..10s.deg. Wind.Dir.Flag Wind.Spd..km.h.
1 13 <NA> 22
2 12 <NA> 26
3 12 <NA> 22
4 13 <NA> 18
5 13 <NA> 14
6 9 <NA> 6
Wind.Spd.Flag Visibility..km. Visibility.Flag Stn.Press..kPa.
1 19.3 <NA> 95.38
2 24.1 <NA> 95.38
3 24.1 <NA> 95.39
4 24.1 <NA> 95.47
5 24.1 <NA> 95.56
6 24.1 <NA> 95.60
Stn.Press.Flag Hmdx Hmdx.Flag Wind.Chill Wind.Chill.Flag
1 NA NA -35 NA
2 NA NA -36 NA
3 NA NA -35 NA
4 NA NA -34 NA
5 NA NA -34 NA
6 NA NA -30 NA
Weather
1 Snow,Blowing Snow
2 Snow,Blowing Snow
3 Snow,Blowing Snow
4 Snow,Blowing Snow
5 Snow
6 <NA></code></pre>
</figure>
<p>
Yep, a bit of a mess; some post processing is required if you want tidy <code>names</code> etc. The student was only interested in temperature and relative humidity so I dropped all the other met data and data quality columns and then only had to update a few variable names. <del>I purposely didn't have <code>getData()</code> fix this in case the data format on the Government of Canada's climate website changes.</del> <strong>Update</strong> I had to change this behaviour to allow <code>getData()</code> to process some degenerate CSV files with odd characters in the column name data and the data quality field (see the comments for details). The column names are hardcoded but retain the messy names as given to them by the Government of Canada's webmaster. Cleaning up afterwards is still advised.
</p>
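<p>
For what it's worth, a minimal sketch of the sort of clean-up I mean is below; it keeps only the temperature and relative humidity columns, using the messy names shown in the <code>head(met)</code> output above, and assigns tidier replacements (the new names are my choice, nothing more).
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Keep only the columns of interest; the messy names are those produced
## from the climate site's CSV headers, as shown by head(met) above
met <- met[, c("StationID", "Date.Time", "Year", "Month", "Day", "Time",
               "Temp...C.", "Rel.Hum....")]
## Tidier, hand-picked replacement names
names(met) <- c("StationID", "DateTime", "Year", "Month", "Day", "Time",
                "Temperature", "RelHumidity")</code></pre>
</figure>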
<p>
A final note: I could have run this over all the cores in my workstation or even on all the computers in my small computer cluster, but I didn't, instead choosing to run on a single core overnight to get the data we needed. Please be a good netizen if you do use the functions I've discussed here as other people will no doubt want to access the Government of Canada's website. Don't flood the site with requests!
</p>
<p>
If you have any suggestions for improvements or changes, let me know in the comments. The latest versions of the <code>genURLS()</code> and <code>getData()</code> functions can be found in this Github <a href="https://gist.github.com/gavinsimpson/8c13e3c5f905fd67cf85">gist</a>.
</p>
Analysing a randomised complete block design with vegan
Gavin L. Simpson
2014-11-03T00:00:00-06:00
2014-11-03T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2014/11/03/randomized-complete-block-designs-and-vegan/
<p>
It has been a long time coming. <a href="http://cran.r-project.org/package=vegan"><strong>Vegan</strong></a> now has in-built, native ability to use restricted permutation designs when testing effects in constrained ordinations and in a range of other methods. This new-found functionality comes courtesy of Jari (mainly) and my efforts to have vegan permutation routines use the <a href="http://cran.r-project.org/package=permute"><strong>permute</strong></a> package. Jari also cooked up a standard interface that we can use to drop this and some extra features neatly into any function we want; this allows us to have permutation tests run on many CPU cores in parallel, splitting the computational burden and reducing the run time of tests, and also a mechanism that allows users to pass a matrix of user-defined permutations to be used in tests. These new features are now fully working in the development version of <strong>vegan</strong>, which you can find on <a href="https://github.com/vegandevs/vegan">github</a>, and which should be released to CRAN shortly. Ahead of the release, I'm preparing some examples to show off the new capabilities; first off I look at data from a randomized, complete block design experiment analysed using RDA & restricted permutations.
</p>
<p>
To follow this example locally you'll need to have version 2.1-43 or later of <strong>vegan</strong> installed. You can grab the <a href="https://github.com/vegandevs/vegan">sources from github</a> and build it yourself, or grab a Windows binary from the <a href="https://ci.appveyor.com/project/gavinsimpson/vegan/branch/master/artifacts">Appveyor Continuous integration service</a> that we're using to test on that platform – you want the <code>.zip</code> file from the Artefacts. Once you've sorted out the installation, we can begin.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"vegan"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: permute
Loading required package: lattice
This is vegan 2.1-43</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"gdata"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
Attaching package: 'gdata'
The following object is masked from 'package:stats':
nobs
The following object is masked from 'package:utils':
object.size</code></pre>
</figure>
<p>
We'll need <strong>gdata</strong>, and its <code>read.xls()</code> function, to read the XLS format files in which the data for the example are supplied.
</p>
<p>
The data set itself is quite simple and small, consisting of counts on 23 species from 16 plots, and arises from a randomised complete block design experiment described by Špačková and colleagues <span class="citation" data-cites="Spackova1998-ad">(1998)</span> and analysed by <span class="citation" data-cites="Smilauer2014-ac">Šmilauer and Lepš (2014)</span> in their recent book using Canoco v5.
</p>
<p>
The experiment tested the effects of a range of treatments on seedling recruitment
</p>
<ul>
<li>
control
</li>
<li>
removal of litter
</li>
<li>
removal of the dominant species <em>Nardus stricta</em>
</li>
<li>
removal of litter and moss (moss couldn't be removed without also removing litter)
</li>
</ul>
<p>
The treatments were replicated in four randomised complete blocks.
</p>
<p>
The data are available from the accompanying website to the book <em>Multivariate Analysis of Ecological Data using CANOCO 5</em> <span class="citation" data-cites="Smilauer2014-ac">(Šmilauer and Lepš, 2014)</span>. They are supplied as XLS format files in a ZIP archive. We can read these into R directly from the website with a little bit of effort
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## Download the data zip</span><span class="w">
</span><span class="n">furl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"http://regent.prf.jcu.cz/maed2/chap15.zip"</span><span class="w">
</span><span class="n">td</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempdir</span><span class="p">()</span><span class="w">
</span><span class="n">tf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">(</span><span class="n">tmpdir</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">td</span><span class="p">,</span><span class="w"> </span><span class="n">fileext</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">".zip"</span><span class="p">)</span><span class="w">
</span><span class="n">download.file</span><span class="p">(</span><span class="n">furl</span><span class="p">,</span><span class="w"> </span><span class="n">tf</span><span class="p">)</span><span class="w">
</span><span class="c1">## list the files in the zip, we want the xls version (file 3)</span><span class="w">
</span><span class="n">fname</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">unzip</span><span class="p">(</span><span class="n">tf</span><span class="p">,</span><span class="w"> </span><span class="n">list</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="o">$</span><span class="n">Name</span><span class="p">[</span><span class="m">3</span><span class="p">]</span><span class="w">
</span><span class="n">unzip</span><span class="p">(</span><span class="n">tf</span><span class="p">,</span><span class="w"> </span><span class="n">files</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">fname</span><span class="p">,</span><span class="w"> </span><span class="n">exdir</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">td</span><span class="p">,</span><span class="w"> </span><span class="n">overwrite</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="c1"># unzip</span><span class="w">
</span><span class="n">datpath</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">file.path</span><span class="p">(</span><span class="n">td</span><span class="p">,</span><span class="w"> </span><span class="n">fname</span><span class="p">)</span><span class="w"> </span><span class="c1"># path to xls</span><span class="w">
</span><span class="c1">## read the xls file, sheet 2 contains species data, sheet 3 the env</span><span class="w">
</span><span class="n">spp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.xls</span><span class="p">(</span><span class="n">datpath</span><span class="p">,</span><span class="w"> </span><span class="n">sheet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">skip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">env</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.xls</span><span class="p">(</span><span class="n">datpath</span><span class="p">,</span><span class="w"> </span><span class="n">sheet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">row.names</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<p>
The <code>block</code> variable is currently coded as an integer and needs converting to a factor if we are to use it correctly in the analysis
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">env</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">env</span><span class="p">,</span><span class="w"> </span><span class="n">block</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">block</span><span class="p">))</span></code></pre>
</figure>
<p>
The gradient lengths are short,
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">decorana</span><span class="p">(</span><span class="n">spp</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Call:
decorana(veg = spp)
Detrended correspondence analysis with 26 segments.
Rescaling of axes with 4 iterations.
DCA1 DCA2 DCA3 DCA4
Eigenvalues 0.1759 0.1898 0.11004 0.05761
Decorana values 0.2710 0.1822 0.07219 0.02822
Axis lengths 1.9821 1.4140 1.15480 0.87680</code></pre>
</figure>
<p>
motivating the use of redundancy analysis (RDA). Additionally, we may be interested in how the raw abundance of seedlings changes following experimental manipulation, or we may wish to focus on the proportional differences between treatments. The first case is handled naturally by RDA. The second case will require some form of standardisation by samples, say by sample totals.
</p>
<p>
First, let's test the first null hypothesis: that there is no effect of the treatment on seedling recruitment. This is a simple RDA. We should take into account the <code>block</code> factor when we assess this model for significance. How we do this illustrates two potential approaches to performing permutation tests
</p>
<ol type="1">
<li>
<p>
<strong>design</strong>-based permutations, where how the samples are permuted follows the experimental design, or
</p>
</li>
<li>
<p>
<strong>model</strong>-based permutations, where the experimental design is included in the analysis directly and residuals are permuted by simple randomisation.
</p>
</li>
</ol>
<p>
There is an important difference between the two approaches, one which I'll touch on shortly.
</p>
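<p>
To see what a design-based permutation does in practice, consider this small sketch using a toy factor of four blocks of four samples (not the experimental data themselves); <code>shuffleSet()</code> from the <strong>permute</strong> package only ever shuffles samples within their block.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Sketch: a toy blocking factor, 4 blocks of 4 contiguous samples
library("permute")
blk <- gl(4, 4)
h0 <- how(blocks = blk, nperm = 999)
## Two example permutations; indices only move within their block of 4
shuffleSet(16, nset = 2, control = h0)</code></pre>
</figure>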
<p>
We'll proceed by fitting the model, conditioning on <code>block</code> to remove between-block differences
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">mod1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rda</span><span class="p">(</span><span class="n">spp</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">treatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Condition</span><span class="p">(</span><span class="n">block</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">env</span><span class="p">)</span><span class="w">
</span><span class="n">mod1</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Call: rda(formula = spp ~ treatment + Condition(block), data =
env)
Inertia Proportion Rank
Total 990.8000 1.0000
Conditional 166.1000 0.1676 3
Constrained 329.8000 0.3329 3
Unconstrained 494.9000 0.4995 9
Inertia is variance
Eigenvalues for constrained axes:
RDA1 RDA2 RDA3
284.81 30.83 14.20
Eigenvalues for unconstrained axes:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
226.83 139.51 72.77 30.11 9.81 9.14 2.80 2.19 1.73 </code></pre>
</figure>
<p>
There is a single, strong, linear gradient in the data, as evidenced by the relative magnitudes of the eigenvalues (here expressed as proportions of the total variance)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">eigenvals</span><span class="p">(</span><span class="n">mod1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">mod1</span><span class="o">$</span><span class="n">tot.chi</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> RDA1 RDA2 RDA3 PC1 PC2 PC3
0.28746238 0.03111202 0.01432998 0.22893569 0.14080915 0.07344450
PC4 PC5 PC6 PC7 PC8 PC9
0.03038815 0.00989932 0.00922185 0.00282396 0.00221132 0.00174669 </code></pre>
</figure>
<h2 id="design-based-permutations">
Design-based permutations
</h2>
<p>
A <em>design</em>-based permutation test of these data would be conditioned on the <code>block</code> variable, by restricting permutation of samples to only <em>within</em> the levels of <code>block</code>. In this situation, samples are never permuted between blocks, only within. We can set up this type of permutation design as follows
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">h</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">how</span><span class="p">(</span><span class="n">blocks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">env</span><span class="o">$</span><span class="n">block</span><span class="p">,</span><span class="w"> </span><span class="n">nperm</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">999</span><span class="p">)</span></code></pre>
</figure>
<p>
Note that we could use the <code>plots</code> argument instead of <code>blocks</code> to restrict the permutations in the same way, but using <code>blocks</code> is simpler. I also set the required number of permutations for the test here.
</p>
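<p>
For completeness, a sketch of what the <code>plots</code>-based version might look like is below; if I have the <strong>permute</strong> interface right, leaving the plots themselves unpermuted while freely shuffling samples within plots achieves the same restriction.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Sketch: an (assumed) equivalent design using plots instead of blocks;
## plots are never permuted, samples are shuffled freely within them
h2 <- how(plots = Plots(strata = env$block, type = "none"),
          within = Within(type = "free"), nperm = 999)</code></pre>
</figure>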
<p>
Constrained ordinations in <strong>vegan</strong> are tested using the <code>anova()</code> function. New in the development version of the package is the <code>permutations</code> argument, which is the key to supplying instructions on how you want to permute to <code>anova()</code>. <code>permutations</code> can take a number of different types of instruction
</p>
<ol type="1">
<li>
<p>
an object of class <code>"how"</code>, which contains details of a restricted permutation design that <code>shuffleSet()</code> from the <strong>permute</strong> package will use to generate permutations, or
</p>
</li>
<li>
<p>
a number indicating the number of permutations required, in which case these are simple randomisations with no restriction, unless the <code>strata</code> argument is used, or
</p>
</li>
<li>
<p>
a matrix of user-specified permutations, 1 row per permutation; a small sketch of this option follows below.
</p>
</li>
</ol>
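<p>
As a quick sketch of the third option, we could generate the whole set of restricted permutations up front with <code>shuffleSet()</code> and pass the resulting matrix directly; this just reuses the <code>h</code> and <code>mod1</code> objects created above.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Sketch: a matrix of 999 restricted permutations, 1 row per permutation,
## passed directly to anova()
perms <- shuffleSet(nrow(spp), nset = 999, control = h)
pmat <- anova(mod1, permutations = perms, parallel = 3)</code></pre>
</figure>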
<p>
To perform the design-based permutation we'll pass <code>h</code>, created earlier, to <code>anova()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="n">p1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">anova</span><span class="p">(</span><span class="n">mod1</span><span class="p">,</span><span class="w"> </span><span class="n">permutations</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">,</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">p1</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Permutation test for rda under reduced model
Blocks: env$block
Permutation: free
Number of permutations: 999
Model: rda(formula = spp ~ treatment + Condition(block), data = env)
Df Variance F Pr(>F)
Model 3 329.84 1.9995 0.086 .
Residual 9 494.88
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
</figure>
<p>
Note that I've run this on three cores in parallel; this is another new feature of the development version of <strong>vegan</strong> and can considerably reduce the time needed to run permutation tests. I have four cores on my laptop but left one free for the other software I have running.
</p>
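<p>
If you are not sure how many cores your own machine has spare, the <strong>parallel</strong> package that ships with R can tell you.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## How many cores does this machine have?
library("parallel")
detectCores()</code></pre>
</figure>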
<p>
The overall permutation test indicates no significant effect of treatment on the abundance of seedlings. We can test individual axes by adding <code>by = "axis"</code> to the <code>anova()</code> call
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">24</span><span class="p">)</span><span class="w">
</span><span class="n">p1axis</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">anova</span><span class="p">(</span><span class="n">mod1</span><span class="p">,</span><span class="w"> </span><span class="n">permutations</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">,</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"axis"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Loading required package: parallel</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">p1axis</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Permutation test for rda under reduced model
Marginal tests for axes
Blocks: env$block
Permutation: free
Number of permutations: 999
Model: rda(formula = spp ~ treatment + Condition(block), data = env)
Df Variance F Pr(>F)
RDA1 1 284.81 5.1797 0.018 *
RDA2 1 30.83 0.5606 0.691
RDA3 1 14.20 0.2582 0.923
Residual 9 494.88
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
</figure>
<p>
This confirms the earlier impression that there is a single, linear gradient in the data set. A biplot shows that this axis of variation is associated with the Moss (& Litter) removal treatment. The variation between the other treatments lies primarily along axis two and is substantially less than that associated with the Moss & Litter removal.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">plot</span><span class="p">(</span><span class="n">mod1</span><span class="p">,</span><span class="w"> </span><span class="n">display</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"species"</span><span class="p">,</span><span class="w"> </span><span class="s2">"cn"</span><span class="p">),</span><span class="w"> </span><span class="n">scaling</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">,</span><span class="w">
</span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-10.5</span><span class="p">,</span><span class="w"> </span><span class="m">1.5</span><span class="p">))</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="n">mod1</span><span class="p">,</span><span class="w"> </span><span class="n">display</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"species"</span><span class="p">,</span><span class="w"> </span><span class="n">scaling</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="n">mod1</span><span class="p">,</span><span class="w"> </span><span class="n">display</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cn"</span><span class="p">,</span><span class="w"> </span><span class="n">scaling</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"blue"</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.2</span><span class="p">,</span><span class="w">
</span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Control"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Litter+Moss"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Litter"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Removal"</span><span class="p">))</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/randomized-complete-block-design-and-vegan-biplot-1.png" alt="Figure 1: RDA biplot showing species scores and treatment centroids." />
<figcaption>
Figure 1: RDA biplot showing species scores and treatment centroids.
</figcaption>
</figure>
<p>
In the above figure, I used <code>scaling = 1</code>, so-called <em>inter-sample distance scaling</em>, as this best represents the centroid scores, which are computed as the treatment-wise average of the sample scores.
</p>
<h2 id="model-based-permutation">
Model-based permutation
</h2>
<p>
The alternative approach, known as <em>model</em>-based permutation, employs free permutation of residuals after the effects of the covariables have been accounted for. This is justified because under the null hypothesis, the residuals are freely exchangeable once the effects of the covariables are removed. There is a clear advantage of model-based permutations over design-based permutations; where the sample size is small, as it is here, there tend to be few blocks and the resulting design-based permutation test is relatively weak compared to the model-based version.
</p>
<p>
It is simple to switch to model-based permutations, by setting the blocks indicator in the permutation design to <code>NULL</code>, removing the blocking structure from the design
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">setBlocks</span><span class="p">(</span><span class="n">h</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="kc">NULL</span><span class="w"> </span><span class="c1"># remove blocking</span><span class="w">
</span><span class="n">getBlocks</span><span class="p">(</span><span class="n">h</span><span class="p">)</span><span class="w"> </span><span class="c1"># confirm</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">NULL</code></pre>
</figure>
<p>
Next we repeat the permutation test using the modified <code>h</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">51</span><span class="p">)</span><span class="w">
</span><span class="n">p2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">anova</span><span class="p">(</span><span class="n">mod1</span><span class="p">,</span><span class="w"> </span><span class="n">permutations</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">,</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">p2</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Permutation test for rda under reduced model
Permutation: free
Number of permutations: 999
Model: rda(formula = spp ~ treatment + Condition(block), data = env)
Df Variance F Pr(>F)
Model 3 329.84 1.9995 0.068 .
Residual 9 494.88
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
</figure>
<p>
The estimated <em>p</em> value is slightly smaller now. The difference between treatments is predominantly in the Moss & Litter removal, with differences between the control and the other treatments lying along the insignificant axes
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">83</span><span class="p">)</span><span class="w">
</span><span class="n">p2axis</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">anova</span><span class="p">(</span><span class="n">mod1</span><span class="p">,</span><span class="w"> </span><span class="n">permutations</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">,</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"axis"</span><span class="p">)</span><span class="w">
</span><span class="n">p2axis</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Permutation test for rda under reduced model
Marginal tests for axes
Permutation: free
Number of permutations: 999
Model: rda(formula = spp ~ treatment + Condition(block), data = env)
Df Variance F Pr(>F)
RDA1 1 284.81 5.1797 0.010 **
RDA2 1 30.83 0.5606 0.735
RDA3 1 14.20 0.2582 0.960
Residual 9 494.88
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
</figure>
<h2 id="chages-in-relative-seedling-composition">
Chages in relative seedling composition
</h2>
<p>
As mentioned earlier, interest is also, perhaps predominantly, in whether any of the treatments have different species composition. To test this hypothesis we standardise by the sample (row) norm using <code>decostand()</code>. Alternatively we could have used <code>method = "total"</code> to work with proportional abundances. We then repeat the earlier steps, this time using only model-based permutations owing to their greater power.
</p>
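<p>
For reference, the proportional-abundance alternative just mentioned would be the following one-liner; I don't run it here.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## Alternative standardisation: divide each sample (row) by its total to
## work with proportional abundances
spp.prop <- decostand(spp, method = "total", MARGIN = 1)</code></pre>
</figure>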
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">spp.norm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">decostand</span><span class="p">(</span><span class="n">spp</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"normalize"</span><span class="p">,</span><span class="w"> </span><span class="n">MARGIN</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">mod2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rda</span><span class="p">(</span><span class="n">spp.norm</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">treatment</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Condition</span><span class="p">(</span><span class="n">block</span><span class="p">),</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">env</span><span class="p">)</span><span class="w">
</span><span class="n">mod2</span><span class="w">
</span><span class="n">eigenvals</span><span class="p">(</span><span class="n">mod2</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">mod2</span><span class="o">$</span><span class="n">tot.chi</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">76</span><span class="p">)</span><span class="w">
</span><span class="n">anova</span><span class="p">(</span><span class="n">mod2</span><span class="p">,</span><span class="w"> </span><span class="n">permutations</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">,</span><span class="w"> </span><span class="n">parallel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Call: rda(formula = spp.norm ~ treatment + Condition(block), data
= env)
Inertia Proportion Rank
Total 0.3726 1.0000
Conditional 0.0814 0.2184 3
Constrained 0.0725 0.1945 3
Unconstrained 0.2188 0.5871 9
Inertia is variance
Eigenvalues for constrained axes:
RDA1 RDA2 RDA3
0.04517 0.01718 0.01012
Eigenvalues for unconstrained axes:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
0.08026 0.07074 0.02860 0.01916 0.00989 0.00585 0.00223 0.00167 0.00038
RDA1 RDA2 RDA3 PC1 PC2 PC3
0.12123276 0.04610541 0.02716385 0.21539133 0.18983329 0.07675497
PC4 PC5 PC6 PC7 PC8 PC9
0.05140906 0.02655227 0.01570519 0.00597888 0.00447093 0.00101031
Permutation test for rda under reduced model
Permutation: free
Number of permutations: 999
Model: rda(formula = spp.norm ~ treatment + Condition(block), data = env)
Df Variance F Pr(>F)
Model 3 0.072475 0.9939 0.449
Residual 9 0.218768 </code></pre>
</figure>
<p>
The results suggest no difference in species composition under the experimental manipulation.
</p>
<p>
That's it for this post. In the next post I'll take a look at a more complex example, one where model-based permutations can't be used to test all the hypotheses we might want to in an experimental design.
</p>
<h2 id="references" class="unnumbered">
References
</h2>
<div id="refs" class="references">
<div id="ref-Smilauer2014-ac">
<p>
Šmilauer, P., and Lepš, J. (2014). <em>Multivariate analysis of ecological data using CANOCO 5</em>. 2nd edition. Cambridge University Press.
</p>
</div>
<div id="ref-Spackova1998-ad">
<p>
Špačková, I., Kotorová, I., and Lepš, J. (1998). Sensitivity of seedling recruitment to moss, litter and dominant removal in an oligotrophic wet meadow. <em>Folia geobotanica</em> 33, 17–30. doi:<a href="https://doi.org/10.1007/BF02914928">10.1007/BF02914928</a>.
</p>
</div>
</div>
analogue 0.14-0 released
Gavin L. Simpson
2014-10-14T00:00:00-06:00
2014-10-14T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2014/10/14/analogue-0.14-0-now-available-on-CRAN/
<p>
A couple of weeks ago I packaged up a new release of <strong>analogue</strong>, which is available from <a href="http://cran.r-project.org/web/packages/analogue/index.html">CRAN</a>. Version 0.14-0 is a smaller update than the changes released in 0.12-0 and sees a continuation of the changes to dependencies to have packages in Imports rather than Depends. The main development of <strong>analogue</strong> now takes place on <a href="https://github.com/gavinsimpson/analogue/">github</a> and bugs and feature requests should be posted there. The Travis continuous integration system is used to automatically check the package as new code is checked in. There are several new functions and methods and a few bug fixes, the details of which are given below.
</p>
<p>
The main user-visible change over 0.12-0 is the <strong>deprecation</strong> of the <code>plot3d.prcurve()</code> method. The functionality is now in the new function <code>Plot3d()</code>; <code>plot3d.prcurve()</code> is deprecated and, if called, needs to be called by its full name. This change makes analogue easier to install on Mac OS X as <strong>rgl</strong> is no longer needed to install <strong>analogue</strong>. If you want to plot the principal curve in an interactive 3d view, you'll need to get <strong>rgl</strong> installed first.
</p>
<h2 id="new-features">
New features
</h2>
<ul>
<li>
<p>
<code>n2()</code> is a new utility function to calculate Hill's N2 for sites (samples) & species (variables).
</p>
</li>
<li>
<p>
<code>optima()</code> can now compute bootstrap WA optima and uncertainty.
</p>
</li>
<li>
<p>
<code>performance()</code> has a new method for objects of class <code>"crossval"</code>.
</p>
</li>
<li>
<p>
<code>timetrack()</code> had several improvements including a new <code>predict()</code> method, which allows further points to be added to an existing timetrack, a <code>points()</code> method to allow the addition of data to an existing timetrack plot, and the <code>plot()</code> method can create a blank plotting region allowing greater customisation.
</p>
</li>
<li>
<p>
<code>prcurve()</code> gets <code>predict()</code> and <code>fitted()</code> methods to predict locations of new samples on the principal curve and extract the locations of the training samples respectively.
</p>
</li>
<li>
<p>
<code>evenSample</code> is a utility function to look at the evenness of the distribution of samples along a gradient.
</p>
</li>
<li>
<p>
Data sets <code>Pollen</code>, <code>Biome</code>, <code>Climate</code>, and <code>Location</code> from the North American Modern Pollen Database have been updated to version 1.7.3.
</p>
</li>
</ul>
<h2 id="bug-fixes">
Bug fixes
</h2>
<ul>
<li>
<p>
The calculation of AUC in <code>roc()</code> wasn't working correctly in some circumstances with just a couple of groups.
</p>
</li>
<li>
<p>
<code>crossval.pcr()</code> had a number of bugs in the k-fold CV routine which were leading to errors and the function not working.
</p>
<p>
The progress bar was not being updated correctly either.
</p>
</li>
<li>
<p>
<code>predict.pcr()</code> was setting argument <code>ncomp</code> incorrectly if not supplied by the user.
</p>
</li>
<li>
<p>
<code>ChiSquare()</code> wasn't returning the transformation parameters required to transform leave-out data during crossvalidation or new samples for which predictions were required.
</p>
</li>
<li>
<p>
<code>plot3d.prcurve()</code> was not using the data and ordination components of the returned object. Note this function is now deprecated.
</p>
</li>
<li>
<p>
<code>predict.pcr()</code> was incorrectly calling the internal function <code>fitPCR</code> with the <code>:::</code> operator.
</p>
</li>
</ul>
<h2 id="deprecated">
Deprecated
</h2>
<ul>
<li>
<code>plot3d.prcurve()</code> is deprecated. Functionality is in new function <code>Plot3d()</code>. <strong>Note</strong>: in the next version of <strong>analogue</strong>, this functionality will be removed entirely and located in a new package <a href="https://github.com/gavinsimpson/analogueExtra"><strong>analogueExtra</strong></a>.
</li>
</ul>
Simulating species abundance data with coenocliner
Gavin L. Simpson
2014-07-31T00:00:00-06:00
2014-07-31T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2014/07/31/simulating-species-abundance-data-with-the-coenocliner-package/
<p>
Coenoclines are, according to the Oxford Dictionary of Ecology <span class="citation" data-cites="Allaby1998">(Allaby, 1998)</span>, <em>"gradients of communities (e.g. in a transect from the summit to the base of a hill), reflecting the changing importance, frequency, or other appropriate measure of different species populations"</em>. In much ecological research, and that of related fields, data on these coenoclines are collected and analyzed in a variety of ways. When developing new statistical methods or when trying to understand the behaviour of existing methods, we often resort to simulating data with known pattern or structure and then torture whatever method is of interest with the simulated data to tease out how well methods work or where they break down. There's a long history of using computers to simulate species abundance data along coenoclines but until recently no <strong>R</strong> packages were available that performed coenocline simulation. <strong>coenocliner</strong> was designed to fill this gap, and today, the package was <a href="http://cran.r-project.org/web/packages/coenocliner/index.html">released to CRAN</a>.
</p>
<div id="refs" class="references">
<div id="ref-Allaby1998">
<p>
Allaby, M. (1998). <em>A dictionary of ecology</em>. 2nd edition. Oxford University Press.
</p>
</div>
</div>
<p>
<strong>coenocliner</strong> can simulate species abundance or occurrence data along one or two gradients from either a Gaussian or generalised beta response model. Parameters for the response model are supplied for each species and parameterised species response curves along the gradients are returned. Simulated abundance or occurrence data can be produced by sampling from one of several error distributions which use the parameterised species response curves as the expected count or probability of occurrence for the chosen error distribution. The available error distributions are
</p>
<ul>
<li>
Poisson
</li>
<li>
Negative binomial
</li>
<li>
Bernoulli (occurrence; binomial with denominator <em>m</em> = 1)
</li>
<li>
Binomial (counts with specified denominator <em>m</em>)
</li>
<li>
Beta-binomial
</li>
<li>
Zero-inflated Poisson (ZIP)
</li>
<li>
Zero-inflated negative binomial (ZINB)
</li>
</ul>
<p>
You can find the <a href="https://github.com/gavinsimpson/coenocliner/">source code on github</a> and <a href="https://github.com/gavinsimpson/coenocliner/issues">report</a> any bugs or issues there. In the remainder of this posting I give an overview of <strong>coenocliner</strong> and show three examples illustrating features of the package.
</p>
<h2 id="introduction-to-coenocliner">
Introduction to coenocliner
</h2>
<p>
To begin, load <strong>coenocliner</strong> and check the start-up message to see if you are using the current (0.1-0) release of the package
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s2">"coenocliner"</span><span class="p">)</span></code></pre>
</figure>
<p>
The main function in <strong>coenocliner</strong> is <code>coenocline()</code>, which provides a relatively simple interface to coenocline simulation allowing flexible specification of gradient locations and response model parameters for species. Gradient locations are specified via argument <code>x</code>, which can be a single vector, or, in the case of two gradients, a matrix or a list containing vectors of gradient values. The matrix version assumes the first gradient's values are in the first column and those for the second gradient in the second column
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">xy</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">7</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">),</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">))</span></code></pre>
</figure>
<p>
Similarly, for the list version, the first component contains the values for the first gradient and the second component the values for the second gradient
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">xy</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">),</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">from</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">))</span></code></pre>
</figure>
<p>
The species response model used is indicated via the <code>responseModel</code> argument; available options are <code>"gaussian"</code> and <code>"beta"</code> for the classic Gaussian response model and the generalised beta response model respectively. Parameters are supplied to <code>coenocline()</code> via the <code>params</code> argument. <code>showParams()</code> can be used to list the parameters for the desired response model. The parameters for the Gaussian response model are
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">showParams</span><span class="p">(</span><span class="s2">"gaussian"</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">Species response model: Gaussian
Parameters:
[1] opt tol h*
Parameters marked with '*' are only supplied once</code></pre>
</figure>
<p>
As indicated, some parameters are only supplied once per species, regardless of whether there are one or two gradients. Hence for the Gaussian model, the parameter <code>h</code> is only supplied for the first gradient even if two gradients are required.
</p>
<p>
Parameters are supplied as a matrix with named columns, or as a list with named components. For example, for a Gaussian response for each of 3 species we could use either of the two forms
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">opt</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">6</span><span class="p">)</span><span class="w">
</span><span class="n">tol</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0.25</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">h</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="m">20</span><span class="p">,</span><span class="m">30</span><span class="p">)</span><span class="w">
</span><span class="n">parm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">opt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">opt</span><span class="p">,</span><span class="w"> </span><span class="n">tol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tol</span><span class="p">,</span><span class="w"> </span><span class="n">h</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">)</span><span class="w"> </span><span class="c1"># matrix form</span><span class="w">
</span><span class="n">parl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">opt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">opt</span><span class="p">,</span><span class="w"> </span><span class="n">tol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tol</span><span class="p">,</span><span class="w"> </span><span class="n">h</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">)</span><span class="w"> </span><span class="c1"># list form</span></code></pre>
</figure>
<p>
In the case of two gradients, a list with two components, one per gradient, is required. The first component contains parameters for the first gradient, the second element contains those for the second gradient. These components can be either a matrix or a list, as described previously. For example a list with parameters supplied as matrices
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">opty</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">25</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="m">75</span><span class="p">)</span><span class="w">
</span><span class="n">tol</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">20</span><span class="p">)</span><span class="w">
</span><span class="n">pars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">px</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">parm</span><span class="p">,</span><span class="w">
</span><span class="n">py</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">opt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">opty</span><span class="p">,</span><span class="w"> </span><span class="n">tol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tol</span><span class="p">))</span></code></pre>
</figure>
<p>
Note that parameter <span class="math inline">(h)</span> is not specified in the second set because this parameter, the height of the response curve at the gradient optimum, applies globally; in the case of two gradients, <span class="math inline">(h)</span> refers to the height of the bell-shaped curve at the bivariate optimum.
</p>
<p>
Notice also that parameters are specified at the species level. To evaluate the response curve at the supplied gradient locations, each set of parameters needs to be repeated for each gradient location. Thankfully, <code>coenocline()</code> takes care of this detail for us; the toy sketch below illustrates the idea.
</p>
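<p>
To see what this replication amounts to, here is a toy sketch (this is not <strong>coenocliner</strong>'s actual internal code; the object names are made up for illustration):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## toy sketch: 3 species evaluated at 4 gradient locations means each
## species' parameter values get recycled once per location
toy_locs <- c(4.0, 4.5, 5.0, 5.5)       # hypothetical gradient values
toy_opt  <- c(4, 5, 6)                  # hypothetical species optima
rep(toy_opt, each = length(toy_locs))   # one value per species-location pair</code></pre>
</figure>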
<p>
Additional parameters that the response model may need, but which are not specified at the species level, are supplied as a list with named components to the argument <code>extraParams</code>. An example is the correlation between Gaussian response curves in the case of two gradients. Unfortunately, this means that a single correlation between response curves applies to all species<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>, a consequence of a poor implementation choice. Thankfully this is relatively easy to fix, and will be fixed in version 0.2-0, along with a similar issue concerning the specification of additional parameters for the error distribution used (see below).
</p>
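<p>
For example (a sketch anticipating the two-gradient example later in the post; <code>locs2d</code> and <code>pars2d</code> are placeholder names for two-gradient locations and parameters):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## a single correlation between the two gradients, applied to every species
mu2 <- coenocline(locs2d, responseModel = "gaussian", params = pars2d,
                  extraParams = list(corr = 0.5), expectation = TRUE)</code></pre>
</figure>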
<p>
To simulate realistic count data we need to sample <em>with error</em> from the parameterised species response curves. Which of the distributions listed earlier is used is specified via the argument <code>countModel</code>; the available options are
</p>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] "poisson" "negbin" "bernoulli" "binary"
[5] "binomial" "betabinomial" "ZIP" "ZINB" </code></pre>
</figure>
<p>
Some of these distributions (all bar <code>"poisson"</code> and <code>"bernoulli"</code>) require additional arguments, such as the <span class="math inline">(\alpha)</span> parameter of (one parameterisation of) the negative binomial distribution. These arguments are supplied as a list with named components via the argument <code>countParams</code>. Again, owing to the same implementation snafu as for <code>extraParams</code>, such parameters act globally for all species<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>.
</p>
<p>
The final argument is <code>expectation</code>, which defaults to <code>FALSE</code>. When set to <code>TRUE</code>, the simulation of species counts or occurrences with error is skipped, and the values of the parameterised response curves evaluated at the gradient locations are returned instead. This option is handy if you want to look at or plot the species response curves used in a simulation.
</p>
<h2 id="example-usage">
Example usage
</h2>
<p>
In the next few sections the basic usage of <code>coenocline()</code> is illustrated.
</p>
<h3 id="gaussian-responses-along-a-single-gradient">
Gaussian responses along a single gradient
</h3>
<p>
This example, of multiple species responses along a single environmental gradient, illustrates the simplest usage of <code>coenocline()</code>. The example uses a hypothetical pH gradient with species optima drawn uniformly at random along the gradient. Species tolerances are the same for all species. The maximum abundance of each species, <span class="math inline">(h)</span>, is drawn from a lognormal distribution with a median of ~20 (<span class="math inline">(e^3)</span>). This simulation will be for a community of 20 species, evaluated at 100 equally spaced locations. First, we set up the parameters
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">M</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">20</span><span class="w"> </span><span class="c1"># number of species</span><span class="w">
</span><span class="n">ming</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">3.5</span><span class="w"> </span><span class="c1"># gradient minimum...</span><span class="w">
</span><span class="n">maxg</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">7</span><span class="w"> </span><span class="c1"># ...and maximum</span><span class="w">
</span><span class="n">locs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">ming</span><span class="p">,</span><span class="w"> </span><span class="n">maxg</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w"> </span><span class="c1"># gradient locations</span><span class="w">
</span><span class="n">opt</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ming</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">maxg</span><span class="p">)</span><span class="w"> </span><span class="c1"># species optima</span><span class="w">
</span><span class="n">tol</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0.25</span><span class="p">,</span><span class="w"> </span><span class="n">M</span><span class="p">)</span><span class="w"> </span><span class="c1"># species tolerances</span><span class="w">
</span><span class="n">h</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">ceiling</span><span class="p">(</span><span class="n">rlnorm</span><span class="p">(</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">meanlog</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w"> </span><span class="c1"># max abundances</span><span class="w">
</span><span class="n">pars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">opt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">opt</span><span class="p">,</span><span class="w"> </span><span class="n">tol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tol</span><span class="p">,</span><span class="w"> </span><span class="n">h</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">)</span><span class="w"> </span><span class="c1"># put in a matrix</span></code></pre>
</figure>
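<p>
As a quick sanity check on the maximum abundances (a sketch; the exact values depend on the seed), recall that <span class="math inline">(e^3)</span> is the median of a lognormal with <code>meanlog = 3</code>:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">exp(3)      # ~20.1, the median of the lognormal used for h
summary(h)  # the 20 sampled maximum abundances (after ceiling())</code></pre>
</figure>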
<p>
As a check, before simulating any count data, we can look at the coenocline implied by these parameters by returning the expectations only from <code>coenocline()</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">mu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gaussian"</span><span class="p">,</span><span class="w"> </span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w">
</span><span class="n">expectation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<p>
This returns a matrix of values obtained by evaluating each species response curve at the supplied gradient locations. There is one column per species and one row per gradient location
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="nf">class</span><span class="p">(</span><span class="n">mu</span><span class="p">)</span><span class="w">
</span><span class="nf">dim</span><span class="p">(</span><span class="n">mu</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">mu</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">6</span><span class="p">])</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] "matrix"</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 100 20</code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> [,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1.088 5.443e-20 1.433e-13 0.5025 1.461e-36 2.604e-38
[2,] 1.553 2.165e-19 4.414e-13 0.6938 9.370e-36 1.669e-37
[3,] 2.173 8.440e-19 1.333e-12 0.9391 5.892e-35 1.049e-36
[4,] 2.981 3.225e-18 3.945e-12 1.2460 3.631e-34 6.460e-36
[5,] 4.008 1.208e-17 1.144e-11 1.6203 2.194e-33 3.900e-35
[6,] 5.282 4.435e-17 3.254e-11 2.0655 1.299e-32 2.308e-34</code></pre>
</figure>
<p>
A quick way to visualise the parameterised species response is to use <code>matplot()</code><a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">matplot</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"solid"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pH"</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Abundance"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simulating-species-abundance-data-with-the-coenocliner-package-example1-plot-expectations.png" alt="Figure 1: Gaussian species response curves along a hypothetical pH gradient" />
<figcaption>
Figure 1: Gaussian species response curves along a hypothetical pH gradient
</figcaption>
</figure>
<p>
The resultant plot is shown in Figure 1.
</p>
<p>
As this looks OK, we can simulate some count data. The simplest model for doing so is to make random draws from a Poisson distribution with the mean, <span class="math inline">(\lambda)</span>, for each species set to the value of the response curve evaluated at each gradient location. Hence the values in <code>mu</code> that we just created can be thought of as the expected count per species at each of the gradient locations we are interested in. To simulate Poisson count data, use <code>expectation = FALSE</code> or remove this argument from the call. To be more explicit, we should also state <code>countModel = "poisson"</code><a href="#fn4" class="footnote-ref" id="fnref4"><sup>4</sup></a>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">simp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gaussian"</span><span class="p">,</span><span class="w"> </span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w">
</span><span class="n">countModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"poisson"</span><span class="p">)</span></code></pre>
</figure>
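<p>
For intuition, here is a by-hand equivalent of the Poisson step (a minimal sketch, assuming each count is an independent Poisson draw with mean taken from <code>mu</code>; <code>coenocline()</code> handles this internally):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## draw one Poisson count per species-location expectation in mu;
## rpois() recycles the vector of means element-wise, and refilling a
## matrix with nrow(mu) rows preserves the locations-by-species layout
simp_by_hand <- matrix(rpois(length(mu), lambda = mu), nrow = nrow(mu))</code></pre>
</figure>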
<p>
Again, <code>matplot()</code> is useful for visualising the simulated data
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">matplot</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">simp</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"solid"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"p"</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">,</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pH"</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Abundance"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simulating-species-abundance-data-with-the-coenocliner-package-example1-plot-simulations.png" alt="Figure 2: Simulated species abundances with Poisson errors from Gaussian response curves along a hypothetical pH gradient" />
<figcaption>
Figure 2: Simulated species abundances with Poisson errors from Gaussian response curves along a hypothetical pH gradient
</figcaption>
</figure>
<p>
The resultant plot is shown in Figure 2 above.
</p>
<p>
Whilst the simulated counts look reasonable and follow the response curves in Figure 2, there is a problem: the variation around the expected curves is too small. This is because the error variance implied by the Poisson distribution encapsulates only the variance that would arise from repeated sampling at the gradient locations. Most species abundance data exhibit much larger degrees of variation than that shown in Figure 2.
</p>
<p>
A solution to this is to sample from a distribution that incorporates additional variance, or <em>overdispersion</em>. A natural partner to the Poisson that includes overdispersion is the negative binomial. To simulate count data using the negative binomial distribution, we must alter <code>countModel</code> and supply the overdispersion parameter <span class="math inline">(\alpha)</span><a href="#fn5" class="footnote-ref" id="fnref5"><sup>5</sup></a> via <code>countParams</code>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">simnb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gaussian"</span><span class="p">,</span><span class="w"> </span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w">
</span><span class="n">countModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"negbin"</span><span class="p">,</span><span class="w"> </span><span class="n">countParams</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">))</span></code></pre>
</figure>
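<p>
Again for intuition, a by-hand sketch of the negative binomial draw. This assumes the common parameterisation in which the dispersion <span class="math inline">(\alpha)</span> enters <code>rnbinom()</code> as <code>size = 1/alpha</code>, giving variance <span class="math inline">(\mu + \alpha\mu^2)</span>; see <strong>coenocliner</strong>'s documentation for the parameterisation it actually uses:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## negative binomial draws with mean mu and dispersion alpha = 0.5,
## assuming size = 1/alpha in base R's rnbinom() parameterisation
simnb_by_hand <- matrix(rnbinom(length(mu), mu = mu, size = 1 / 0.5),
                        nrow = nrow(mu))</code></pre>
</figure>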
<p>
Using <code>matplot()</code>, it is apparent that the simulated species data are now far more realistic (Figure 3)
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">matplot</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">simnb</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"solid"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"p"</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.8</span><span class="p">,</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"pH"</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Abundance"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simulating-species-abundance-data-with-the-coenocliner-package-example1-plot-nb-simulations.png" alt="Figure 3: Simulated species abundance with negative binomial errors from Gaussian response curves along a hypothetical pH gradient" />
<figcaption>
Figure 3: Simulated species abundance with negative binomial errors from Gaussian response curves along a hypothetical pH gradient
</figcaption>
</figure>
<h3 id="generalised-beta-responses-along-a-single-gradient">
Generalised beta responses along a single gradient
</h3>
<p>
In this example, I recreate figure 2 in <span class="citation" data-cites="Minchin1987">Minchin (1987)</span> and then simulate species abundances from the species response curves. The species parameters for the generalised beta response for the six species in <span class="citation" data-cites="Minchin1987">Minchin (1987)</span> are
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">A0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">5</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">7</span><span class="p">,</span><span class="m">5</span><span class="p">,</span><span class="m">9</span><span class="p">,</span><span class="m">8</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">10</span><span class="w"> </span><span class="c1"># max abundance</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">25</span><span class="p">,</span><span class="m">85</span><span class="p">,</span><span class="m">10</span><span class="p">,</span><span class="m">60</span><span class="p">,</span><span class="m">45</span><span class="p">,</span><span class="m">60</span><span class="p">)</span><span class="w"> </span><span class="c1"># location on gradient of modal abundance</span><span class="w">
</span><span class="n">r</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="m">3</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">6</span><span class="p">,</span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">10</span><span class="w"> </span><span class="c1"># species range of occurence on gradient</span><span class="w">
</span><span class="n">alpha</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.1</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">1.5</span><span class="p">,</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="c1"># shape parameter</span><span class="w">
</span><span class="n">gamma</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.1</span><span class="p">,</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">0.5</span><span class="p">,</span><span class="m">4</span><span class="p">)</span><span class="w"> </span><span class="c1"># shape parameter</span><span class="w">
</span><span class="n">locs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">100</span><span class="w"> </span><span class="c1"># gradient locations</span><span class="w">
</span><span class="n">pars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">alpha</span><span class="p">,</span><span class="w">
</span><span class="n">gamma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gamma</span><span class="p">,</span><span class="w"> </span><span class="n">A0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">A0</span><span class="p">)</span><span class="w"> </span><span class="c1"># species parameters, in list form</span></code></pre>
</figure>
<p>
To recreate figure 2 in <span class="citation" data-cites="Minchin1987">Minchin (1987)</span>, we need to evaluate the parameterised generalised beta at the chosen gradient locations, <code>locs</code>. These values can be generated by passing <code>coenocline()</code> the gradient locations and the chosen species parameters as before, selecting the generalised beta response model and using <code>expectation = TRUE</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">mu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"beta"</span><span class="p">,</span><span class="w"> </span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w"> </span><span class="n">expectation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<p>
As before, <code>mu</code> is a matrix with one column per species
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">head</span><span class="p">(</span><span class="n">mu</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> [,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 0 44.52 0 0.5913 0
[2,] 0 0 49.39 0 1.6582 0
[3,] 0 0 53.90 0 3.0199 0
[4,] 0 0 57.97 0 4.6085 0
[5,] 0 0 61.52 0 6.3828 0
[6,] 0 0 64.51 0 8.3138 0</code></pre>
</figure>
<p>
and as such we can use <code>matplot()</code> to draw the species responses
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">matplot</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"solid"</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Gradient"</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Abundance"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simulating-species-abundance-data-with-the-coenocliner-package-example2-beta-plot-expectations.png" alt="Figure 4: Generalised beta function species response curves along a hypothetical environmental gradient recreating Figure 2 in Minchin (1987)." />
<figcaption>
Figure 4: Generalised beta function species response curves along a hypothetical environmental gradient recreating Figure 2 in Minchin (1987).
</figcaption>
</figure>
<p>
Figure 4 is a good facsimile of figure 2 in <span class="citation" data-cites="Minchin1987">Minchin (1987)</span>.
</p>
<h3 id="gaussian-response-along-two-gradients">
Gaussian response along two gradients
</h3>
<p>
In this example, I illustrate how to simulate species abundances in an environment comprising two gradients. Parameters for the simulation are defined first, including the number of species and samples required, followed by definitions of the gradient units and lengths, the species optima and tolerances for each gradient, and the maximal abundance, <span class="math inline">(h)</span>.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">30</span><span class="w"> </span><span class="c1"># number of samples</span><span class="w">
</span><span class="n">M</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">20</span><span class="w"> </span><span class="c1"># number of species</span><span class="w">
</span><span class="c1">## First gradient</span><span class="w">
</span><span class="n">ming1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">3.5</span><span class="w"> </span><span class="c1"># 1st gradient minimum...</span><span class="w">
</span><span class="n">maxg1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">7</span><span class="w"> </span><span class="c1"># ...and maximum</span><span class="w">
</span><span class="n">loc1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">ming1</span><span class="p">,</span><span class="w"> </span><span class="n">maxg1</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">N</span><span class="p">)</span><span class="w"> </span><span class="c1"># 1st gradient locations</span><span class="w">
</span><span class="n">opt1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ming1</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">maxg1</span><span class="p">)</span><span class="w"> </span><span class="c1"># species optima</span><span class="w">
</span><span class="n">tol1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">M</span><span class="p">)</span><span class="w"> </span><span class="c1"># species tolerances</span><span class="w">
</span><span class="n">h</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">ceiling</span><span class="p">(</span><span class="n">rlnorm</span><span class="p">(</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">meanlog</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w"> </span><span class="c1"># max abundances</span><span class="w">
</span><span class="n">par1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">opt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">opt1</span><span class="p">,</span><span class="w"> </span><span class="n">tol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tol1</span><span class="p">,</span><span class="w"> </span><span class="n">h</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">h</span><span class="p">)</span><span class="w"> </span><span class="c1"># put in a matrix</span><span class="w">
</span><span class="c1">## Second gradient</span><span class="w">
</span><span class="n">ming2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="c1"># 2nd gradient minimum...</span><span class="w">
</span><span class="n">maxg2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w"> </span><span class="c1"># ...and maximum</span><span class="w">
</span><span class="n">loc2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">ming2</span><span class="p">,</span><span class="w"> </span><span class="n">maxg2</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">N</span><span class="p">)</span><span class="w"> </span><span class="c1"># 2nd gradient locations</span><span class="w">
</span><span class="n">opt2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ming2</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">maxg2</span><span class="p">)</span><span class="w"> </span><span class="c1"># species optima</span><span class="w">
</span><span class="n">tol2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">ceiling</span><span class="p">(</span><span class="n">runif</span><span class="p">(</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">))</span><span class="w"> </span><span class="c1"># species tolerances</span><span class="w">
</span><span class="n">par2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">opt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">opt2</span><span class="p">,</span><span class="w"> </span><span class="n">tol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tol2</span><span class="p">)</span><span class="w"> </span><span class="c1"># put in a matrix</span><span class="w">
</span><span class="c1">## Last steps...</span><span class="w">
</span><span class="n">pars</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">px</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">par1</span><span class="p">,</span><span class="w"> </span><span class="n">py</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">par2</span><span class="p">)</span><span class="w"> </span><span class="c1"># put parameters into a list</span><span class="w">
</span><span class="n">locs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">loc1</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">loc2</span><span class="p">)</span><span class="w"> </span><span class="c1"># put gradient locations together</span></code></pre>
</figure>
<p>
Notice how the parameter sets for each gradient are individual matrices, combined into a list, <code>pars</code>, ready for use. Also different this time is the <code>expand.grid()</code> call, which generates all pairwise combinations of the locations on the two gradients. Each combination is a coordinate pair at which we'll evaluate the response curves, so together they form a grid of points over the gradient space; the toy example below shows the idea.
</p>
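<p>
A toy illustration of the <code>expand.grid()</code> behaviour (a sketch with three locations per gradient; the values are arbitrary):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">expand.grid(x = 1:3, y = c(10, 20, 30))
## x varies fastest: rows are (1,10), (2,10), (3,10), (1,20), ...</code></pre>
</figure>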
<p>
Having set up the parameters, the call to <code>coenocline()</code> is the same as before, except now we specify a degree of correlation between the two gradients via <code>extraParams = list(corr = 0.5)</code>
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">mu2d</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gaussian"</span><span class="p">,</span><span class="w">
</span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w"> </span><span class="n">extraParams</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">corr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">),</span><span class="w">
</span><span class="n">expectation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<p>
<code>mu2d</code> now contains a matrix of expected species abundances, one column per species as before. Because of the way <code>expand.grid()</code> works, the ordering of species abundances in each column has the first gradient's locations varying fastest; the locations on the first gradient are repeated in order for each location on the second gradient
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">head</span><span class="p">(</span><span class="n">locs</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> x y
1 3.500 1
2 3.621 1
3 3.741 1
4 3.862 1
5 3.983 1
6 4.103 1</code></pre>
</figure>
<p>
As a result, we can reshape the abundances for a single species into a matrix reflecting the grid of locations over the gradient space via a simple <code>matrix()</code> call, setting the number of columns in the resultant matrix equal to the number of locations on the second gradient. By way of illustration, this approach is used below to prepare the expected abundances for four of the species in <code>mu2d</code> for plotting with the <code>persp()</code> function
</p>
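<p>
For a single species, the reshape is just the following (a sketch; species 2 is chosen arbitrarily):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## rows follow loc1 (varies fastest), columns follow loc2
z <- matrix(mu2d[, 2], ncol = length(loc2))
dim(z)  # 30 x 30 grid over the gradient space</code></pre>
</figure>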
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">layout</span><span class="p">(</span><span class="n">matrix</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">op</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">par</span><span class="p">(</span><span class="n">mar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">8</span><span class="p">,</span><span class="m">13</span><span class="p">,</span><span class="m">19</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">persp</span><span class="p">(</span><span class="n">loc1</span><span class="p">,</span><span class="w"> </span><span class="n">loc2</span><span class="p">,</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">mu2d</span><span class="p">[,</span><span class="w"> </span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">loc2</span><span class="p">)),</span><span class="w">
</span><span class="n">ticktype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"detailed"</span><span class="p">,</span><span class="w"> </span><span class="n">zlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Abundance"</span><span class="p">,</span><span class="w">
</span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">45</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">op</span><span class="p">)</span><span class="w">
</span><span class="n">layout</span><span class="p">(</span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simulating-species-abundance-data-with-the-coenocliner-package-example3-persp-plots.png" alt="Figure 5: Bivariate Gaussian species responses for four selected species." />
<figcaption>
Figure 5: Bivariate Gaussian species responses for four selected species.
</figcaption>
</figure>
<p>
The selected species response curves are shown in Figure 5.
</p>
<p>
Simulated counts for each species can be produced by removing <code>expectation = TRUE</code> from the call and choosing an error distribution to make random draws from. For example, for negative binomial errors with dispersion <span class="math inline">(\alpha = 1)</span>, we can use
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">sim2d</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coenocline</span><span class="p">(</span><span class="n">locs</span><span class="p">,</span><span class="w"> </span><span class="n">responseModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"gaussian"</span><span class="p">,</span><span class="w">
</span><span class="n">params</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pars</span><span class="p">,</span><span class="w"> </span><span class="n">extraParams</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">corr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">),</span><span class="w">
</span><span class="n">countModel</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"negbin"</span><span class="p">,</span><span class="w"> </span><span class="n">countParams</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span></code></pre>
</figure>
<p>
The resulting simulated counts for the same four selected species are shown in Figure 6, which was generated using the code below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">layout</span><span class="p">(</span><span class="n">matrix</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">op</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">par</span><span class="p">(</span><span class="n">mar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">8</span><span class="p">,</span><span class="m">13</span><span class="p">,</span><span class="m">19</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">persp</span><span class="p">(</span><span class="n">loc1</span><span class="p">,</span><span class="w"> </span><span class="n">loc2</span><span class="p">,</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">sim2d</span><span class="p">[,</span><span class="w"> </span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">loc2</span><span class="p">)),</span><span class="w">
</span><span class="n">ticktype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"detailed"</span><span class="p">,</span><span class="w"> </span><span class="n">zlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Abundance"</span><span class="p">,</span><span class="w">
</span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">45</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">op</span><span class="p">)</span><span class="w">
</span><span class="n">layout</span><span class="p">(</span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simulating-species-abundance-data-with-the-coenocliner-package-example3-persp-plots2.png" alt="Figure 6: Simulated counts using negative binomial errors from bivariate Gaussian species responses for four selected species." />
<figcaption>
Figure 6: Simulated counts using negative binomial errors from bivariate Gaussian species responses for four selected species.
</figcaption>
</figure>
<div id="refs" class="references">
<div id="ref-Allaby1998">
<p>
Allaby, M. (1998). <em>A dictionary of ecology</em>. 2nd edition. Oxford University Press.
</p>
</div>
<div id="ref-Minchin1987">
<p>
Minchin, P. R. (1987). Simulation of multidimensional community patterns: Towards a comprehensive model. <em>Vegetatio</em> 71, 145–156.
</p>
</div>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
This is not strictly true: you can work out how the species parameters are replicated relative to gradient values and hence pass a vector of the correct length with the species-specific values included. Study the output of <code>expand()</code> when supplied with gradient locations and parameters to work out how to specify <code>extraParams</code> appropriately.<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
<li id="fn2">
<p>
Again, this is not strictly true: you can work out how the species parameters are replicated relative to gradient values and hence pass a vector of the correct length with the species-specific values included. Study the output of <code>expand()</code> when supplied with gradient locations and parameters to work out how to specify <code>countParams</code> appropriately.<a href="#fnref2" class="footnote-back">↩</a>
</p>
</li>
<li id="fn3">
<p>
Until such a time as the <strong>coenocliner</strong> package has a <code>plot</code> method…<a href="#fnref3" class="footnote-back">↩</a>
</p>
</li>
<li id="fn4">
<p>
<code>countModel = "poisson"</code> is the default, so this can be excluded from the call.<a href="#fnref4" class="footnote-back">↩</a>
</p>
</li>
<li id="fn5">
<p>
Recall that in version 0.1-0 of <strong>coenocliner</strong> this parameter can only easily be specified globally.<a href="#fnref5" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
Confidence intervals for derivatives of splines in GAMs
Gavin L. Simpson
2014-06-16T00:00:00-06:00
2014-06-16T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2014/06/16/simultaneous-confidence-intervals-for-derivatives/
<p>
<a href="/2011/06/12/additive-modelling-and-the-hadcrut3v-global-mean-temperature-series/">Last time out</a> I looked at one of the complications of time series modelling with smoothers; you have a non-linear trend which may be statistically significant but it may not be increasing or decreasing everywhere. How do we identify where in the series the data are changing? In that post I explained how we can use the first derivatives of the model splines for this purpose, and used the method of finite differences to estimate them. To assess statistical significance of the derivative (the rate of change) I relied upon asymptotic normality and the usual pointwise confidence interval. That interval is fine if looking at just one point on the spline (not of much practical use), but when considering more points at once we have a multiple comparisons issue. Instead, a simultaneous interval is required, and for that we need to revisit a technique I <a href="/2011/06/12/additive-modelling-and-the-hadcrut3v-global-mean-temperature-series/">blogged about a few years ago</a>; posterior simulation from the fitted GAM.
</p>
<p>
To get a head start on this, I'll reuse the model we fitted to the <abbr title="Central England Temperature">CET</abbr> time series from the previous post. Just copy and paste the code below into your R session
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## Load the CET data and process as per other blog post</span><span class="w">
</span><span class="n">tmpf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="n">download.file</span><span class="p">(</span><span class="s2">"https://gist.github.com/gavinsimpson/b52f6d375f57d539818b/raw/2978362d97ee5cc9e7696d2f36f94762554eefdf/load-process-cet-monthly.R"</span><span class="p">,</span><span class="w">
</span><span class="n">tmpf</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"wget"</span><span class="p">)</span><span class="w">
</span><span class="n">source</span><span class="p">(</span><span class="n">tmpf</span><span class="p">)</span><span class="w">
</span><span class="c1">## Load mgcv and fit the model</span><span class="w">
</span><span class="n">require</span><span class="p">(</span><span class="s2">"mgcv"</span><span class="p">)</span><span class="w">
</span><span class="n">ctrl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">niterEM</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">msVerbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">optimMethod</span><span class="o">=</span><span class="s2">"L-BFGS-B"</span><span class="p">)</span><span class="w">
</span><span class="n">m2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cc"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Time</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="o">|</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
</span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">)</span><span class="w">
</span><span class="c1">## prediction data</span><span class="w">
</span><span class="n">want</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">cet</span><span class="p">),</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)</span><span class="w">
</span><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">cet</span><span class="p">,</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">Time</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Time</span><span class="p">[</span><span class="n">want</span><span class="p">],</span><span class="w"> </span><span class="n">Date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Date</span><span class="p">[</span><span class="n">want</span><span class="p">],</span><span class="w">
</span><span class="n">nMonth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nMonth</span><span class="p">[</span><span class="n">want</span><span class="p">]))</span></code></pre>
</figure>
<p>
Here, I'll use a version of the <code>Deriv()</code> function from the last post, modified to do the posterior simulation; it's called <code>derivSimulCI()</code>. Let's load that too
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## download the derivatives gist</span><span class="w">
</span><span class="n">tmpf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="n">download.file</span><span class="p">(</span><span class="s2">"https://gist.githubusercontent.com/gavinsimpson/ca18c9c789ef5237dbc6/raw/295fc5cf7366c831ab166efaee42093a80622fa8/derivSimulCI.R"</span><span class="p">,</span><span class="w">
</span><span class="n">tmpf</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"wget"</span><span class="p">)</span><span class="w">
</span><span class="n">source</span><span class="p">(</span><span class="n">tmpf</span><span class="p">)</span></code></pre>
</figure>
<h2 id="posterior-simulation">
Posterior simulation
</h2>
<p>
The sorts of GAMs fitted by <code>mgcv::gam()</code> are, if we assume normally distributed errors, really just a linear regression. Instead of being a linear model in the original data, however, the linear model is fitted using the basis functions as the covariates<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a>. As with any other linear model, we get back from it the point estimates, the <span class="math inline">\(\hat{\beta}_j\)</span>, and their standard errors. Consider the simple linear regression of <em>y</em> on <em>x</em>. Such a model has two terms
</p>
<ol type="1">
<li>
the constant term (the model intercept), and
</li>
<li>
the effect on <em>y</em> of a unit change in <em>x</em>.
</li>
</ol>
<p>
In fitting the model we get a point estimate for each term, plus their standard errors in the form of the variance-covariance (VCOV) matrix of the terms<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>. Taken together, the point estimates of the model terms and the VCOV describe a multivariate normal distribution. In the case of the simple linear regression, this is a bivariate normal. Note that the point estimates are known as the <em>mean vector</em> of the multivariate normal; each point estimate is the mean, or expectation, of a single random normal variable whose variance is given by the standard error of the point estimate.
</p>
<p>
Computers are good at simulating data and you'll most likely be familiar with <code>rnorm()</code> to generate random, normally distributed values from a distribution with mean 0 and unit standard deviation. Well, simulating from a multivariate normal is just as simple<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a>, as long as you have the mean vector and the variance-covariance matrix of the parameters.
</p>
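<p>
To make the step from univariate to multivariate draws concrete, here is a trivial sketch (not from the original analysis; illustrative values only) comparing the two, using an identity covariance matrix so the two components are independent:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## three draws from a univariate standard normal
rnorm(3)
## three draws from a bivariate normal: a zero mean vector and an
## identity covariance matrix, so each draw is a pair of values
MASS::mvrnorm(3, mu = c(0, 0), Sigma = diag(2))</code></pre>
</figure>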
<p>
Returning to the simple linear regression case, let's do a little simulation from a known model and look at the multivariate normal distribution of the model parameters.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">))</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1.45</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="c1">## sort dat on x to make things easier later</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">dat</span><span class="p">[</span><span class="n">order</span><span class="p">(</span><span class="n">dat</span><span class="o">$</span><span class="n">x</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">mod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">)</span></code></pre>
</figure>
<p>
The mean vector for the multivariate normal is just the set of model coefficients for <code>mod</code>, which are extracted using the <code>coef()</code> function, and the <code>vcov()</code> function is used to extract the VCOV of the fitted model.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">(Intercept) x
4.412706 1.499317 </code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="p">(</span><span class="n">vc</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vcov</span><span class="p">(</span><span class="n">mod</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> (Intercept) x
(Intercept) 0.44563330 -0.033760188
x -0.03376019 0.003114669</code></pre>
</figure>
<p>
Remember, the standard error is the square root of the diagonal elements of the VCOV
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">summary</span><span class="p">(</span><span class="n">mod</span><span class="p">))[,</span><span class="w"> </span><span class="s2">"Std. Error"</span><span class="p">]</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">(Intercept) x
0.66755771 0.05580922 </code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">diag</span><span class="p">(</span><span class="n">vc</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">(Intercept) x
0.66755771 0.05580922 </code></pre>
</figure>
<p>
The multivariate normal distribution is not part of the base R distributions set. Several implementations are available in a range of packages, but here I'll use the one in the <strong>MASS</strong> package, which ships with all versions of R. To draw a nice plot, I'll simulate a large number of values, but we'll just show the first few below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">require</span><span class="p">(</span><span class="s2">"MASS"</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">10</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">nsim</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">5000</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">sim</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mvrnorm</span><span class="p">(</span><span class="n">nsim</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">mod</span><span class="p">),</span><span class="w"> </span><span class="n">Sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">vc</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">sim</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> (Intercept) x
[1,] 4.398392 1.476528
[2,] 4.536195 1.496449
[3,] 5.327903 1.426689
[4,] 4.810953 1.446152
[5,] 4.215752 1.509888
[6,] 4.153201 1.528336</code></pre>
</figure>
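<p>
Before going further, a quick sanity check (assuming <code>sim</code> and <code>mod</code> from above): the simulated draws should be centred on the fitted coefficients.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## the column means of the draws should lie close to the point estimates
colMeans(sim)
coef(mod)</code></pre>
</figure>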
<p>
Each row of <code>sim</code> contains a pair of values, one intercept and one <span class="math inline">\(\hat{\beta}_x\)</span>, from the implied multivariate normal. The models implied by each row are all consistent with the fitted model. To visualize the multivariate normal for <code>mod</code> I'll use a bivariate kernel density estimate to estimate the density of points over a grid of simulated intercept and slope values
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">kde</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">kde2d</span><span class="p">(</span><span class="n">sim</span><span class="p">[,</span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">sim</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">75</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">plot</span><span class="p">(</span><span class="n">sim</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">19</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"darkgrey"</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">contour</span><span class="p">(</span><span class="n">kde</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">kde</span><span class="o">$</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">kde</span><span class="o">$</span><span class="n">z</span><span class="p">,</span><span class="w"> </span><span class="n">add</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">drawlabels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-confidence-intervals-for%20derivatives-contour-plot-1.png" alt="5000 random draws from the posterior distribution of the parameters of the fitted linear regression model. Contours are for a 2d kernel density estimate of the points." />
<figcaption>
5000 random draws from the posterior distribution of the parameters of the fitted linear regression model. Contours are for a 2d kernel density estimate of the points.
</figcaption>
</figure>
<p>
The large spread in the points (from top left to bottom right) is illustrative of greater uncertainty in the intercept term than in <span class="math inline">\(\hat{\beta}_x\)</span>.
</p>
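<p>
That top-left to bottom-right orientation also reflects the strong negative correlation between the two parameters, which we can confirm directly (a quick check using <code>sim</code> and <code>vc</code> from above):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## correlation between intercept and slope; both routes give roughly
## -0.91 for this fit
cor(sim)[1, 2]     ## from the simulated draws
cov2cor(vc)[1, 2]  ## from the estimated VCOV</code></pre>
</figure>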
<p>
As I said earlier, each point on the plot represents a valid model consistent with the estimates we achieved for the sample of data used to fit the model. If we were to multiply the second column of <code>sim</code> by the observed data and add on the first column of <code>sim</code>, we'd obtain fitted values for the observed <code>x</code> values for 5000 simulations from the fitted model, as shown in the plot below
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">plot</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">take</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">sim</span><span class="p">),</span><span class="w"> </span><span class="m">50</span><span class="p">)</span><span class="w"> </span><span class="c1">## take 50 simulations at random</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">fits</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">sim</span><span class="p">[</span><span class="n">take</span><span class="p">,</span><span class="w"> </span><span class="p">])</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">matlines</span><span class="p">(</span><span class="n">dat</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">fits</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"#A9A9A97D"</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"solid"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">abline</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">matlines</span><span class="p">(</span><span class="n">dat</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">interval</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"confidence"</span><span class="p">)[,</span><span class="m">-1</span><span class="p">],</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"dashed"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-confidence-intervals-for%20derivatives-linear-regression-plus-simulations-plot-1.png" alt="Fitted linear model (red line), 50 posterior simulations (grey band), and the 95% point-wise confidence interval (red dashed lines)" />
<figcaption>
Fitted linear model (red line), 50 posterior simulations (grey band), and the 95% point-wise confidence interval (red dashed lines)
</figcaption>
</figure>
<p>
The grey lines show the model fits for a random sample of 50 pairs of coefficients from the set of simulated values.
</p>
<h2 id="posterior-simulation-for-additive-models">
Posterior simulation for additive models
</h2>
<p>
You'll be pleased to know that there is very little difference (none, really) between what I just went through above for a simple linear regression and what is required to simulate from the posterior distribution of a GAM. However, instead of dealing with just a few regression coefficients, we now have to concern ourselves with the potentially much larger number of coefficients corresponding to the basis functions that combine to form the fitted splines. The only practical difference is that instead of multiplying each simulation by the observed data<a href="#fn4" class="footnote-ref" id="fnref4"><sup>4</sup></a>, with <strong>mgcv</strong> we generate the linear predictor matrix for the observations and multiply that by each set of simulated coefficients. If you've read the <a href="/2014/05/15/identifying-periods-of-change-with-gams/">previous post</a> you should be somewhat familiar with the <code>lpmatrix</code> by now.
</p>
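<p>
In outline, the whole recipe is just three steps. Here is a minimal sketch (object names as defined earlier for the CET model; the number of draws is arbitrary) that simulates complete fitted values from the model, before we narrow things down to a single smooth:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## 1. the basis functions evaluated at the prediction points
Xp <- predict(m2$gam, newdata = pdat, type = "lpmatrix")
## 2. draws from the posterior of all the model coefficients
betas <- mvrnorm(1000, mu = coef(m2$gam), Sigma = vcov(m2$gam))
## 3. one column of simulated fitted values per draw (200 x 1000)
simFit <- Xp %*% t(betas)</code></pre>
</figure>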
<p>
Before we get to posterior simulations for the derivatives of the CET additive model fitted earlier, let's look at some simulations for the trend term in that model, <code>m2</code>. If you look back at an earlier code block, I created a grid of 200 points over the range of the data which we'll use to evaluate properties of the fitted model. This is in object <code>pdat</code>. First we generate the linear predictor matrix using <code>predict()</code> and grab the model coefficients and the variance-covariance matrix of the coefficients
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">lp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m2</span><span class="o">$</span><span class="n">gam</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lpmatrix"</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">coefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">m2</span><span class="o">$</span><span class="n">gam</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">vc</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vcov</span><span class="p">(</span><span class="n">m2</span><span class="o">$</span><span class="n">gam</span><span class="p">)</span></code></pre>
</figure>
<p>
Next, generate a small sample from the posterior of the model, just for the purposes of illustration; we'll generate far larger samples later when we estimate a confidence interval on the derivatives of the trend spline.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">35</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">sim</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mvrnorm</span><span class="p">(</span><span class="m">25</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">coefs</span><span class="p">,</span><span class="w"> </span><span class="n">Sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">vc</span><span class="p">)</span></code></pre>
</figure>
<p>
The linear predictor matrix, <code>lp</code>, has a column for every basis function in the model, plus the constant term, but because the model is additive we can ignore the columns relating to the <code>nMonth</code> spline and the constant term, and just work with the coefficients and columns of <code>lp</code> that pertain to the trend spline. Let's identify those
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">want</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">grep</span><span class="p">(</span><span class="s2">"Time"</span><span class="p">,</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">lp</span><span class="p">))</span></code></pre>
</figure>
<p>
Again, a simple bit of matrix multiplication gets us fitted values for the trend spline only
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">fits</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lp</span><span class="p">[,</span><span class="w"> </span><span class="n">want</span><span class="p">]</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">sim</span><span class="p">[,</span><span class="w"> </span><span class="n">want</span><span class="p">])</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="nf">dim</span><span class="p">(</span><span class="n">fits</span><span class="p">)</span><span class="w"> </span><span class="c1">## 200 rows, 1 per evaluation point; 25 columns, 1 per simulation</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] 200 25</code></pre>
</figure>
<p>
We can now draw out each of these posterior simulations as follows
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">ylims</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">range</span><span class="p">(</span><span class="n">fits</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">plot</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">19</span><span class="p">,</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ylims</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"n"</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">matlines</span><span class="p">(</span><span class="n">pdat</span><span class="o">$</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">fits</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"solid"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-confidence-intervals-for%20derivatives-cet-model-trend-posterior-simulations-1.png" alt="Posterior simulations for the trend spline of the additive model fitted to the CET time series" />
<figcaption>
Posterior simulations for the trend spline of the additive model fitted to the CET time series
</figcaption>
</figure>
<h2 id="posterior-simulation-for-the-first-derivatives-of-a-spline">
Posterior simulation for the first derivatives of a spline
</h2>
<p>
As we saw in the previous post, the linear predictor matrix can be used to generate finite differences-based estimates of the derivatives of a spline in a GAM fitted by <strong>mgcv</strong>. And as we just went through, we can combine posterior simulations with the linear predictor matrix. The main steps in the process of computing the finite differences and doing the posterior simulation are
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">X0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">newDF</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lpmatrix"</span><span class="p">)</span><span class="w">
</span><span class="n">newDF</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">newDF</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">eps</span><span class="w">
</span><span class="n">X1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">newDF</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lpmatrix"</span><span class="p">)</span><span class="w">
</span><span class="n">Xp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="n">X1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">X0</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">eps</span></code></pre>
</figure>
<p>
where two linear predictor matrices are created, offset from one another by a small amount <code>eps</code>, and differenced to get the slope of the spline, and
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">nt</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">Xi</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Xp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">want</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">grep</span><span class="p">(</span><span class="n">t.labs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">X1</span><span class="p">))</span><span class="w">
</span><span class="n">Xi</span><span class="p">[,</span><span class="w"> </span><span class="n">want</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Xp</span><span class="p">[,</span><span class="w"> </span><span class="n">want</span><span class="p">]</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Xi</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">simu</span><span class="p">[,</span><span class="w"> </span><span class="n">want</span><span class="p">])</span><span class="w"> </span><span class="c1"># derivatives</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
which loops over the terms in the model, selects the relevant columns from the differenced predictor matrix, and computes the derivatives by a matrix multiplication with the set of posterior simulations. <code>simu</code> is the matrix of random draws from the posterior, multivariate normal distribution of the fitted model's parameters. Note that the code in <code>derivSimulCI()</code> is slightly different to this, but it does the same thing.
</p>
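<p>
For a single term you can skip the loop entirely. The following sketch (reusing <code>m2</code> and <code>pdat</code> from above; it is not the exact internals of <code>derivSimulCI()</code>) computes simulated first derivatives of the trend spline directly:
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">eps <- 1e-07
X0  <- predict(m2$gam, pdat, type = "lpmatrix")
## shift the continuous covariates by eps and re-evaluate the basis
X1  <- predict(m2$gam, transform(pdat, Time = Time + eps, nMonth = nMonth + eps),
               type = "lpmatrix")
Xp  <- (X1 - X0) / eps
want <- grep("Time", colnames(Xp))
## posterior draws of the coefficients, then derivatives by matrix multiply
simu <- mvrnorm(10000, mu = coef(m2$gam), Sigma = vcov(m2$gam))
dsim <- Xp[, want] %*% t(simu[, want])  ## 200 rows x 10000 simulations</code></pre>
</figure>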
<p>
To cut to the chase then, here is the code required to generate posterior simulations for the first derivatives of the spline terms in an additive model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">fd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">derivSimulCI</span><span class="p">(</span><span class="n">m2</span><span class="p">,</span><span class="w"> </span><span class="n">samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">)</span></code></pre>
</figure>
<p>
<code>fd</code> is a list, the first <em>n</em> components of which relate to the <em>n</em> smooth terms in the model. Here <em>n</em> = 2. The names of the first two components are the names of the terms referenced in the model formula used to fit the model
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">str</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">List of 5
$ nMonth :List of 2
$ Time :List of 2
$ gamModel:List of 31
..- attr(*, "class")= chr "gam"
$ eps : num 1e-07
$ eval : num [1:200, 1:2] 1 1.06 1.11 1.17 1.22 ...
..- attr(*, "dimnames")=List of 2
- attr(*, "class")= chr "derivSimulCI"</code></pre>
</figure>
<p>
As I haven't yet written a <code>confint()</code> method, we'll need to compute the confidence interval by hand, which is no bad thing of course! We do this by taking two extreme quantiles of the distribution of the 10,000 posterior simulations we generated for the first derivative <em>at each</em> of the 200 points at which we wanted to evaluate the derivative. One of the reasons I did 10,000 simulations is that for a 95% confidence interval we only need to sort the simulated derivatives in ascending order and extract the 250th and the 9750th of these ordered values. In practice we'll let the <code>quantile()</code> function do the hard work
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">CI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lapply</span><span class="p">(</span><span class="n">fd</span><span class="p">[</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">],</span><span class="w">
</span><span class="o">+</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">x</span><span class="o">$</span><span class="n">simulations</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">quantile</span><span class="p">,</span><span class="w">
</span><span class="o">+</span><span class="w"> </span><span class="n">probs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="m">0.975</span><span class="p">)))</span></code></pre>
</figure>
<p>
<code>CI</code> is now a list with two components, each of which contains a matrix with two rows (the two probability quantiles we asked for) and 200 columns (the number of locations at which the first derivative was evaluated).
</p>
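<p>
The ordered-values shortcut mentioned above is easy to verify for any single evaluation point (a quick check, assuming <code>fd</code> from above):
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r">## at the first evaluation point of the trend spline, sort the 10,000
## simulated derivatives and read off the 250th and 9750th values
srt <- sort(fd[[2]]$simulations[1, ])
c(lower = srt[250], upper = srt[9750])  ## close to CI[[2]][, 1]</code></pre>
</figure>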
<p>
There is a <code>plot()</code> method, which by default produces plots of all the terms in the model and includes the <del>simultaneous</del> point-wise confidence interval as well
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">plot</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span><span class="w"> </span><span class="n">sizer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-confidence-intervals-for%20derivatives-plot-deriv-method-1.png" alt="First derivative of the seasonal and trend splines from the CET time series additive model. The grey band is a 95% simultaneous point-wise confidence interval. Sections of the spline where the confidence interval does not include zero are indicated by coloured sections." />
<figcaption>
First derivative of the seasonal and trend splines from the CET time series additive model. The grey band is a 95% <del>simultaneous</del> point-wise confidence interval. Sections of the spline where the confidence interval does not include zero are indicated by coloured sections.
</figcaption>
</figure>
<h2 id="wrapping-up">
Wrapping up
</h2>
<p>
<code>derivSimulCI()</code> computes the actual derivative as well as the derivatives for each simulation. Rather than rely upon the <code>plot()</code> method we could draw our own plot with the confidence interval. To extract the derivative of the fitted spline use
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">fit.fd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fd</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="o">$</span><span class="n">deriv</span></code></pre>
</figure>
<p>
and then, to produce a plot showing the actual derivative, the 95% <del>simultaneous</del> point-wise confidence interval, and 20 of the derivatives from the posterior simulations, we can use
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="o">></span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">76</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">take</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">nrow</span><span class="p">(</span><span class="n">fd</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="o">$</span><span class="n">simulations</span><span class="p">),</span><span class="w"> </span><span class="m">20</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">plot</span><span class="p">(</span><span class="n">pdat</span><span class="o">$</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">fit.fd</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"l"</span><span class="p">,</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">range</span><span class="p">(</span><span class="n">CI</span><span class="p">[[</span><span class="m">2</span><span class="p">]]),</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">matlines</span><span class="p">(</span><span class="n">pdat</span><span class="o">$</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">CI</span><span class="p">[[</span><span class="m">2</span><span class="p">]]),</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"dashed"</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">matlines</span><span class="p">(</span><span class="n">pdat</span><span class="o">$</span><span class="n">Date</span><span class="p">,</span><span class="w"> </span><span class="n">fd</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="o">$</span><span class="n">simulations</span><span class="p">[,</span><span class="w"> </span><span class="n">take</span><span class="p">],</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"solid"</span><span class="p">,</span><span class="w">
</span><span class="o">+</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"grey"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/simultaneous-confidence-intervals-for%20derivatives-plot-deriv-ci-by-hand-1.png" alt="First derivative of the trend spline from the CET time series additive model. The red dashed lines enclose the 95% simultaneous point-wise confidence interval. Superimposed are the first derivatives of the splines for 20 randomly selected posterior simulations from the fitted spline." />
<figcaption>
First derivative of the trend spline from the CET time series additive model. The red dashed lines enclose the 95% <del>simultaneous</del> point-wise confidence interval. Superimposed are the first derivatives of the splines for 20 randomly selected posterior simulations from the fitted spline.
</figcaption>
</figure>
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
It is a little bit more complex than this, of course. If you allow <code>gam()</code> to select the degree of smoothness then you need to fit a penalized regression. Plus, the time series models fitted to the CET data aren't fitted via <code>gam()</code> but via <code>gamm()</code>, where we are using the observation that a penalized regression can be expressed as a linear mixed model, with random effects being used to represent some of the penalty terms. If you specify the degree of smoothing to use, these complications go away.<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
<li id="fn2">
<p>
The (squares of the) standard errors are on the diagonal of the VCOV, with the relationship between pairs of parameters being contained in the off-diagonal elements.<a href="#fnref2" class="footnote-back">↩</a>
</p>
</li>
<li id="fn3">
<p>
in practice. I suspect it is not quite so simple if one had to sit down and implement it…<a href="#fnref3" class="footnote-back">↩</a>
</p>
</li>
<li id="fn4">
<p>
or a set of new values at which you want to evaluate the confidence interval<a href="#fnref4" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>
Identifying periods of change in time series with GAMs
Gavin L. Simpson
2014-05-15T00:00:00-06:00
2014-05-15T00:00:00-06:00
https://www.fromthebottomoftheheap.net/2014/05/15/identifying-periods-of-change-with-gams/
<p>
In previous posts (<a href="/2011/06/12/additive-modelling-and-the-hadcrut3v-global-mean-temperature-series/">here</a> and <a href="/2011/07/21/smoothing-temporally-correlated-data/">here</a>) I looked at how generalized additive models (GAMs) can be used to model non-linear trends in time series data. In my previous <a href="/2014/05/09/modelling-seasonal-data-with-gam/">post</a> I extended the modelling approach to deal with seasonal data where we model both the within year (seasonal) and between year (trend) variation with separate smooth functions. One of the complications of time series modelling with smoothers is how to summarize the fitted model; you have a non-linear trend which may be statistically significant but it may not be increasing or decreasing everywhere. How do we identify where in the series the data are changing? That's the topic of this post, in which I'll use the method of finite differences to estimate the rate of change (slope) in the fitted smoother and, through some <strong>mgcv</strong> magic, use the information recorded in the fitted model to identify periods of statistically significant change in the time series.
</p>
<h2 id="catching-up">
Catching up
</h2>
<p>
First off, if you haven't already done so, go read my <a href="/2014/05/09/modelling-seasonal-data-with-gam/">post on modelling seasonal data with GAMs</a> as it provides the background info on the data and the model I'll be looking at here, and also explains how we fitted the model. Don't worry; you don't need to run all that code to follow this post as I have put the relevant data processing parts in a Github <a href="https://gist.github.com/gavinsimpson/b52f6d375f57d539818b">gist</a> that we'll download and <code>source()</code> shortly.
</p>
<p>
OK, now that you've read the previous post we can begin…
</p>
<p>
To bring you up to speed, I put the bits of code from the previous post that we need here in a <a href="https://gist.github.com/gavinsimpson/b52f6d375f57d539818b">gist</a> on Github. To run this code you can simply do the following
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## Load the CET data and process as per other blog post</span><span class="w">
</span><span class="n">tmpf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="n">download.file</span><span class="p">(</span><span class="s2">"https://gist.github.com/gavinsimpson/b52f6d375f57d539818b/raw/2978362d97ee5cc9e7696d2f36f94762554eefdf/load-process-cet-monthly.R"</span><span class="p">,</span><span class="w">
</span><span class="n">tmpf</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"wget"</span><span class="p">)</span><span class="w">
</span><span class="n">source</span><span class="p">(</span><span class="n">tmpf</span><span class="p">)</span><span class="w">
</span><span class="n">ls</span><span class="p">()</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">[1] "annCET" "cet" "CET" "rn" "tmpf" "Years" </code></pre>
</figure>
<p>
The gist contains code to download and process the monthly Central England Temperature (CET) time series so that it is ready for analysis via <code>gamm()</code>. Next we fit an additive model with seasonal and trend smooths and an AR(2) process for the residuals; the code predicts from the model at 200 locations over the entire time series and generates a pointwise, approximate 95% confidence interval on the trend spline.
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">ctrl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">niterEM</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">msVerbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">optimMethod</span><span class="o">=</span><span class="s2">"L-BFGS-B"</span><span class="p">)</span><span class="w">
</span><span class="n">m2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gamm</span><span class="p">(</span><span class="n">Temperature</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">nMonth</span><span class="p">,</span><span class="w"> </span><span class="n">bs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"cc"</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">12</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">s</span><span class="p">(</span><span class="n">Time</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cet</span><span class="p">,</span><span class="w"> </span><span class="n">correlation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">corARMA</span><span class="p">(</span><span class="n">form</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="m">1</span><span class="o">|</span><span class="n">Year</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
</span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ctrl</span><span class="p">)</span><span class="w">
</span><span class="n">want</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">cet</span><span class="p">),</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)</span><span class="w">
</span><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">cet</span><span class="p">,</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">Time</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Time</span><span class="p">[</span><span class="n">want</span><span class="p">],</span><span class="w"> </span><span class="n">Date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Date</span><span class="p">[</span><span class="n">want</span><span class="p">],</span><span class="w">
</span><span class="n">nMonth</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nMonth</span><span class="p">[</span><span class="n">want</span><span class="p">]))</span><span class="w">
</span><span class="n">p2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">m2</span><span class="o">$</span><span class="n">gam</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"terms"</span><span class="p">,</span><span class="w"> </span><span class="n">se.fit</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w"> </span><span class="n">p2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p2</span><span class="o">$</span><span class="n">fit</span><span class="p">[,</span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">se2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p2</span><span class="o">$</span><span class="n">se.fit</span><span class="p">[,</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="n">df.res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">df.residual</span><span class="p">(</span><span class="n">m2</span><span class="o">$</span><span class="n">gam</span><span class="p">)</span><span class="w">
</span><span class="n">crit.t</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">qt</span><span class="p">(</span><span class="m">0.025</span><span class="p">,</span><span class="w"> </span><span class="n">df.res</span><span class="p">,</span><span class="w"> </span><span class="n">lower.tail</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">pdat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">transform</span><span class="p">(</span><span class="n">pdat</span><span class="p">,</span><span class="w">
</span><span class="n">upper</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">crit.t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se2</span><span class="p">),</span><span class="w">
</span><span class="n">lower</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">(</span><span class="n">crit.t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">se2</span><span class="p">))</span></code></pre>
</figure>
<p>
Note that I didn't compute the confidence interval last time out, but I'll use it here to augment the plots of the trend produced later.
</p>
<h2 id="first-derivatives-and-finite-differences">
First derivatives and finite differences
</h2>
<p>
One measure of change in a system is that the rate of change of the system is non-zero. If we had a simple linear regression model for a trend, then the estimated rate of change would be <span class="math inline">\(\hat{\beta}_1\)</span>, the slope of the regression line. Then, because we're being all statistical, we can ask questions such as whether the non-zero estimate we might obtain for the slope is distinguishable from zero given the uncertainty in the estimate. This slope, the estimate <span class="math inline">\(\hat{\beta}_1\)</span>, is the first derivative of the regression line. Technically, it is the instantaneous rate of change of the function that defines the line, an idea that can be extended to any function, even one as potentially complex as a fitted spline function.
</p>
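<p>
To make the idea concrete, here is a minimal sketch (using simulated data, not the CET series; all object names are mine) showing that for a straight-line model the estimated first derivative is just the slope coefficient, constant everywhere along <em>x</em>:
</p>
<figure class="highlight">
<pre><code class="language-r">## simulated example: the first derivative of a fitted straight line
## is the slope coefficient, the same at every value of x
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- 2 + 0.5 * x + rnorm(100, sd = 0.5)
lmod <- lm(y ~ x)
coef(lmod)[2]       # estimated slope == first derivative
confint(lmod, "x")  # does the interval exclude zero?</code></pre>
</figure>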
<p>
The problem we have, though, is that in general we don't have an equation for the spline from which we can derive the derivatives. So how do we estimate the derivative of a spline function in our additive model? One solution is to use the method of finite differences, the essentials of which are displayed in the plots below
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/identifying-periods-of-change-with-gams-finite-differences-plot.png" alt="Illustration of the finite differences approach to estimating the derivatives of a function" />
<figcaption>
Illustration of the finite differences approach to estimating the derivatives of a function
</figcaption>
</figure>
<p>
If you have heard of the <a href="http://en.wikipedia.org/wiki/Trapezoidal_rule"><em>trapezoidal rule</em></a> for approximating the integral of a function, finite differences will be very familiar; instead of being interested in the area under the function, we're interested in estimating the slope of the function at any point. Consider the situation in the left-hand plot above. The thick curve is the fitted spline for the trend component of the additive model <code>m2</code>. Superimposed on this curve are five points, between pairs of which we can approximate the first derivative of the function. We know the distance along the <em>x</em> axis (time) between each pair of points, and we can evaluate the value of the function on the <em>y</em> axis by predicting from <code>m2</code> at the indicated time points. It is then trivial to compute the slope <span class="math inline">\(m\)</span> of the lines connecting the points as the change in the <em>y</em> direction divided by the change in the <em>x</em> direction, or
</p>
<p>
<span class="math display">\[ m = \frac{\Delta y}{\Delta x} \]</span>
</p>
<p>
As you can see from the left-hand plot, in some parts of the function the derivative is reasonably well approximated by this crude method using just five points, but in most places this approach either under- or over-estimates the derivative of the function. The solution is to evaluate the slope at more points, located closer together on the function. The right-hand plot above shows the finite difference method for 20 points, which for all intents and purposes produces an acceptable estimate of the first derivative of the function. We can increase the accuracy of our derivative estimates by using points that are closer and closer together. In the limit the points would be infinitely close and we'd know the first derivative exactly, but on a computer we can't reach that limit with the finite difference method.
</p>
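<p>
The following toy example mirrors the two panels above; <code>f()</code> is a known function standing in for the fitted spline (nothing here comes from the CET model), so we can compare the crude five-point approximation with a finer one:
</p>
<figure class="highlight">
<pre><code class="language-r">## finite-difference slopes of a known function
f <- function(x) sin(2 * x)
fd <- function(n) {
    x <- seq(0, pi, length.out = n)
    diff(f(x)) / diff(x)  # slope between successive pairs of points
}
fd(5)   # crude approximation, as in the left-hand panel
fd(20)  # much closer to the true derivative, 2 * cos(2 * x)</code></pre>
</figure>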
<p>
To sum up, we can approximate the first derivative of a fitted spline by choosing a set of points <span class="math inline">\(p\)</span> on the function and a second set of points <span class="math inline">\(p'\)</span> positioned a very small distance (say <span class="math inline">\(10^{-5}\)</span>) from the first set. Using the <code>predict()</code> method with <code>type = "lpmatrix"</code> plus some other <strong>mgcv</strong> magic, we can evaluate the fitted trend spline at the locations <span class="math inline">\(p\)</span> and <span class="math inline">\(p'\)</span> and compute the change in the function between the pairs of points, and <em>voilà</em>, we have our estimate of the first derivative of the fitted spline.
</p>
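<p>
Before getting to the <code>lpmatrix</code> machinery, note that the same finite difference can be computed directly from the fitted values; here is a sketch reusing the <code>pdat</code> data frame created earlier (this version gives only the point estimates, no standard errors):
</p>
<figure class="highlight">
<pre><code class="language-r">## crude version: finite differences on the fitted values themselves;
## nudging Time alone leaves the seasonal term untouched, so the
## difference reflects the trend spline (plus the constant intercept)
eps <- 1e-5
f0 <- predict(m2$gam, newdata = pdat)
f1 <- predict(m2$gam, newdata = transform(pdat, Time = Time + eps))
fd.deriv <- (f1 - f0) / eps  # approximate derivative w.r.t. Time</code></pre>
</figure>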
<h2 id="confidence-intervals-on-derivatives">
Confidence intervals on derivatives…?
</h2>
<p>
Earlier, when discussing the simple linear regression line, I touched on the issue of error or uncertainty in the estimate of the slope parameter <span class="math inline">\(\hat{\beta}_1\)</span> and how we allow for this in deciding whether the estimated rate of change is different from zero (technically: whether zero is a likely value for the slope given the model). The same issue concerns us with the first derivatives of the fitted spline; we want to know where on the function the function is changing sufficiently quickly that we can distinguish this change from no change, given our uncertainty about the estimated function.
</p>
<p>
Thankfully, with a little bit of magic from <strong>mgcv</strong> we can compute the uncertainty in the estimates of the first derivative of the spline. I've encapsulated this in some functions that are currently only available as a <a href="https://gist.github.com/gavinsimpson/e73f011fdaaab4bb5a30">gist</a> on Github. I intend to package these up at some point with other functions for working with GAM(M) time series models. I'm also grateful to Simon Wood, author of <strong>mgcv</strong>, who provided the initial code to compute the derivatives, which I have generalized somewhat to allow the user to specify particular terms and to identify the correct terms from the model without having to count columns in the smoother prediction matrix.
</p>
<p>
To load the functions into R from Github use
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="c1">## download the derivatives gist</span><span class="w">
</span><span class="n">tmpf</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tempfile</span><span class="p">()</span><span class="w">
</span><span class="n">download.file</span><span class="p">(</span><span class="s2">"https://gist.github.com/gavinsimpson/e73f011fdaaab4bb5a30/raw/82118ee30c9ef1254795d2ec6d356a664cc138ab/Deriv.R"</span><span class="p">,</span><span class="w">
</span><span class="n">tmpf</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"wget"</span><span class="p">)</span><span class="w">
</span><span class="n">source</span><span class="p">(</span><span class="n">tmpf</span><span class="p">)</span><span class="w">
</span><span class="n">ls</span><span class="p">()</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text"> [1] "annCET" "cet" "CET" "confint.Deriv"
[5] "crit.t" "ctrl" "Deriv" "df.res"
[9] "m2" "m2.d" "m2.dci" "m2.dsig"
[13] "op" "p2" "pdat" "plot.Deriv"
[17] "rn" "signifD" "take" "take2"
[21] "Term" "tmpf" "want" "Years"
[25] "ylab" "ylim" </code></pre>
</figure>
<p>
I don't intend to explain all the code behind those functions, but the salient parts of the derivative computation are
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">X0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">newDF</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lpmatrix"</span><span class="p">)</span><span class="w">
</span><span class="n">newDF</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">newDF</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">eps</span><span class="w">
</span><span class="n">X1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">mod</span><span class="p">,</span><span class="w"> </span><span class="n">newDF</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"lpmatrix"</span><span class="p">)</span><span class="w">
</span><span class="n">Xp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="n">X1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">X0</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">eps</span></code></pre>
</figure>
<p>
Here, <code>newDF</code> is a data frame of points along <em>x</em> at which we wish to evaluate the derivative, and <code>eps</code> is the distance along <em>x</em> by which we nudge those points to give the locations <span class="math inline">\(p'\)</span>. <code>type = "lpmatrix"</code> forces <code>predict()</code> to return a matrix which, when multiplied by the vector of model coefficients, yields values for the linear predictor of the model. The useful thing about this representation is that we can derive uncertainty information for the fitted values or for quantities computed from the fitted model. Here we subtract one <code>lpmatrix</code> from another, divide by <code>eps</code>, and proceed to do inference on those "slopes" directly.
</p>
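<p>
A useful sanity check on this representation, assuming the <code>mod</code> and <code>newDF</code> objects exist as in the snippet above (a sketch, not part of the gist): multiplying the <code>lpmatrix</code> by the coefficient vector reproduces the linear predictor that <code>predict()</code> returns.
</p>
<figure class="highlight">
<pre><code class="language-r">## after the nudge, newDF holds the locations used to build X1, so the
## lpmatrix times the coefficient vector matches predict()'s output
eta <- X1 %*% coef(mod)
all.equal(as.numeric(eta), as.numeric(predict(mod, newDF)))  # TRUE</code></pre>
</figure>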
<p>
The other critical bit of code is
</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="nf">seq_len</span><span class="p">(</span><span class="n">nt</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">Xi</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Xp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">want</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">grep</span><span class="p">(</span><span class="n">t.labs</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">X1</span><span class="p">))</span><span class="w">
</span><span class="n">Xi</span><span class="p">[,</span><span class="w"> </span><span class="n">want</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Xp</span><span class="p">[,</span><span class="w"> </span><span class="n">want</span><span class="p">]</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Xi</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">mod</span><span class="p">)</span><span class="w"> </span><span class="c1"># derivatives</span><span class="w">
</span><span class="n">df.sd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rowSums</span><span class="p">(</span><span class="n">Xi</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">mod</span><span class="o">$</span><span class="n">Vp</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">Xi</span><span class="p">)</span><span class="o">^</span><span class="m">.5</span><span class="w">
</span><span class="n">lD</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">deriv</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">se.deriv</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df.sd</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre>
</figure>
<p>
This is where we compute the actual derivatives and their standard errors. This is done in a loop as we have to work separately with the columns of the <code>lpmatrix</code> that relate to each spline. The first three lines of code within the loop are housekeeping that handle this part, with <code>Xi</code> being a matrix into which we insert the relevant values from the <code>lpmatrix</code> for the current term, and which contains zeroes elsewhere. <code>Xi %*% coef(mod)</code> gives us the values of the derivative via a matrix multiplication with the model coefficients. Because the irrelevant columns are zero, we don't need to subset the coefficient vector, though we could make this more efficient by avoiding the copies of <code>Xp</code> with some further subsetting if we wished.
</p>
<p>
The variances for the terms relating to the current spline are computed using <code>Xi %*% mod$Vp * Xi</code>, where <code>mod$Vp</code> is the covariance matrix of the (fixed effects) parameters of the GAM(M). The rows are summed and their square roots taken to give the standard error of the derivative of the entire spline, and not of each of the basis functions that comprise the spline. The last line of the loop just stores the computed derivatives and standard errors in a list that will form part of the object returned by the <code>Deriv()</code> function.
</p>
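<p>
The <code>rowSums()</code> idiom deserves a line of explanation: it computes the diagonal of <span class="math inline">\(X_i V_p X_i^{\mathsf{T}}\)</span>, the pointwise variance of <code>Xi %*% coef(mod)</code>, without ever forming the full <em>n</em> by <em>n</em> matrix. A toy demonstration with made-up matrices:
</p>
<figure class="highlight">
<pre><code class="language-r">## rowSums((Xi %*% Vp) * Xi) == diag(Xi %*% Vp %*% t(Xi))
Xi <- matrix(rnorm(6), nrow = 2)             # toy 2 x 3 "lpmatrix"
Vp <- crossprod(matrix(rnorm(9), nrow = 3))  # toy 3 x 3 covariance matrix
all.equal(rowSums((Xi %*% Vp) * Xi),
          diag(Xi %*% Vp %*% t(Xi)))         # TRUE</code></pre>
</figure>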
<p>
With that, most of the hard work is done. The <code>confint.Deriv()</code> method will compute confidence intervals for the derivatives from their standard errors and some information on the fitted GAM(M); the coverage of these intervals is given by <span class="math inline">\((1 - \alpha)\)</span>, with <span class="math inline">\(\alpha\)</span> commonly set to 0.05, though this can be controlled via argument <code>alpha</code>. It is worth noting that these confidence intervals are of the usual pointwise flavour; they would be fine if you only looked at the interval for a single point in isolation, but they can't be correct when we look at the entire spline. The reason is the familiar problem of multiple comparisons: by looking at the entire spline we are, in effect, making a great many comparisons at once. We could compute simultaneous intervals, but I haven't written the functions to do that just yet; it isn't difficult, as we can simulate from the fitted model and derive simultaneous intervals that way. That will have to wait for another post at a future date though!
</p>
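<p>
For concreteness, here is a sketch of the pointwise interval computation, using the <code>df</code> and <code>df.sd</code> quantities from the loop shown earlier (I use a normal critical value here; the actual <code>confint.Deriv()</code> code may differ in detail):
</p>
<figure class="highlight">
<pre><code class="language-r">## pointwise (1 - alpha) confidence interval for the derivative
alpha <- 0.05
crit <- qnorm(1 - alpha / 2)
deriv.ci <- data.frame(deriv = drop(df),
                       lower = drop(df) - crit * df.sd,
                       upper = drop(df) + crit * df.sd)</code></pre>
</figure>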
<p>
There are two further trivial functions contained in the gist:
</p>
<ul>
<li>
<p>
<code>signifD()</code> plows through the points at which we evaluated the derivative and looks for locations where the pointwise <span class="math inline">\((1 - \alpha)\)</span> confidence interval doesn't include zero. Where zero is contained within the confidence interval the function returns an <code>NA</code>, and for points where zero isn't included the value of the function (or the derivative, depending on what you supplied to <code>signifD()</code>) is returned.
</p>
The function gives you back a list with two components, <code>incr</code> and <code>decr</code>, which contain the locations where the estimated derivative is positive or negative, respectively, and zero is not contained in the confidence interval. As you'll see soon, we can use these two separate lists to colour the increasing and decreasing parts of the fitted spline; a sketch of the underlying logic is given after this list.
</li>
<li>
<p>
<code>plot.Deriv()</code> is an S3 method for <code>plot()</code> which will plot the estimated derivatives and associated confidence intervals. You can do this for selected terms via argument <code>term</code>. The confidence interval can be displayed as lines or a solid polygon via argument <code>polygon</code>. Other graphical parameters can be passed along via <code>…</code>.
</p>
<p>
The one interesting argument is <code>sizer</code>, which if set to <code>TRUE</code> will colour increasing parts of the spline in blue and decreasing parts in red (where zero is not contained in the confidence interval). These colours come from the <a href="http://www.unc.edu/~marron/DataAnalyses/SiZer_Intro.html">siZer</a> method.
</p>
</li>
</ul>
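<p>
As promised above, here is a hypothetical re-implementation of the logic behind <code>signifD()</code>; the gist's actual code may differ in its details, but the idea is simply masking with <code>NA</code> wherever the interval spans the test value:
</p>
<figure class="highlight">
<pre><code class="language-r">## sketch of the signifD() idea; `eval` is the value tested for
## inclusion in the interval (zero by default)
signifD2 <- function(x, d, upper, lower, eval = 0) {
    spansEval <- upper > eval & lower < eval   # interval includes `eval`
    list(incr = ifelse(!spansEval & d > eval, x, NA),
         decr = ifelse(!spansEval & d < eval, x, NA))
}</code></pre>
</figure>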
<h2 id="back-to-the-cet-example">
Back to the CET example
</h2>
<p>
With that taken care of, we can use the functions to compute derivatives, confidence intervals, and ancillary information for the trend spline in model <code>m2</code> with a few simple lines of code
</p>
<figure class="highlight">
<pre><code class="language-r">Term <- "Time"
m2.d <- Deriv(m2)
m2.dci <- confint(m2.d, term = Term)
m2.dsig <- signifD(pdat$p2, d = m2.d[[Term]]$deriv,
                   m2.dci[[Term]]$upper, m2.dci[[Term]]$lower)</code></pre>
</figure>
<p>
Most of that is self-explanatory, but the <code>signifD()</code> call, in lieu of any documentation yet, deserves some explanation. The first argument is the thing you want returned in the <code>incr</code> and <code>decr</code> components; in this case I use the contribution to the fitted values for the trend spline alone, <code>pdat$p2</code>. The argument <code>d</code> is the vector of derivatives for a single term in the model. The two remaining arguments<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> are <code>upper</code> and <code>lower</code>, which need to be supplied with the upper and lower bounds of the confidence interval respectively. Extracting all of that is a bit redundant when we could pass in the object returned by <code>confint.Deriv()</code>, but I was thinking more simply when I wrote <code>signifD()</code> in a few dark days of number crunching when people were waiting on me to deliver some results.
</p>
<p>
A quick plot of the first derivative of the spline is achieved using the <code>plot()</code> method described above; the result is shown below.
</p>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/identifying-periods-of-change-with-gams-first-derivative-plot.png" alt="First derivative of the fitted trend spline from the additive model with AR(2) errors" />
<figcaption>
First derivative of the fitted trend spline from the additive model with AR(2) errors
</figcaption>
</figure>
<p>
From the plot it is clear that there are two periods of (statistically) significant change, both increases, as shown by the blue indicators. However, with a plot like the one above you really have to dial in your brain to thinking about an increasing (decreasing) trend in the data when the derivative line is above (below) the zero line, even if the derivative itself is decreasing (increasing). As such, I have found an alternative display useful, one that combines a plot of the estimated trend spline with periods of significant change indicated by a thicker line, with or without the siZer colours. You can see examples of such plots in a <a href="/2013/10/23/time-series-plots-with-lattice-and-ggplot/">previous post</a>.
</p>
<p>
That <a href="/2013/10/23/time-series-plots-with-lattice-and-ggplot/">post</a> used the <strong>lattice</strong> and <strong>ggplot2</strong> packages to draw the plots. I still find it easier to just bash out a base graphics plot for such things unless I need the faceting features offered by those packages.
Below I take such an approach and build a base graphics plot up from a series of plotting calls that successively augment the plot, starting with the estimated trend spline and pointwise confidence interval, then two calls to superimpose the periods of significant increase and decrease (although the latter doesn't actually draw anything in this plot).
</p>
<figure class="highlight">
<pre><code class="language-r">ylim <- with(pdat, range(upper, lower, p2))
ylab <- expression(Temperature ~ (degree*C * ":" ~ centred))

plot(p2 ~ Date, data = pdat, type = "n", ylab = ylab, ylim = ylim)
lines(p2 ~ Date, data = pdat)
lines(upper ~ Date, data = pdat, lty = "dashed")
lines(lower ~ Date, data = pdat, lty = "dashed")
lines(unlist(m2.dsig$incr) ~ Date, data = pdat, col = "blue", lwd = 3)
lines(unlist(m2.dsig$decr) ~ Date, data = pdat, col = "red", lwd = 3)</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/identifying-periods-of-change-with-gams-plot-trend-with-sizer.png" alt="Fitted trend showing periods of significant increase in temperature in blue." />
<figcaption>
Fitted trend showing periods of significant increase in temperature in blue.
</figcaption>
</figure>
<p>
When looking at a plot like this, it is always important to have in the back of your mind a picture of the original data and how the variance is decomposed between the seasonal, trend, and residual terms, as well as the <em>effect size</em> associated with the trend and seasonal splines. You get a very different impression of the magnitude of the trend from the plot shown below, which contains the original data as well as the fitted trend spline.
</p>
<figure class="highlight">
<pre><code class="language-r">plot(Temperature - mean(Temperature) ~ Date, data = cet, type = "n",
     ylab = ylab)
points(Temperature - mean(Temperature) ~ Date, data = cet,
       col = "lightgrey", pch = 16, cex = 0.7)
lines(p2 ~ Date, data = pdat)
lines(upper ~ Date, data = pdat, lty = "dashed")
lines(lower ~ Date, data = pdat, lty = "dashed")
lines(unlist(m2.dsig$incr) ~ Date, data = pdat, col = "blue", lwd = 3)
lines(unlist(m2.dsig$decr) ~ Date, data = pdat, col = "red", lwd = 3)</code></pre>
</figure>
<figure>
<img src="https://www.fromthebottomoftheheap.net/assets/img/posts/identifying-periods-of-change-with-gams-plot-trend-with-sizer-and-data.png" alt="Fitted trend showing periods of significant increase in temperature in blue. The original data are shown as grey points. Note the low precision of the data early in the CET series." />
<figcaption>
Fitted trend showing periods of significant increase in temperature in blue. The original data are shown as grey points. Note the low precision of the data early in the CET series.
</figcaption>
</figure>
<p>
Yet, whilst small compared to the seasonal amplitude in temperature, the CET exhibits about a 1°C increase in mean temperature over the past 150 or so years. Not something to dismiss as trivial.
</p>
<h2 id="summing-up">
Summing up
</h2>
<p>
In this post I've attempted to explain the method of finite differences for estimating the derivatives of a function, such as the splines of an additive model. This was illustrated using custom functions that extract information from the fitted model and exploit features of the <strong>mgcv</strong> package to estimate standard errors for quantities derived from the fitted model, such as the first derivatives used here.
</p>
<p>
So far we've looked at fitting an additive model with two smoothers to seasonal data, the use of cyclic splines to represent cyclical features of the data, and the estimation of an appropriate ARMA correlation structure to account for serial dependence in the data. Here we've seen how to identify periods of significant change in the trend term using the first derivative of the fitted spline. Where do we go from here?
</p>
<p>
One important improvement is to turn the pointwise confidence interval into a simultaneous one. There is also one feature potentially present in the data that we have thus far failed to address; we might reasonably expect that the seasonal pattern in temperature has changed with the increase in the level of the series over time. To fit a model that includes this possibility we'll need to investigate smooth functions of two (or more) variables, as well as look in more detail at how to build more complex GAM(M) models with <strong>mgcv</strong>. I'll see about addressing these two issues in future posts.
</p>
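<p>
As a small preview of the simultaneous interval idea (an assumption-laden sketch rather than the promised post): we can draw coefficient vectors from their approximate posterior, compute the derivative implied by each draw using the <code>Xi</code>, <code>df</code>, and <code>df.sd</code> quantities from the loop shown earlier, and rescale the pointwise interval by the 95th percentile of the maximum absolute standardized deviation.
</p>
<figure class="highlight">
<pre><code class="language-r">## sketch only: simulation-based simultaneous interval for a derivative
library(MASS)  # for mvrnorm()
set.seed(42)
nsim <- 10000
betas <- mvrnorm(nsim, coef(mod), mod$Vp)  # draws from approx. posterior
dev <- Xi %*% (t(betas) - coef(mod))       # deviation of each simulated derivative
masd <- apply(abs(dev / df.sd), 2, max)    # max abs standardized deviation per draw
crit <- quantile(masd, probs = 0.95)       # simultaneous critical value
upr <- drop(df) + crit * df.sd
lwr <- drop(df) - crit * df.sd</code></pre>
</figure>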
<section class="footnotes">
<hr />
<ol>
<li id="fn1">
<p>
There's a further argument, <code>eval</code>, which contains the value you wish to test for inclusion in the coverage of the confidence interval. By default this is set to <code>0</code>.<a href="#fnref1" class="footnote-back">↩</a>
</p>
</li>
</ol>
</section>