# From the Bottom of the Heap: the musings of a geographer

## Fitting GAMs with brms: part 1, a simple GAM

#### 21 April 2018 /posted in: R

Regular readers will know that I have a somewhat unhealthy relationship with GAMs and the **mgcv** package. I use these models all the time in my research, but recently we’ve been hitting the limits of the range of models that **mgcv** can fit. So I’ve been looking into alternative ways to fit the GAMs I want, ways that can handle the kinds of data or distributions that have been cropping up in our work. The **brms** package (Bürkner, 2017) is an excellent resource for modellers, providing a high-level R front end to a vast array of model types, all fitted using Stan. **brms** is the perfect package with which to go beyond the limits of **mgcv**, not least because it uses the smooth functions provided by **mgcv**, making the transition easier. In this post I take a look at how to fit a simple GAM in **brms** and compare it with the same model fitted using **mgcv**.

Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. *Journal of Statistical Software* 80, 1–28. doi:10.18637/jss.v080.i01.
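As a taste of how little the model specification changes between the two packages, here is a minimal sketch (simulated data via `gamSim()` and **brms** defaults, not the code or priors from the post):

```r
library(mgcv)
library(brms)

set.seed(1)
dat <- gamSim(1, n = 400, scale = 2)   # standard Gu & Wahba test functions

# the mgcv fit
m_gam <- gam(y ~ s(x2), data = dat, method = "REML")

# the brms equivalent: the s() term is set up using mgcv's smooth
# constructors, then the whole model is sampled with Stan
m_brm <- brm(y ~ s(x2), data = dat, family = gaussian(),
             chains = 4, seed = 1)
```

The formula is identical in both calls; only the fitting machinery changes.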

## Comparing smooths in factor-smooth interactions II: ordered factors

#### 14 December 2017 /posted in: R

In a previous post I looked at an approach for computing the differences between smooths estimated as part of a factor-smooth interaction using `s()`’s `by` argument. When a common-or-garden factor variable is passed to `by`, `gam()` estimates a separate smooth for each *level* of the `by` factor. Using the Xp matrix approach, we previously saw that we can post-process the model to generate estimates for pairwise differences of smooths. However, the `by`-variable approach of estimating a separate smooth for each level of the factor may be quite inefficient in terms of the degrees of freedom used by the model. This is especially so in situations where the estimated curves are quite similar but wiggly; why estimate many separate wiggly smooths when one, plus some simple difference smooths, will do the job just as well? In this post I look at an alternative to estimating separate smooths, using an *ordered* factor for the `by` variable.
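The ordered-factor trick looks something like this in code; a hypothetical sketch on simulated data, not the example from the post:

```r
library(mgcv)
set.seed(2)
df <- gamSim(4, n = 400)       # response y, covariate x2, 3-level factor fac
df$fo <- ordered(df$fac)       # convert the by variable to an ordered factor

# one reference smooth of x2 for the first level, plus difference
# smooths for the remaining levels; the parametric fo term picks up
# the group means
m <- gam(y ~ fo + s(x2) + s(x2, by = fo), data = df, method = "REML")
summary(m)
```

With an ordered `by` factor, **mgcv** estimates the reference-level smooth once and only difference smooths for the other levels, which is where the degrees-of-freedom saving comes from.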

## First steps with MRF smooths

#### 19 October 2017 /posted in: R

One of the specialist smoother types in the **mgcv** package is the Markov random field (MRF) smooth. This smoother essentially allows you to model spatial data with an intrinsic Gaussian Markov random field (GMRF). GMRFs are often used for spatial data measured over discrete spatial regions. MRFs are quite flexible, as you can think of them as representing an undirected graph whose nodes are your samples and whose connections between nodes are specified via a neighbourhood structure. I’ve become interested in using these MRF smooths to include information about relationships between species. However, these smooths are not widely documented in the smoothing literature, so working out how best to use them to do what we want has been a little tricky once you move beyond the typical spatial examples. As a result I’ve been fiddling with these smooths, fitting them to some spatial data I came across in a tutorial, *Regional Smoothing in R*, from The Pudding. In this post I take a quick look at how to use the MRF smooth in **mgcv** to model a discrete spatial data set from the US Census Bureau.
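For a flavour of the syntax, here is a minimal MRF smooth fitted to the Columbus, OH, crime data that ships with **mgcv** (not the Census Bureau data used in the post):

```r
library(mgcv)
data(columb)         # Columbus, OH crime data aggregated to districts
data(columb.polys)   # district polygons defining the neighbourhood graph

# an MRF smooth over the discrete districts; the neighbourhood
# structure is passed through the xt argument
m <- gam(crime ~ s(district, bs = "mrf", xt = list(polys = columb.polys)),
         data = columb, method = "REML")
summary(m)
```

The neighbourhood structure can alternatively be supplied as a list of neighbour indices or a penalty matrix via the same `xt` argument.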

## Comparing smooths in factor-smooth interactions I: by-variable smooths

#### 10 October 2017 /posted in: R

One of the really appealing features of the **mgcv** package for fitting GAMs is the functionality it exposes for fitting quite complex models, models that lie well beyond what many of us may have learned about what GAMs can do. One of those features that I use a lot is the ability to model the smooth effects of some covariate *x* in the different levels of a factor. Having estimated a separate smoother for each level of the factor, the obvious question is: which smooths are different? In this post I’ll take a look at one way to do this using `by`-variable smooths.
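A minimal sketch of a `by`-variable factor smooth, on simulated data rather than the data used in the post:

```r
library(mgcv)
set.seed(3)
df <- gamSim(4, n = 400)   # simulated y, covariate x2, 3-level factor fac

# a separate smooth of x2 for each level of fac; the parametric fac
# term is needed because each by-variable smooth is centred
m <- gam(y ~ fac + s(x2, by = fac), data = df, method = "REML")
plot(m, pages = 1)
```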

## Fitting count and zero-inflated count GLMMs with mgcv

#### 04 May 2017 /posted in: R

A couple of days ago, Mollie Brooks and coauthors posted a preprint on bioRxiv illustrating the use of the **glmmTMB** R package for fitting zero-inflated GLMMs (Brooks et al., 2017). In the paper, **glmmTMB** is compared with several other GLMM-fitting packages. **mgcv** has recently gained the ability to fit a wider range of families beyond the exponential family of distributions, including zero-inflated Poisson models. **mgcv** can also fit simple GLMMs through a spline equivalent of a Gaussian random effect. So, whilst I was waiting on some Bayesian GAMs to finish sampling, I decided to see how **mgcv** compared against **glmmTMB** on the two examples used in the paper.

Brooks, M. E., Kristensen, K., Benthem, K. J. van, Magnusson, A., Berg, C. W., Nielsen, A., et al. (2017). Modeling zero-inflated count data with glmmTMB. *bioRxiv*, 132753. doi:10.1101/132753.
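A rough sketch of the kind of model involved, on made-up data rather than the examples from the paper:

```r
library(mgcv)
set.seed(4)
n  <- 500
g  <- factor(sample(letters[1:20], n, replace = TRUE))  # grouping factor
x  <- runif(n)
re <- rnorm(20, 0, 0.3)[as.integer(g)]                  # random intercepts
y  <- rpois(n, exp(0.5 + x + re)) * rbinom(n, 1, 0.7)   # crude zero inflation

# ziP() is mgcv's zero-inflated Poisson family; s(g, bs = "re") is the
# spline equivalent of a Gaussian random intercept
m <- gam(y ~ x + s(g, bs = "re"), family = ziP(),
         data = data.frame(y, x, g), method = "REML")
summary(m)
```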

## Prediction intervals for GLMs part II: Poisson GLMs

#### 01 May 2017 /posted in: R

One of my more popular answers on StackOverflow concerns the issue of prediction intervals for a generalized linear model (GLM). Comments, even on StackOverflow, aren’t a good place for a discussion, so I thought I’d post something here on my blog that went into a bit more detail as to why, for some common types of GLMs, prediction intervals aren’t that useful and require a lot more thinking about what they mean and how they should be calculated. I’ve broken the discussion into two parts; in this, the second part, I look at Poisson models.
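To fix ideas, here is a small base-R sketch (illustrative data only) of the distinction between a confidence interval for the mean of a Poisson GLM and an interval that also accounts for sampling variation:

```r
# Illustrative data only, not the example from the post
set.seed(5)
d <- data.frame(x = runif(50, 0, 2))
d$y <- rpois(50, exp(1 + d$x))

m <- glm(y ~ x, data = d, family = poisson())

# a confidence interval for the *mean*: compute on the link (log) scale,
# then back-transform
p  <- predict(m, newdata = data.frame(x = 1), type = "link", se.fit = TRUE)
ci <- exp(p$fit + c(-1.96, 1.96) * p$se.fit)

# a naive "prediction interval" must also include Poisson sampling
# variation about that mean, e.g. via quantiles of the Poisson itself
pint <- qpois(c(0.025, 0.975), lambda = exp(p$fit))
```

The point of the post is that the second interval, unlike the first, depends on the discreteness of the response and is not something `predict.glm()` will hand you directly.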

## Prediction intervals for GLMs part I: binomial GLMs

#### 01 May 2017 /posted in: R

One of my more popular answers on StackOverflow concerns the issue of prediction intervals for a generalized linear model (GLM). My answer really only addresses how to compute confidence intervals for parameters but in the comments I discuss the more substantive points raised by the OP in their question. Lately there’s been a bit of back and forth between Jarrett Byrnes and myself about what a prediction “interval” for a GLM might mean. Comments, even on StackOverflow, aren’t a good place for a discussion so I thought I’d post something here that went into a bit more detail as to why, for some common types of GLMs, prediction intervals aren’t that useful and require a lot more thinking about what they mean and how they should be calculated. For illustration, I thought I’d use some small teaching example data sets, but whilst writing the post it started to get a little on the long side. So, I’ve broken it into two and in this part I look at logistic regression.

## Simultaneous intervals for derivatives of smooths revisited

#### 21 March 2017 /posted in: R

Eighteen months ago I screwed up! I’d written a post in which I described the use of simulation from the posterior distribution of a fitted GAM to derive simultaneous confidence intervals for the derivatives of a penalized spline. It was a nice post that attracted some interest. It was also wrong. In December I corrected the first part of that mistake by illustrating one approach to compute an actual simultaneous interval, but only for the fitted smoother. At the time I thought that the approach I outlined would translate to the derivatives, but I was being lazy, then Christmas came and went, and I was back to teaching — you know how it goes. Anyway, in this post I hope to finally rectify my past stupidity and show how the approach used to generate simultaneous intervals from the December 2016 post can be applied to the derivatives of a spline.
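For reference, a condensed sketch of the simultaneous-interval machinery for the fitted smooth (illustrative code using `MASS::mvrnorm()`, not the exact implementation from the December 2016 post; the derivative case replaces `Xp` with a finite-difference version of it):

```r
library(mgcv)
library(MASS)
set.seed(6)
dat <- gamSim(1, n = 300, scale = 2)
m   <- gam(y ~ s(x2), data = dat, method = "REML")

newd <- data.frame(x2 = seq(0, 1, length.out = 200))
Xp   <- predict(m, newd, type = "lpmatrix")   # maps coefs to fitted curve
fit  <- drop(Xp %*% coef(m))
se   <- sqrt(rowSums((Xp %*% vcov(m)) * Xp))

# simulate coefficient vectors from their approximate posterior
beta <- mvrnorm(10000, coef(m), vcov(m))

# critical value from the distribution of the maximum absolute
# standardised deviation over the whole curve
dev  <- abs(Xp %*% t(beta) - fit) / se
crit <- quantile(apply(dev, 2, max), probs = 0.95)

upper <- fit + crit * se
lower <- fit - crit * se
```

The simultaneous interval differs from the usual pointwise one only in replacing 1.96 with the simulated critical value `crit`.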

## Modelling extremes using generalized additive models

#### 25 January 2017 /posted in: R

Quite some years ago, whilst working on the EU Sixth Framework project *Euro-limpacs*, I organized a workshop on statistical methods for analyzing time series data. One of the sessions was on the analysis of extremes, ably given by Paul Northrop (UCL Department of Statistical Science). That intro certainly whetted my appetite, but I never quite found the time to dig into the arcane world of extreme value theory. Two recent events rekindled my interest in extremes: Simon Wood quietly introduced into his **mgcv** package a family function for the generalized extreme value distribution (GEV), and I was asked to review a paper on extremes in time series. Since then I’ve been investigating options for fitting models for extremes to environmental time series, especially those that allow for time-varying effects of covariates on the parameters of the GEV. One of the first things I did was sit down with **mgcv** to get a feel for the `gevlss()` family function that Simon had added to the package, by repeating an analysis of a classic example data set that had been performed using the **VGAM** package of Thomas Yee.
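A minimal sketch of the `gevlss()` syntax, on simulated GEV data rather than the classic example data set from the post:

```r
library(mgcv)
set.seed(7)
n  <- 1000
x  <- runif(n)
mu <- 2 + sin(2 * pi * x)        # location varies smoothly with x
xi <- 0.1; sigma <- 1
u  <- runif(n)
y  <- mu + sigma * ((-log(u))^(-xi) - 1) / xi   # GEV via inverse transform

# gevlss() takes a list of formulae, one per GEV parameter
# (location, scale, shape); here only the location gets a smooth
m <- gam(list(y ~ s(x), ~ 1, ~ 1), family = gevlss(),
         data = data.frame(y, x))
summary(m)
```

Allowing smooths in the second and third formulae is what opens the door to time-varying scale and shape parameters.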

## Pangaea and R and open palaeo data (also GAM all the things!)

#### 16 December 2016 /posted in: R

For a while now, I’ve been wanting to experiment with rOpenSci’s **pangaear** package (Chamberlain et al., 2016), which allows you to search, and download data from, Pangaea, a major data repository for the earth and environmental sciences. Earlier in the year, as a member of the editorial board of Scientific Data, Springer Nature’s open data journal, I was handling a data descriptor submission that described a new 2,200-year foraminiferal δ¹⁸O record from the Gulf of Taranto in the Ionian Sea (Taricco et al., 2016). The data descriptor was recently published, and as part of the submission Carla Taricco deposited the data set in Pangaea. So, what better opportunity to test out **pangaear**? (Oh, and to fit a GAM to the data while I’m at it!)

Chamberlain, S., Woo, K., MacDonald, A., Zimmerman, N., and Simpson, G. (2016). *pangaear: Client for the ‘Pangaea’ database*. Available at: https://CRAN.R-project.org/package=pangaear.

Taricco, C., Alessio, S., Rubinetti, S., Vivaldo, G., and Mancuso, S. (2016). A foraminiferal δ¹⁸O record covering the last 2,200 years. *Scientific Data* 3, 160042. doi:10.1038/sdata.2016.42.
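A rough sketch of the **pangaear** workflow (the query string is hypothetical and the DOI of the actual deposit is not reproduced here):

```r
library(pangaear)

# search Pangaea, then pull the first matching data set by its DOI;
# the query below is illustrative, not the one used in the post
res <- pg_search(query = "Taricco Taranto foraminifera oxygen isotope")
rec <- pg_data(doi = res$doi[1])
str(rec[[1]]$data)
```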