# First steps with MRF smooths

One of the specialist smoother types in the **mgcv** package is the Markov Random Field (MRF) smooth. This smoother essentially allows you to model spatial data with an intrinsic Gaussian Markov random field (GMRF). GRMFs are often used for spatial data measured over discrete spatial regions. MRFs are quite flexible as you can think about them as representing an undirected graph whose nodes are your samples and the connections between the nodes are specified via a neighbourhood structure. I’ve become interested in using these MRF smooths to include information about relationships between species. However, these smooths are not widely documented in the smoothing literature so working out how best to use them to do what we want has been a little tricky once you move beyond the typical spatial examples. As a result I’ve been fiddling with these smooths, fitting them to some spatial data I came across in a tutorial Regional Smoothing in R from The Pudding. In this post I take a quick look at how to use the MRF smooth in **mgcv** to model a discrete spatial data set from the US Census Bureau.

In that tutorial, the example data are taken from the US Census Bureau via a shapefile prepared by the author. After a little munging — quite a few steps are missing from the tutorial — I managed to get data from the shapefile that matched what was used in the tutorial. The data are on county level percentages of US adults whose highest level of education attainment is a high school diploma. The raw data are shown in the figure below

To follow along, you’ll need to download the example shapefile provided by the author of the post on The Pudding. The shapefile(s) are in a ZIP, which I extracted into the working directory; the code below assumes this.

This post will make use of the following set of package; load them now, as shown below, and install any that you may be missing

Assuming you have extracted the shapefile, we load it into R using `readOGR()`

and do some data munging

The shapefile contains US Census Bureau data for all US counties, including many that are far from the continental USA. The tutorial from The Pudding doesn’t go into how they removed, or how they drew a map without these additional counties. For our purposes they may cause complications when we try to model them using the MRF smooth. I’m sure the modelling approach can handle data like this, but as I wanted to achieve something that followed the tutorial I’ve removed everything not linked to the continental US landmass, including (I’m sorry!), Alaska and Hawaii — my **ggplot** and mapping skills aren’t yet good enough to move Alaska and Hawaii to the bottom left of such maps.

The data were projected using the Albers equal area projection and subsequently passed to the `fortify()`

method from **ggplot2** to get a version of the county polygons suitable for plotting with that package.

Finally, I created a new variable `hsd`

which is just the variable `hs_pct`

divided by 100. This creates a proportion that we’ll need for model fitting as you’ll see shortly.

Before we can model these data with `gam()`

, we need to create the supporting information that `gam()`

will use to create the MRF smooth penalty. The penalty matrix in an MRF smooth is based on the neighbourhood structure of the observations. There are three ways to pass this information to `gam()`

- as a list of polygons (not
`SpatialPolygons`

, I believe) - as a list containing the neighbourhood structure, or
- the raw penalty matrix itself.

Options 1 and 3 aren’t easily doable as far I can see — `gam()`

isn’t expecting the sort of object we created when we imported the shapefile and nobody want’s to build a penalty matrix by hand! Thankfully option 2, the neighbourhood structure is relatively easy to create. For that I use the `poly2nb()`

function from the **spdep** package. This function takes a shapefile and works out which regions are neighbours of any other region by virtue of them sharing a border. To make sure everything matches up nicely in the way `gam()`

wants this list, we specify that the region IDs should be the `GEOID`

s from the original data set (the `GEOID`

uniquely identifies each county) and we have to set the `names`

attribute on the neighbouthood list to match these unique IDs

The result of the previous chunk is a list whose names map on to the levels of the `GEOID`

factor. The values in each element of `nb`

index the elements of `nb`

that are neighbours of the current element

With that done we can now fit the GAM. Fitting this is going to take a wee while (over 3 hours for the full rank MRF, using 6 threads, on a reasonably powerful 3-year old workstation with dual 4-core Xeon processors). To specify an MRF smooth we use the `bs`

argument to the `s()`

function, setting it to `bs = 'mrf'`

. The neighbourhood list is passed via the `xt`

argument, which takes a list as a value; here we specify a component `nb`

which takes our neighbourhood list `nb`

. The final set-up variable to consider is whether to fit a full rank MRF, where a coefficient for each county will be estimated, or a reduced rank MRF, wherein the MRF is represented using fewer coefficients and counties are mapped to the smaller set of coefficients. The rank of the MRF smooth is set using the `k`

argument. The default is to fit a full rank MRF, whilst setting `k < NROW(data)`

will result ins a reduced-rank MRF being etimated.

The full rank MRF model is estimated using

As the response is a proportion, the fitted GAM uses the beta distribution as the conditional distribution of the response. The default link in the logit, just as it is in for the binomial distribution, and insures that fitted values on the scale of the linear predictor are mapped onto the allowed range for proportions of 0–1.

The final model uses in the region of 1700 effective degrees of freedom. This is the smoothness penalty at work; rather than 3108 individual coefficients, the smoothness invoked to try to arrange for neighbouring counties to have similar coefficients has shrunk away almost half of the complexity implied by the full rank MRF.

Whilst the penalty enforces smoothness, further smoothness can be enforced by fitting a reduced rank MRF. In the next code block I fit models with `k = 300`

and `k = 30`

respectively, which imply considerable smoothing relative to the full rank model.

To visualise the different fits we need to generate predicted values on the response scale for each county and add this data to the county data `df`

Before we can plot these fitted values we need to merge `df`

with the fortified shapefile

To facilitate plotting with **ggplot2** I begin by creating some fixed plot components, like the theme, scale, and labels

I took many of these settings from Timo Grossenbacher’s excellent post on mapping regional demographic data in Switzerland.

Now we can plot the fitted proportions. Note that whilst we plot proportions, the colour bar labels are in percentages in keeping with the original data (see the definition for `my_scale`

to see how this was achieved).

Fitted values from the full rank MRF are shown below

This model expains about 90% of the deviance in the original data. Whilst some smoothing is evident, the fitted values show a considerable about of non-spatial variation. This is most likely due to not including important covariates, such as country average income, which might explain some of the finer scale structure; neighbouring counties with quite different proportions. A more considered analysis would include these and other relevant predictors alongside the MRF.

Smoother surfaces can be achieved via the reduced rank MRFs. First the rank 300 MRF

and next the rank 30 MRF

As can be clearly seen from the plots, the degree of smoothness can be controlled effectively via the `k`

argument.

In a future post I’ll take a closer look at using MRFs alongside other covariates as part of a model complex spatial modeling exercise.