Pivoting tidily

25 October 2019 /posted in: R

One of the fun bits of my job is that I have actual time dedicated to helping colleagues and grad students with statistical or computational problems. Recently I’ve been helping one of our Lab Instructors with some data from their Plant Physiology Lab course. Whilst I was writing some R code to import the raw data for the lab from an Excel sheet, it occurred to me that this would be a good excuse to look at the new pivot_longer() and pivot_wider() functions from the tidyr package. In this post I show how these new functions facilitate common data processing steps; I was personally surprised at how little data wrangling was actually needed in the end to read in the data from the lab.
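
As a rough illustration of what the two functions do (this is not the code from the post; the data frame and its column names are made up), a wide table with one column per measurement day can be reshaped to long form and back again:

```r
library(tidyr)

# Hypothetical wide data: one row per plant, one column per day
wide <- data.frame(
  plant = c("a", "b"),
  day1  = c(1.2, 0.9),
  day2  = c(1.8, 1.1)
)

# Wide -> long: the day columns are stacked into name/value pairs
long <- pivot_longer(wide,
                     cols      = c(day1, day2),
                     names_to  = "day",
                     values_to = "height")

# Long -> wide: the inverse operation
wide_again <- pivot_wider(long,
                          names_from  = day,
                          values_from = height)
```

Here `long` has one row per plant-day combination, and `wide_again` recovers the original layout.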

Read on »

radian: a modern console for R

18 June 2019 /posted in: R

Whenever I’m developing R code or writing data wrangling or analysis scripts for research projects that I work on I use Emacs and its add-on package Emacs Speaks Statistics (ESS). I’ve done so for nigh on a couple of decades now, ever since I switched full time to running Linux as my daily OS. For years this has served me well, though I wouldn’t call myself an Emacs expert; not even close! With a bit of help from an R Core coding standards document I got indentation working how I like it, I learned to contort my fingers in weird and wonderful ways to execute a small set of useful shortcuts, and I even committed some of those shortcuts to memory. More recently, however, my go-to methods for configuring Emacs+ESS were failing; indentation was all over the shop, the smart _ stopped working or didn’t work as it had for over a decade, syntax highlighting of R-related files, like .Rmd, was hit and miss, and polymode was just a mystery to me. Configuring Emacs+ESS was becoming much more of a chore, and rather unhelpfully, my problems coincided with my having less and less time to devote to tinkering with my computer setups. Also, fiddling with this stuff just wasn’t fun any more. So, in a fit of pique following one too many reconfiguration sessions of Emacs+ESS, I went in search of some greener grass. During that search I came across radian, a neat, attractive, simple console for working with R.

Read on »

Tibbles, checking examples, & character encodings

22 January 2019 /posted in: R

Recently I’ve been preparing my gratia package for submission to CRAN. During my pre-flight testing I noticed an issue under Windows when checking the examples in the package against the reference output I generated on Linux. In the latest release of the tibble package, the way tibbles are printed has changed subtly and in a way that leads to cross-platform differences. As I write this, tibbles with more than a set number of rows are printed in a truncated form, showing only the first 10 rows of data. In such cases, a final line is printed with an ellipsis and a note as to how many more rows are in the tibble. It was this ellipsis that was causing the cross-platform issue: differences between the output generated on Windows and the reference output were being flagged during R CMD check on Windows. If this is causing you an issue, here’s one way to solve the problem.

Read on »

What's wrong with software paper preprints on EarthArXiv?

20 December 2018 /posted in: Science

Via Twitter I recently found out that EarthArXiv, a new preprint server for the geosciences, doesn’t accept software paper submissions. Actually, EarthArXiv doesn’t accept quite a few types of publication: some justifiably, like ad hominem attack pieces, others unjustifiably, like correspondence or opinion pieces. I find this general stance very odd indeed; commentary, editorial or opinion pieces and software papers are accepted in a large number of the general and specialized journals that serve the geoscience field, so why wouldn’t EarthArXiv want to host these prior to publication of the version of record in one of those journals?

Read on »

Confidence intervals for GLMs

10 December 2018 /posted in: R

You’ve estimated a GLM or a related model (GLMM, GAM, etc.) for your latest paper and, like a good researcher, you want to visualise the model and show the uncertainty in it. In general this is done using confidence intervals, typically with 95% coverage. If you remember a little bit of theory from your stats classes, you may recall that such an interval can be produced by adding to and subtracting from the fitted values 2 times their standard error. Unfortunately this only really works like this for a linear model. If I had a dollar (even a Canadian one) for every time I’ve seen someone present graphs of estimated abundance of some species where the confidence interval includes negative abundances, I’d be rich! Here, following the rule of “if I’m asked more than once I should write a blog post about it!” I’m going to show a simple way to correctly compute a confidence interval for a GLM or a related model.
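
A minimal sketch of one common approach, which may differ in detail from the code in the post itself: form the interval on the scale of the linear predictor, where the normal approximation is reasonable, and then back-transform both limits with the inverse of the link function. The simulated data and variable names below are purely illustrative.

```r
# Simulated count data; all names and values here are illustrative
set.seed(1)
df <- data.frame(x = runif(100))
df$y <- rpois(100, lambda = exp(0.5 + 1.5 * df$x))

m <- glm(y ~ x, data = df, family = poisson(link = "log"))

# Predict on the link (linear predictor) scale, with standard errors
new_df <- data.frame(x = seq(0, 1, length.out = 50))
p <- predict(m, newdata = new_df, type = "link", se.fit = TRUE)

# Build the interval on the link scale, then map it back to the
# response scale using the inverse link stored in the family object
ilink <- family(m)$linkinv
crit  <- qnorm(0.975)   # approximately 2, for a 95% interval
new_df$fit   <- ilink(p$fit)
new_df$lower <- ilink(p$fit - crit * p$se.fit)
new_df$upper <- ilink(p$fit + crit * p$se.fit)
```

Because the inverse of the log link is `exp()`, the lower limit can never fall below zero, which avoids the negative-abundance problem described above.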

Read on »

Introducing gratia

23 October 2018 /posted in: R

I use generalized additive models (GAMs) in my research work. I use them a lot! Simon Wood’s mgcv package is an excellent set of software for specifying, fitting, and visualizing GAMs for very large data sets. Despite recently dabbling with brms, mgcv is still my go-to GAM package. The only down-side to mgcv is that it is not very tidy-aware and the ggplot-verse may as well not exist as far as it is concerned. This in itself is no bad thing, though as someone who uses mgcv a lot but also prefers to do my plotting with ggplot2, this lack of awareness was starting to hurt. So, I started working on something to help bridge the gap between these two separate worlds that I inhabit. The fruit of that labour is gratia, and development has progressed to the stage where I am ready to talk a bit more about it.

gratia is an R package for working with GAMs fitted with gam(), bam() or gamm() from mgcv or gamm4() from the gamm4 package, although functionality for handling the latter is not yet implemented. gratia provides functions to replace the base-graphics-based plot.gam() and gam.check() that mgcv provides with ggplot2-based versions. Recent changes have also resulted in gratia being much more tidyverse aware and it now (mostly) returns outputs as tibbles.

In this post I wanted to give a flavour of what is currently possible with gratia and outline what still needs to be implemented.
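
For a quick flavour, here is a hedged sketch rather than an excerpt from the post. It assumes a recent version of gratia, in which draw() is the ggplot2-based counterpart to plot.gam() and appraise() is the counterpart to gam.check(); the data are simulated with mgcv's gamSim().

```r
library(mgcv)
library(gratia)

# Simulated example data from mgcv (columns y, x0, x1, x2, x3)
set.seed(2)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2, verbose = FALSE)

m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "REML")

# ggplot2-based plot of the estimated smooths
p_smooths <- draw(m)

# ggplot2-based model diagnostics
p_diag <- appraise(m)
```

Printing `p_smooths` or `p_diag` at the console displays the plots.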

Read on »

Controls on subannual variation in pCO2 in productive hardwater lakes

15 October 2018 /posted in: Science

This year is looking like a bumper year for papers from the lab and collaborations, past and ongoing. Over the summer hiatus three papers came out online in their version-of-record form. The first of these was a paper on work that Emma Wiik, a former postdoc in my lab and Peter Leavitt’s lab, conducted to further our research on the controls on CO2 exchange between lakes and the atmosphere.

Read on »

Summer hiatus

15 October 2018 /posted in: Science

It’s been quite some time since I last posted anything here. Mostly this was due to a very busy schedule since May that included teaching an online stats course, attending & presenting at three conferences, giving workshops at two of those conferences, and taking some well-earned vacation in Europe. Summer was also a busy time for manuscripts moving through the pipeline to being accepted and published. One thing I had hoped to do with the blog this year was publicize some of the work I do a little more. So, as normal service resumes here I hope to post some short pieces highlighting new papers that came out over the summer, and a few of these will be coming out over the next week or two.

One of the reasons for having this blog in the first place was to get me back into “writing mode”; I find it difficult at times, especially when the to-do list is long, to force myself to carve out time to both think and write. And as I get more and more out of practice writing, it takes more and more time to start or pick up work on manuscripts describing new results, and the words don’t flow easily at all. I find it much easier to write when I am towards the end of a writing period because I’ve literally forced myself to write. And, whilst blog posts aren’t the same kind of writing as for manuscripts, I hope that by just doing a little writing each week, it’ll be that bit easier to pick up work on a languishing manuscript or start something new.

Let’s see how I get on…

Read on »

Fitting GAMs with brms: part 1 a simple GAM

21 April 2018 /posted in: R

Regular readers will know that I have a somewhat unhealthy relationship with GAMs and the mgcv package. I use these models all the time in my research but recently we’ve been hitting the limits of the range of models that mgcv can fit. So I’ve been looking into alternative ways to fit the GAMs I want to fit but which can handle the kinds of data or distributions that have been cropping up in our work. The brms package (Bürkner, 2017) is an excellent resource for modellers, providing a high-level R front end to a vast array of model types, all fitted using Stan. brms is the perfect package to go beyond the limits of mgcv because brms even uses the smooth functions provided by mgcv, making the transition easier. In this post I take a look at how to fit a simple GAM in brms and compare it with the same model fitted using mgcv.

Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software 80, 1–28. doi:10.18637/jss.v080.i01.

Read on »

Comparing smooths in factor-smooth interactions II ordered factors

14 December 2017 /posted in: R

In a previous post I looked at an approach for computing the differences between smooths estimated as part of a factor-smooth interaction using s()’s by argument. When a common-or-garden factor variable is passed to by, gam() estimates a separate smooth for each level of the by factor. Using the (Xp) matrix approach, we previously saw that we can post-process the model to generate estimates for pairwise differences of smooths. However, the by variable approach of estimating a separate smooth for each level of the factor may be quite inefficient in terms of degrees of freedom used by the model. This is especially so in situations where the estimated curves are quite similar but wiggly; why estimate many separate wiggly smooths when one, plus some simple difference smooths, will do the job just as well? In this post I look at an alternative to estimating separate smooths using an ordered factor for the by variable.
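
The general shape of such a model can be sketched as follows, under the assumption of simulated data with made-up variable names (this is not the post's own example). With an ordered factor as the by variable, mgcv estimates difference smooths for every level except the reference level, so the formula also needs a smooth for the reference level and the factor as a parametric term.

```r
library(mgcv)

# Illustrative data: three groups sharing a wiggly trend, with group "b"
# departing from it by a simple linear difference
set.seed(42)
n  <- 300
df <- data.frame(x = runif(n),
                 g = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
df$y <- sin(2 * pi * df$x) +
  ifelse(df$g == "b", 0.5 * df$x, 0) +
  rnorm(n, sd = 0.3)

# Ordered version of the factor; treatment contrasts keep the parametric
# terms interpretable as differences from the reference level
df$og <- as.ordered(df$g)
contrasts(df$og) <- "contr.treatment"

# s(x) is the smooth for the reference level "a";
# s(x, by = og) adds difference smooths for levels "b" and "c" only
m <- gam(y ~ og + s(x) + s(x, by = og), data = df, method = "REML")
```

The fitted model contains three smooths: the reference smooth and one difference smooth for each of the two non-reference levels.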

Read on »