Modelling extremes using generalized additive models
25 January 2017 /posted in: R
Quite some years ago, whilst working on the EU Sixth Framework project Euro-limpacs, I organized a workshop on statistical methods for analyzing time series data. One of the sessions was on the analysis of extremes, ably given by Paul Northrop (UCL Department of Statistical Science). That intro certainly whet my appetite but I never quite found the time to dig into the arcane world of extreme value theory. Two recent events rekindled my interest in extremes; Simon Wood quietly introduced into his mgcv package a family function for the generalized extreme value distribution (GEV), and I was asked to review a paper on extremes in time series. Since then I’ve been investigating options for fitting models for extremes to environmental time series, especially those that allow for time-varying effects of covariates on the parameters of the GEV. One of the first things I did was sit down with mgcv to get a feel for the
gevlss() family function that Simon had added to the package by repeating an analysis of a classic example data set that had been performed using the VGAM package of Thomas Yee.
Pangaea and R and open palaeo data (also GAM all the things!)
16 December 2016 /posted in: R
For a while now, I’ve been wanting to experiment with rOpenSci’s pangaear package (Chamberlain et al., 2016), which allows you to search, and download data from, the Pangaea, a major data repository for the earth and environmental sciences. Earlier in the year, as a member of the editorial board of Scientific Data, Springer Nature’s open data journal I was handling a data descriptor submission that described a new 2,200-year foraminiferal δ18O record from the Gulf of Taranto in the Ionian Sea (Taricco et al., 2016). The data descriptor was recently published and as part of the submission Carla Taricco deposited the data set in Pangaea. So, what better opportunity to test out pangaear? (Oh and to fit a GAM to the data while I’m at it!)
Chamberlain, S., Woo, K., MacDonald, A., Zimmerman, N., and Simpson, G. (2016). Pangaear: Client for the ’pangaea’ database. Available at: https://CRAN.R-project.org/package=pangaear.
Taricco, C., Alessio, S., Rubinetti, S., Vivaldo, G., and Mancuso, S. (2016). A foraminiferal ()18O record covering the last 2,200 years. Scientific Data 3, 160042. doi:http://doi.org/10.1038/sdata.2016.42.
Simultaneous intervals for smooths revisited correcting a silly mistake
15 December 2016 /posted in: R
Eighteen months ago I wrote a post in which I described the use of simulation from the posterior distribution of a fitted GAM to derive simultaneous confidence intervals for the derivatives of a penalised spline. It was a nice post that attracted some interest. It was also wrong. I have no idea what I was thinking when I thought the intervals described in that post were simultaneous. Here I hope to rectify that past mistake.
I’ll tackle the issue of simultaneous intervals for the derivatives of penalised spline in a follow-up post. Here, I demonstrate one way to compute a simultaneous interval for a penalised spline in a fitted GAM. As example data, I’ll use the strontium isotope data set included in the SemiPar package, and which is extensively analyzed in the monograph Semiparametric Regression (Ruppert et al., 2003). First, load the packages we’ll need as well as the data, which is data set
fossil. If you don’t have SemiPar installed, install it using
install.packages(“SemiPar”) before proceeding
Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric regression. Cambridge University Press.
ISEC 2016 Talk
02 July 2016 /posted in: Science
My ISEC 2016 talk, Estimating temporal change in mean and variance of community composition via location, scale additive models, describes some of my recent research into methods to analyse palaeoenvironmental time series from sediment cores.
Rootograms a new way to assess count models
07 June 2016 /posted in: R
Assessing the fit of a count regression model is not necessarily a straightforward enterprise; often we just look at residuals, which invariably contain patterns of some form due to the discrete nature of the observations, or we plot observed versus fitted values as a scatter plot. Recently, while perusing the latest statistics offerings on ArXiv I came across Kleiber and Zeileis (2016) who propose the rootogram as an improved approach to the assessment of fit of a count regression model. The paper is illustrated using R and the authors’ countreg package (currently on R-Forge only). Here, I thought I’d take a quick look at the rootogram with some simulated species abundance data.
Kleiber, C., and Zeileis, A. (2016). Visualizing count data regressions using rootograms.
Harvesting more Canadian climate data
24 May 2016 /posted in: R
A while back I wrote some code to download climate data from Government of Canada’s historical climate/weather data website for one of our students. In May this year (2016) the Government of Canada changed their website a little and the API code that responded to requests had changed URL and some of the GET parameters had also changed. In fixing those functions I also noted that the original code only downloaded hourly data and not all useful weather variables are recorded hourly; precipitation for example is only in the daily and monthly data formats. This post updates the earlier one, explaining what changed and how the code has been updated. As an added benefit, the functions can now handle downloading daily and monthly data files as well as the hourly files that the original could handle.
A new default plot for multivariate dispersions tribulations of base graphics programming
17 April 2016 /posted in: R
This weekend, prompted by a pull request from Michael Friendly, I finally got round to improving the
plot method for
betadisper() in the vegan package.
betadisper() is an implementation of Marti Anderson’s Permdisp method, a multivariate analogue of Levene’s test for homogeneity of variances. In improving the default plot and allowing customisation of plot features, I was reminded of how much I dislike programming plot functions that use base graphics. But don’t worry, this isn’t going to degenerate into a ggplot love-in nor a David Robinson-esque dig at Jeff Leek.
10 April 2016 /posted in: Science
It’s fair to say I have gotten a bee1 in my bonnet about how palaeolimnologists handle time. For a group of people for whom time is everything, we sure do a poor job (in general) of dealing with it in when it comes time to analyse our data. In many instances, “poor job” means making no attempt at all to account for the special nature of the time series. LOESS comes in for particular criticism because it is widely used by palaeolimnologists despite not being particularly suited to the task. Why this is so is perhaps due to it’s promotion in influential books, papers, and software. I am far from innocent in this regard having taught LOESS and it’s use for many years on the now-defunct ECRC Numerical Course. Here I want to look at further problems with our use of LOESS, and will argue that we need to resign it to the trash can for all but exploratory analyses. I will begin the case for the prosecution with one of my own transgressions.
an entire hive is perhaps more apt!↩
Soap-film smoothers & lake bathymetries
27 March 2016 /posted in: R
A number of years ago, whilst I was still working at ENSIS, the consultancy arm of the ECRC at UCL, I worked on a project for the (then) Countryside Council for Wales (CCW; now part of Natural Resources Wales). I don’t recall why they were doing this project, but we were tasked with producing a standardised set of bathymetric maps for Welsh lakes. The brief called for the bathymetries to be provided in standard GIS formats. Either CCW’s project manager or the project lead at ENSIS had proposed to use inverse distance weighting (IWD) to smooth the point bathymetric measurements. This probably stemmed from the person that initiatied our bathymetric programme at ENSIS being a GIS wizard, schooled in the ways of ArcGIS. My involvement was mainly data processing of the IDW results. I was however, at the time, also somewhat familiar with the problem of finite area smoothing1 and had read a paper of Simon Wood’s on his then new soap-film smoother (Wood et al., 2008). So, as well as writing scripts to process and present the IDW-based bathymetry data in the report, I snuck a task into the work programme that allowed me to investigate using soap-film smoothers for modelling lake bathymetric data. The timing was never great to write up this method (two children and a move to Canada have occurred since the end of this project), so I’ve not done anything with the idea. Until now…
Wood, S. N., Bravington, M. V., and Hedley, S. L. (2008). Soap film smoothing. Journal of the Royal Statistical Society. Series B, Statistical methodology 70, 931–955. doi:http://doi.org/10.1111/j.1467-9868.2008.00665.x.
smoothing over a domain with known boundaries, like a lake↩
Additive modelling global temperature time series: revisited
25 March 2016 /posted in: R
Quite some time ago, back in 2011, I wrote a post that used an additive model to fit a smooth trend to the then-current Hadley Centre/CRU global temperature time series data set. Since then the media and scientific papers have been full of reports of record warm temperatures in the past couple of years, of controversies (imagined) regarding data-changes to suit the hypothesis of human induce global warming, and the brouhaha over whether global warming had stalled; the great global warming hiatus or pause. So it seemed like a good time to revisit that analysis and update it using the latest HadCRUT data.