New version of analogue on CRAN

Gavin Simpson

It has been almost a year since the last release of the analogue package. At lot has happened in the intervening period and although I’ve been busy with a new job in a new country and coding on several other R packages, activity on analogue has also progressed a pace. As the version 0.12-0 of the package hits a CRAN mirror near you, I thought I’d outline the major changes in the packages, which range from at long last having dissimilarity matrices computed in fast C code to lots of new functionality that makes fitting principal curves and plotting and interpreting the results much easier, from a more robust way to determine the posterior probability that two samples are analogues to rounding out the fitting of calibration models using principal components regression with ecologically-meaningful transformations.

Dissimilarity matrices

The original intent for the analogue package was for methods related to analogue matching, which at their heart involve computing dissimilarities between samples. Computing these pairwise-distances is quite time consuming, and even though I had a fairly efficient R-based implementation I always wanted to rewrite distance() to use fast C code. This has finally happened! The old behaviour can still be accessed via oldDistance() and as I have two implementations of the same functions I use both and compare results as part of new unit tests for the package. As far as the user is concerned, nothing has changed; the interface and arguments provided by distance() are the same as with previous versions. But the underlying code is now much quicker, especially for larger problems.

Principal curves

Fitting and working with principal curves, flexible smooth curves fitted in high dimensions, saw lots of additions and improvements in the period between the 0.10-0 and 0.12-0 releases. There are new methods for lines(), points(), scores() and residuals() which work with the output of prcurve(), and a nice 3D plotting function, plot3d(), courtesy of the rgl package. ~~Passive samples can now be handled through the provision of a predict() method.~~ Up to now, the only smoother that could be used to fit principal curves was a smoothing spline via smooth.spline(). With this release of analogue, GAMs can be used instead. This allows for better handling of species data via say Poisson or logistic regression, just as you’d fit response curves to individual species. This functionality is provided via gam() from package mgcv.

The object returned from prcurve() has also expanded to supply more useful information on the fit and to allow easier plotting of the curve. Now, each of the fitted smooth models is returned so that they can be inspected for individual species. In addition, the PCA space of the data is available as component ordination and the original species data is also returned.

Posterior probability of analogue-ness

analogue contains two functions to assess the degree to which samples are analogues of one another; roc() and logitreg(). logitreg() uses a logistic regression to model the posterior probability that two samples are analogues given their dissimilarity via a glm() fit. Such binomial models can suffer from several problems, especially separation, whereby at some value of the covariates perfect discrimination between the two values of the response is achieved. These models can also become biased if the relative proportions of 0s and 1s in the data is strongly skewed to one class or the other. Firth’s bias-reduced logistic regression is a useful alternative in such circumstances. With this release, logitreg() can fit bias-reduced logistic regression models by use of functions in the brglm package, as well as the standard GLM implemented in glm().

Principal component regression

Principal component regression (PCR) is a linear calibration method, used in chemometrics, and a form of PCR was used by Imbrie & Kipp in their original palaeoecological transfer function methodology. However, as it is a linear method it generally fails to adapt well to the often non-linear responses observed in species-environment data sets. In 2001 Pierre Legendre and Eugene Gallagher introduced the ecological world to the use of PCA on transformed data which could adequately model species data and being a simpler method it did not suffer from issues related to outliers or odd samples that plague CA. Their method achieved this via transformations of data that, when ordinated using PCA and the implicit Euclidean distance, result in an ordination that preserved a distance function other than the Euclidean. For example, if a Hellinger transformation is used, the PCA of such transformed data results in the ordination reflecting the Hellinger distances between samples in the scale of the original data.

The pcr() function in analogue extends this idea to principal component regression, and was added to the package in version 0.8-0. Version 0.12-0 completes the basic functionality required to use PCR with ecologically-meaningful transformations. The full range of cross-validation methods (n repeats of k-fold, leave-one-out, and bootstrap CV) are now included in the crossval() method. Predictions from new samples can now be produced using the predict() method and sample-specific errors derived using n repeats of k-fold, and bootstrap CV.

A summary of the main changes is in the new NEWS file, and a detailed list in the ChangeLog. I have a number of posts in development that will illustrate some of the above new functionality and methods, which will be posted over the next few weeks.

You can get the new version of analogue now from CRAN.