Processing sample labels using regular expressions in R

Gavin Simpson

I often find myself in possession of palaeo core data where the sample identifiers contain a core code or label plus the sample depth. Often these are generated by colleagues using other software who, for one reason or another, don’t want to store the depth information as a separate numeric variable. I also generate such data sets myself, not because I want to but because the software supplied with lab equipment often records measurements against a single character identifier variable (the most recent example is the Thermo Flash EA/Delta V I’ve been running stable N and C isotope measurements on).

The information in these labels is useful and I really don’t want to type out all the depths again. That’s not just because I am lazy; the more times you have to enter data, the more opportunities there are for transcription errors to creep into your work and analysis. Here are two example sets of labels of the kind I mean

> (eg1 <- paste0("CORE", 0:10 + 0.5))
 [1] "CORE0.5"  "CORE1.5"  "CORE2.5"  "CORE3.5"  "CORE4.5"  "CORE5.5"
 [7] "CORE6.5"  "CORE7.5"  "CORE8.5"  "CORE9.5"  "CORE10.5"
> (eg2 <- paste0("FOO_", 0:10 + 0.5))
 [1] "FOO_0.5"  "FOO_1.5"  "FOO_2.5"  "FOO_3.5"  "FOO_4.5"  "FOO_5.5"
 [7] "FOO_6.5"  "FOO_7.5"  "FOO_8.5"  "FOO_9.5"  "FOO_10.5"

What can be done to process these sorts of data with R to extract the useful information?

With eg2 we could split the strings on _ using strsplit() and process the resulting components. For example

> as.numeric(sapply(strsplit(eg2, "_"), `[`, 2))
 [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5

To see how that code works, note that strsplit() returns a list with as many components as there are elements in the character vector supplied (here, length(eg2)). Each component of the list contains the individual character strings created by splitting.

> head(spl <- strsplit(eg2, "_"), 2)
[[1]]
[1] "FOO" "0.5"

[[2]]
[1] "FOO" "1.5"

Notice that the depth information is in the second element of each list component. To access this information for the first component we might use spl[[1]][2] and the second one via spl[[2]][2]. Notice that the only thing that is changing here is the number in the [[ ]]. To each of the components of spl we are applying the [ function with argument 2; that can be automated via sapply() as shown above. The last part of the example just coerces the character vector of depths to a numeric one.
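As a minimal sketch of the same idea written out step by step, the following uses vapply() in place of sapply(). That swap is my own choice rather than part of the example above: vapply() checks that every extracted piece is a single character string. The names depth_chr and depth are purely illustrative.

```r
# Recreate the example labels from above
eg2 <- paste0("FOO_", 0:10 + 0.5)

# Split each label on the literal underscore; the result is a list with
# one character vector per label
spl <- strsplit(eg2, "_", fixed = TRUE)

# Take the second piece of every list component; vapply() will stop with an
# error if any result is not exactly one character string
depth_chr <- vapply(spl, function(x) x[2], character(1))

# Coerce the character depths to numeric
depth <- as.numeric(depth_chr)
depth
```

The anonymous function here does the same job as passing `[` with argument 2 to sapply(); it is just a little more explicit about what is being extracted.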

All of that is a bit of a faff and won’t work for eg1 because there is nothing to split on. An alternative solution is to use regular expressions. I’m no regular expression expert and if there is anything in computing that will warp your feeble little mind it is a regular expression. However, these things are incredibly useful for matching or extracting bits of data from strings.

A regular expression contains placeholders or entities that you want to match or find within a given set of strings. For example, here is a modified version of eg1 where the last element has a different format to the rest

> (eg3 <- c(eg1, "12.5CORE"))
 [1] "CORE0.5"  "CORE1.5"  "CORE2.5"  "CORE3.5"  "CORE4.5"  "CORE5.5"
 [7] "CORE6.5"  "CORE7.5"  "CORE8.5"  "CORE9.5"  "CORE10.5" "12.5CORE"

To match only those with one or more alphabetical characters at the start of the string we can use "^[A-Za-z]+" as our regular expression and grep() to do the matching

> grep("^[A-Za-z]+", eg3, value = TRUE)
 [1] "CORE0.5"  "CORE1.5"  "CORE2.5"  "CORE3.5"  "CORE4.5"  "CORE5.5"
 [7] "CORE6.5"  "CORE7.5"  "CORE8.5"  "CORE9.5"  "CORE10.5"

The [A-Za-z] means match any single letter of the English alphabet, in upper or lower case. I added a quantifier, the +, which means match one or more of these letters. The last bit of the regular expression is the ^, which anchors the match to the start of the string; anything that doesn’t begin with one or more letters will not be matched. If you look carefully at the result, "12.5CORE" is missing because it doesn’t start with one or more letters.

To match one or more letters at the end of a string, the $ can be used, e.g.

> grep("[A-Za-z]+$", eg3, value = TRUE)
[1] "12.5CORE"
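The same anchored expressions can also be used to flag labels rather than extract them. This is a small sketch using grepl(), which returns a logical vector instead of the matching strings; the variable name starts_with_letters is just for illustration.

```r
# Recreate the labels, including the one odd label at the end
eg1 <- paste0("CORE", 0:10 + 0.5)
eg3 <- c(eg1, "12.5CORE")

# TRUE where the label begins with one or more letters, FALSE otherwise
starts_with_letters <- grepl("^[A-Za-z]+", eg3)

# Labels that do NOT follow the expected format
eg3[!starts_with_letters]

# How many labels conform
sum(starts_with_letters)
```

A logical vector like this is handy for a quick check that all the labels in a file follow the format you think they do before you start pulling them apart.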

Let’s turn our attention back to eg1. A regular expression that would match each component of the strings could be "([A-Za-z]+)([0-9\\.]+)". The parentheses group the various parts of the expression, which we’ll use in a moment. The first set of parentheses matches one or more letters whilst the second set matches one or more digits or decimal points. The decimal point has been escaped (which in R requires two backslashes rather than the usual one) because, outside square brackets, . is a regular expression metacharacter (like + and *) that matches any single character. We want a literal . so we escape its usual meaning. Strictly speaking, a . inside square brackets is already treated literally, so the escape isn’t required here, but it does no harm.

As we now have a regular expression that will match the format of our sample labels we can proceed to manipulate them. This is where the parentheses come in. As I said, these group matches within the single expression. The matches within the parentheses can be referred to using backreferences. So I could use \\1 to refer to the strings matched by the first set of parentheses and \\2 to refer to those matched by the second set. Note that we need to double the backslash here because R string literals themselves use the backslash as an escape character.
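To see what each set of parentheses actually captures, one option is base R’s regexec() together with regmatches(). This pairing isn’t used elsewhere in this post, so treat it as a side sketch: for each label it returns the full match followed by the text captured by each group.

```r
# Recreate the example labels
eg1 <- paste0("CORE", 0:10 + 0.5)

# The expression discussed above: letters, then digits and decimal points
rexp <- "([A-Za-z]+)([0-9\\.]+)"

# regexec() records where the whole match and each group were found;
# regmatches() then extracts the corresponding substrings
m <- regmatches(eg1, regexec(rexp, eg1))

# For the last label: the full match, then group 1, then group 2
m[[11]]

# Group 2 (third element of each component) for every label
sapply(m, `[`, 3)
```

The second and third elements of each component are what \\1 and \\2 refer to in a replacement string.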

To achieve our final goal of extracting the depth information from the sample labels we can combine this regular expression with the gsub() function, which does string replacement using regular expressions. If we think about what we want to do, we want to essentially replace the sample label with the extracted depth information to form a new set of strings. So we can match the two parts of our sample labels using our regular expression and replace them with a backreference to the depth part matched by the second set of parentheses. For example:

> gsub("([A-Za-z]+)([0-9\\.]+)", "\\2", eg1)
 [1] "0.5"  "1.5"  "2.5"  "3.5"  "4.5"  "5.5"  "6.5"  "7.5"  "8.5"  "9.5"
[11] "10.5"

All that remains is to coerce that to numeric and we have our depth data

> as.numeric(gsub("([A-Za-z]+)([0-9\\.]+)", "\\2", eg1))
 [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5
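One thing to watch for: where the expression doesn’t match at all, gsub() returns the original string unchanged, and as.numeric() then gives NA with a warning. Here is a hedged sketch of one way to guard against that, using the eg3 labels from earlier. The ^ and $ anchors and the variables ok and depth are my additions for illustration, and assume you want the whole label to conform to the pattern.

```r
# Labels including one that does not follow the CODE-then-depth format
eg3 <- c(paste0("CORE", 0:10 + 0.5), "12.5CORE")

# Same expression as above, anchored so the entire label must match
rexp <- "^([A-Za-z]+)([0-9\\.]+)$"

# Which labels conform to the expected format?
ok <- grepl(rexp, eg3)

# Only convert the conforming labels; leave the rest as NA
depth <- rep(NA_real_, length(eg3))
depth[ok] <- as.numeric(sub(rexp, "\\2", eg3[ok]))

# Labels needing attention by hand
eg3[!ok]
depth
```

Checking with grepl() first means any non-conforming labels are listed explicitly rather than silently turning into missing values.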

eg2 can be handled in a similar way but we need to add _ to the characters matched by the first set of parentheses

> as.numeric(gsub("([A-Za-z_]+)([0-9\\.]+)", "\\2", eg2))
 [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5

or add it as a literal _ between the two sets

> as.numeric(gsub("([A-Za-z]+)_([0-9\\.]+)", "\\2", eg2))
 [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5
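If labels with and without the underscore turn up in the same data set, a third variation is to make the underscore optional with the ? quantifier, which matches zero or one of the preceding character. This is a sketch under that assumption; the combined vector labels and the results depth and code are illustrative names only.

```r
# Both styles of label in a single vector
eg1 <- paste0("CORE", 0:10 + 0.5)
eg2 <- paste0("FOO_", 0:10 + 0.5)
labels <- c(eg1, eg2)

# Letters, then an optional underscore, then digits and decimal points
rexp <- "([A-Za-z]+)_?([0-9\\.]+)"

# Group 2 holds the depth, group 1 the core code (without any underscore)
depth <- as.numeric(gsub(rexp, "\\2", labels))
code  <- gsub(rexp, "\\1", labels)

unique(code)
depth
```

Because the underscore sits outside both sets of parentheses, it is dropped from the core code as well as from the depth.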

If you had a more complicated data set with several cores in the same file, each identified by a different core code, regular expressions can be used to extract both the core and depth information. For example, given

> set.seed(1)
> dat <- data.frame(Label = paste0(rep(c("WAST", "NAGA"), each = 3), rep(0:2 + 0.5, 2)),
+                    Value = runif(6))
> dat
    Label     Value
1 WAST0.5 0.2655087
2 WAST1.5 0.3721239
3 WAST2.5 0.5728534
4 NAGA0.5 0.9082078
5 NAGA1.5 0.2016819
6 NAGA2.5 0.8983897

we could add site and depth data using

> rexp <- "([A-Za-z]+)([0-9\\.]+)"
> dat <- transform(dat, Site  = gsub(rexp, "\\1", Label),
+                        Depth = as.numeric(gsub(rexp, "\\2", Label)))
> dat
    Label     Value Site Depth
1 WAST0.5 0.2655087 WAST   0.5
2 WAST1.5 0.3721239 WAST   1.5
3 WAST2.5 0.5728534 WAST   2.5
4 NAGA0.5 0.9082078 NAGA   0.5
5 NAGA1.5 0.2016819 NAGA   1.5
6 NAGA2.5 0.8983897 NAGA   2.5
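Once the site and depth are proper variables, the usual data frame tools apply. As a brief illustration (the object names naga_deep and dat_sorted are made up for this sketch), the following rebuilds the example data, then subsets one core by depth and orders the samples by site and depth.

```r
# Rebuild the example data and add the Site and Depth variables
set.seed(1)
dat <- data.frame(Label = paste0(rep(c("WAST", "NAGA"), each = 3),
                                 rep(0:2 + 0.5, 2)),
                  Value = runif(6))
rexp <- "([A-Za-z]+)([0-9\\.]+)"
dat <- transform(dat, Site  = gsub(rexp, "\\1", Label),
                      Depth = as.numeric(gsub(rexp, "\\2", Label)))

# Samples from the NAGA core with a depth greater than 1
naga_deep <- subset(dat, Site == "NAGA" & Depth > 1)
naga_deep

# All samples ordered by site, then by increasing depth within site
dat_sorted <- dat[order(dat$Site, dat$Depth), ]
dat_sorted
```

Neither operation would be practical on the raw labels, since sorting them as character strings would put "CORE10.5" before "CORE2.5".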

These are just some very simple regular expressions but hopefully you can see their power and utility for manipulations of character data that palaeo-types often have to handle.