I am often found in possession of palaeo core data where the sample identifiers contain a core code or label plus the sample depth. Often these are generated by colleagues who have used other software where, for one reason or another, they don’t want to store the depth information as a separate numeric variable. I also generate such data sets, not because I want to but because the software supplied with lab equipment often records data/measurements using a single character identifier variable (the most recent example is the Thermo Flash EA/Delta V I’ve been running stable N and C isotope measurements on).
The information in these labels is useful and I really don’t want to type out all the depths again. It’s not just because I am lazy; the more times you have to enter data, the more opportunities there are for transcription errors to creep into your work and analysis.
What can be done to process these sorts of data with R to extract the useful information?
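To make the discussion concrete, here is a minimal sketch of the two kinds of label considered in this post. The core code `CORE` and the depths are invented for illustration; only the names `eg1` and `eg2` come from the text.

```r
# Hypothetical sample labels; the core code and depths are made up
depths <- seq(0, 20, by = 2.5)

# eg1: core code and depth run together, e.g. "CORE2.5"
eg1 <- paste0("CORE", depths)

# eg2: core code and depth separated by an underscore, e.g. "CORE_2.5"
eg2 <- paste("CORE", depths, sep = "_")

head(eg1)
head(eg2)
```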
For `eg2` we could split the strings on the `_` that separates the core code from the depth using `strsplit()` and process the resulting components. For example
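A sketch of that approach might look like the following. The contents of `eg2` are assumed (labels of the form `CORE_<depth>`); `spl` is the name the text uses for the result of the split.

```r
# Assumed example labels of the form "CORE_<depth>"
eg2 <- paste("CORE", seq(0, 20, by = 2.5), sep = "_")

# Split each label on the underscore; returns a list, one component per label
spl <- strsplit(eg2, "_")

# Take the second element of each component (the depth) and make it numeric
depth <- as.numeric(sapply(spl, `[`, 2))
depth
```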
To see how that code works, note that `strsplit()` returns a list with as many components as there are elements in the character vector supplied (e.g. `length(eg2)`). Each component of the list contains the individual character strings created by splitting.
Notice that the depth information is in the second element of each list component. To access this information for the first component we might use `spl[[1]][2]`, and the second one via `spl[[2]][2]`. Notice that the only thing that is changing here is the number in the `[[ ]]`. To each of the components of `spl` we are applying the `[` function with argument `2`; that can be automated via `sapply()` as shown above. The last part of the example just coerces the character vector of depths to a numeric one.
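The equivalence between indexing by hand and the `sapply()` call can be checked directly. This is a small sketch using three made-up labels.

```r
# Three invented labels, split on a literal underscore
eg2 <- c("CORE_0", "CORE_2.5", "CORE_5")
spl <- strsplit(eg2, "_", fixed = TRUE)

# Indexing each list component by hand
by_hand <- c(spl[[1]][2], spl[[2]][2], spl[[3]][2])

# `[` is an ordinary function, so spl[[1]][2] can also be written as a call
same_call <- `[`(spl[[1]], 2)

# sapply() applies `[` with argument 2 to every component in turn
automated <- sapply(spl, `[`, 2)

identical(by_hand, automated)
```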
All of that is a bit of a faff and won’t work for `eg1` because there is nothing to split on. An alternative solution is to use regular expressions. I’m no regular expression expert, and if there is anything in computing that will warp your feeble little mind it is a regular expression. However, these things are incredibly useful for matching or extracting bits of data from strings.
A regular expression contains placeholders or entities that you want to match or find within a given set of strings. For example, here is a modified version of `eg1` where the last element has a different format to the rest
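Such a vector might look like this. The name `eg3` and the values are assumptions for illustration; the final element, `"12.5CORE"`, is the one the text refers to.

```r
# Hypothetical labels: the last one has the depth before the core code
eg3 <- c(paste0("CORE", seq(0, 10, by = 2.5)), "12.5CORE")
eg3
```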
To match only those with one or more alphabetical characters at the start of the string we can use `"^[A-Za-z]+"` as our regular expression and the `grep()` function to do the matching
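As a sketch, with an invented vector `eg3` standing in for the modified labels:

```r
# Hypothetical labels; only the last one does not start with letters
eg3 <- c("CORE0", "CORE2.5", "CORE5", "12.5CORE")

# By default grep() returns the indices of the matching elements
idx <- grep("^[A-Za-z]+", eg3)

# With value = TRUE it returns the matching strings themselves
matched <- grep("^[A-Za-z]+", eg3, value = TRUE)
matched
```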
`[A-Za-z]` means match anything that is a letter in the English language alphabet. I added a quantifier, the `+`, which means match one or more of these letters. The last bit of the regular expression is the `^`, which anchors the match to the start of the string; anything that doesn’t begin with one or more letters will not be matched. If you look carefully at the result, `"12.5CORE"` is missing because it doesn’t start with one or more letters.
To match one or more letters at the end of a string, the `$` can be used, e.g.
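A minimal sketch, again using an invented vector `eg3`:

```r
# Hypothetical labels; only the last one ends with letters
eg3 <- c("CORE0", "CORE2.5", "CORE5", "12.5CORE")

# $ anchors the match to the end of the string
ends_with_letters <- grep("[A-Za-z]+$", eg3, value = TRUE)
ends_with_letters
```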
Let’s turn our attention back to `eg1`. A regular expression that would match each component of the strings could be `"([A-Za-z]+)([0-9\\.]+)"`. The parentheses group the various parts of the expression, which we’ll use in a moment. The first set of parentheses matches one or more letters whilst the second set matches one or more digits or decimal points. The decimal point has been escaped (which in R requires two backslashes, not the usual one) as `.` is a regular expression metacharacter (like `*`) that matches any single character. We want a literal `.` so we escape its usual meaning. (Strictly, a `.` inside square brackets is already treated literally, so the escape is not needed here.) As we now have a regular expression that will match the format of our sample labels we can proceed to manipulate them. This is where the parentheses come in. As I said, these group matches within the single expression. The matches within the parentheses can be referred to using backreferences. So I could use `\\1` to refer to the strings matched by the first set of parentheses and `\\2` to refer to matches in the second set. Note we need to double the backslash here as this is R.
To achieve our final goal of extracting the depth information from the sample labels we can combine this regular expression with the `gsub()` function, which does string replacement using regular expressions. If we think about what we want to do, we essentially want to replace the sample label with the extracted depth information to form a new set of strings. So we can match the two parts of our sample labels using our regular expression and replace them with a backreference to the depth part matched by the second set of parentheses. For example:
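A sketch of that replacement, assuming `eg1` holds labels of the form `CORE<depth>` (the values are made up; `depth_chr` and `core_chr` are names introduced here):

```r
# Assumed example labels with no separator between code and depth
eg1 <- paste0("CORE", seq(0, 20, by = 2.5))

# Replace each whole label with whatever the second group (the depth) matched
depth_chr <- gsub("([A-Za-z]+)([0-9\\.]+)", "\\2", eg1)
depth_chr

# Using "\\1" instead keeps the letters matched by the first group
core_chr <- gsub("([A-Za-z]+)([0-9\\.]+)", "\\1", eg1)
```

The result is still a character vector at this point.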
All that remains is to coerce that to numeric and we have our depth data
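Put together, and again assuming made-up labels in `eg1`, the whole extraction is a one-liner:

```r
# Assumed example labels
eg1 <- paste0("CORE", seq(0, 20, by = 2.5))

# Extract the depth part and coerce it from character to numeric
depth <- as.numeric(gsub("([A-Za-z]+)([0-9\\.]+)", "\\2", eg1))
depth
```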
eg2 can be handled in a similar way but we need to add _ to the characters matched by the first set of parentheses
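For instance, with assumed labels of the form `CORE_<depth>` (`depth_a` is a name introduced here):

```r
# Assumed example labels with an underscore separator
eg2 <- paste("CORE", seq(0, 20, by = 2.5), sep = "_")

# The underscore joins the letters in the first group's character class
depth_a <- as.numeric(gsub("([A-Za-z_]+)([0-9\\.]+)", "\\2", eg2))
depth_a
```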
or add it as a literal `_` between the two sets
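A sketch of this second variant, using the same assumed labels (`depth_b` and `core_b` are names introduced here):

```r
# Assumed example labels with an underscore separator
eg2 <- paste("CORE", seq(0, 20, by = 2.5), sep = "_")

# The underscore sits outside both groups, so neither backreference includes it
depth_b <- as.numeric(gsub("([A-Za-z]+)_([0-9\\.]+)", "\\2", eg2))
core_b <- gsub("([A-Za-z]+)_([0-9\\.]+)", "\\1", eg2)
depth_b
```

One consequence of this form is that `\\1` gives the core code without a trailing underscore.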
If you had a more complicated data set with several cores in the same file, each identified by a different core code, regular expressions can be used to extract the core and depth information. For example, given
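Such a data set might look like the following sketch. The data frame `dat`, its column name `sample`, the core codes `ABC` and `XYZ`, and the depths are all invented for illustration.

```r
# Hypothetical data: samples from two cores stored in a single label column
dat <- data.frame(sample = c(paste0("ABC", c(0.5, 1.5, 2.5)),
                             paste0("XYZ", c(1, 2, 3))),
                  stringsAsFactors = FALSE)
dat
```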
we could add site and label data using
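One way this could be done is sketched below, assuming a data frame `dat` with a character column `sample`. All names and values here are hypothetical, including the new column names `site` and `depth`.

```r
# Hypothetical data: samples from two cores stored in a single label column
dat <- data.frame(sample = c("ABC0.5", "ABC1.5", "ABC2.5",
                             "XYZ1", "XYZ2", "XYZ3"),
                  stringsAsFactors = FALSE)

# Same two-group pattern as before: letters, then digits and decimal points
pattern <- "([A-Za-z]+)([0-9\\.]+)"

# First group gives the core code, second group gives the depth
dat$site  <- gsub(pattern, "\\1", dat$sample)
dat$depth <- as.numeric(gsub(pattern, "\\2", dat$sample))
dat
```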
These are just some very simple regular expressions but hopefully you can see their power and utility for manipulations of character data that palaeo-types often have to handle.