
Wednesday, August 13, 2014

99th ESA Meeting in Sacramento

Coming to you live from Sacramento, California, it’s Datamancy on the Road!


I’m posting this with extreme pre-talk jitters: I will be presenting tomorrow morning in conference room 314 at the Sacramento conference centre. I’m hoping for the best.
If you’re reading this after the fact and have come to have another look at my slides, thanks for your interest! I really love what I do and enjoy sharing it just as much. Skip to the bottom to get to the slides.

Conference Recap

Environmental Sequencing Workshop


The ESA 2014 annual meeting has been a whirlwind of wonderful science and people. I was lucky enough to sit in on an excellent workshop led by Holly Bik, Kelley Thomas, Dorota Porazinska, and Simon Creer on environmental sequencing approaches. It was a blast to hear from such passionate people in such an interesting field.
I went into the session hoping to learn a little more about the state of the field in order to answer an irksome problem my supervisor has been grappling with. Our model species (Splachnum ampullaceum and Splachnum pensylvanicum) are poorly studied genetically, so efficient markers don’t exist for detecting their spores in the environment. I wanted to tap into the thoughts of people who deal with this kind of problem daily.
I’ve been inspired to revisit the prior work that has been done on developing markers for our species, and perhaps employ some more modern techniques for identifying them. So kudos to the leaders for putting the workshop together.

Talks, Talks, Posters, and More Talks


I really enjoyed getting a whirlwind tour of what’s going on in ecology research. I focused mostly on catching community ecology and modelling talks. I saw many other young scientists put themselves on display for what must have been an intimidating crowd.

I really enjoyed talks from:

Aaron David on microbe inoculation sources for new plants. Interesting work teasing out whether new plant roots gain their fungal endophytes from the parent plant, conspecific neighbours, or the surrounding soil.

Yue “Max” Li from the Chesson lab, who looked at coexistence using a long-standing field experiment.
Jiaqi Tan, who looked at coexistence mechanisms amongst protists. He used a cool strategy to identify dietary strategy amongst the protists, then used dietary preference as a proxy for niche to estimate niche differentiation (amongst other things).

Dan Scholes from Ken Paige’s lab in Illinois, who is looking at the genetic basis of plant overcompensation in response to herbivory. I had the pleasure of having dinner with Dan and Ken. It is fascinating work that reminded me of the Arabidopsis days of my undergrad.

Those are just a few of my highlights. I’d continue, but it is exhausting looking for personal websites for young scientists.

Wednesday, August 6, 2014

Regular Expressions in R: A Primer

Preamble


Hi readers,
Sorry for the delay in posting. I’ve been hard at work getting my talk ready for the ESA annual meeting in Sacramento next week. I’ll definitely post a recap and link to my slides after the conference.

I’m on a workacation on Lake of Bays in Ontario (my home province) at the moment, enjoying the lake-side life on my breaks.

I’m continuing work on my Build Your Bar project, dabbling in natural language processing. I find myself using regular expressions (regex) frequently enough to think it warrants a post. The R help page on regex (?regex) is long and technical and makes no attempt to convince you of its importance. So here is a quick primer on using regex in R. The focus will be on perl-like regular expressions.

History


Regular expressions have been a part of computing since its early days. When the primary input-output interface to a computer was text, an efficient means of finding the exact text you were interested in was necessary. “Simple” shorthands for complicated series of characters were developed. The idea for regular expressions originated in the work of Stephen Kleene (1956) on regular sets. Just over a decade later, Ken Thompson built the idea into the text editor QED, and from there it made its way into the Unix operating system. For a more detailed history of regex I suggest reading Staffan Noteberg’s brief review.

R’s Native Regex


R supports two, or imprecisely three, regex schemes. The two true regex schemes are Extended, which follows the POSIX 1003.2 standard, and Perl-Like, my personal favourite, which adapts the incredibly powerful regex syntax of the perl programming language. The last scheme is Fixed, which is less of a regex style than a lack thereof: fixed mode will take characters absolutely literally, except for R’s own escape characters.

By default R uses extended regex. The unfortunate consequence is that if you choose to follow my lead and use perl-style regex, you have to be a bit more verbose in your code and specify perl = TRUE each time you use a regex function [1]. I, however, feel it’s a small price to pay for getting to use perl-like regex.
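
To make the three modes concrete, here is a minimal sketch using base R's grepl. The test strings are made up for illustration; the only thing that changes between calls is the mode argument.

```r
# Two hypothetical test strings: one with a literal dot, one without.
x <- c("a.c", "abc")

# Default (extended) mode: "." is a wild-card, so both strings match.
extended_hits <- grepl("a.c", x)

# Perl-like mode: same answer for this simple pattern, but perl syntax is now available.
perl_hits <- grepl("a.c", x, perl = TRUE)

# Fixed mode: "." is taken literally, so only "a.c" matches.
fixed_hits <- grepl("a.c", x, fixed = TRUE)

print(extended_hits)  # TRUE TRUE
print(perl_hits)      # TRUE TRUE
print(fixed_hits)     # TRUE FALSE
```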

More about Perl-Like


The help file for regex offers the advice to head to the PCRE web page, but you will likely want to go straight to the source: http://perldoc.perl.org/perlre.html. The perl documentation page on the current state of their regex syntax is a handy reference for anyone who plans on using perl-like regex.

Unfortunately for us R users, many of perl’s really handy regex features aren’t available to us. Native perl regex allow for concurrent search and accession, meaning you can search for multiple tokens within a string and access the returned tokens without storing the matches in an intermediary object first. More simply, in perl, searching and editing a string is essentially the same action. In R, searching and accession are handled by completely separate functions, and regular expressions aren’t tokenized to allow multiple operations on the same search criterion.
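
Here is a small sketch of what that separation looks like in base R, assuming a made-up recipe note as input. Searching (gregexpr), accessing (regmatches), and editing (gsub) are three different calls, and the positions found by the search have to be stored in between.

```r
# Hypothetical input string for illustration.
note <- "Add 2 oz gin and 15 ml syrup"

# Step 1: search. gregexpr() only reports where the matches are.
hits <- gregexpr("\\d+", note, perl = TRUE)

# Step 2: access. regmatches() uses the stored positions to pull out the text.
amounts <- regmatches(note, hits)[[1]]
print(amounts)  # "2" "15"

# Editing is yet another function call that repeats the pattern.
masked <- gsub("\\d+", "#", note, perl = TRUE)
print(masked)  # "Add # oz gin and # ml syrup"
```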

These differences are due to the fact that perl was designed for text manipulation. It is self-described as the “swiss army knife” of scripting languages, allowing you to combine functionality from multiple languages with perl as the middle-man. The lingua franca of all programming languages is the source code, which perl can elegantly cut, trim, and glue together. R was built for statistics, and so string manipulation is more of an acquired skill. However, let it never be said you can’t do powerful string manipulation in R; it’s just not true. String manipulation in R is clunky, maybe, but powerful enough to do anything you need.

Finding That Special String


I love string manipulation because it appeals to my love of puzzle solving. Finding just the right regex to get what you’re looking for really gives me the thrill of the hunt.
  • So you want to find a number in the middle of a word, but only after an m and only in the last word of string: REGEX
  • So you want to trim whitespace from the end of the string but only if there are 3 or more spaces: REGEX
  • Want to find brand names with between one and three capitalized words in a list of drink ingredients: REGEX
Hopefully your interest is piqued and you now want to learn all about using regex like a wizard. If so, great, because I’m going to show you how! But first, a key R idiosyncrasy needs to be addressed.

R Idiosyncrasy


R by default recognizes a backslash “\” inside a string as the beginning of an escape sequence. Perl also uses a backslash to introduce its regex metacharacters, so there is a clash. In native perl “\d” is a metacharacter that matches any digit, but when R sees “\d” in a string it thinks you’re invoking an escape sequence that doesn’t exist and throws an error. To get the behaviour you’d expect, you need to escape the backslash itself so that R passes the un-escaped regex along. You do this by adding an extra backslash: “\\d” behaves exactly as you’d expect “\d” to.
This situation gets particularly comical when you want to use a literal backslash in a regular expression. In native perl “\\” gets you a literal backslash, but in R you need to escape BOTH backslashes to get the desired behaviour, so “\\\\” gets you a literal backslash (as I am typing this in R markdown, which by default follows R escaping rules, I just typed 8 backslashes to show you four, ugh).
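
A quick sketch of the doubling rule in action, with made-up strings. The thing to notice is that the R string "\\d" only holds two characters, a backslash and a d, which is exactly what the regex engine wants to see.

```r
# "\\d" in R source is the two-character string \d by the time the regex engine sees it.
digit_pattern <- "\\d"
pattern_length <- nchar(digit_pattern)
print(pattern_length)  # 2

has_digit <- grepl(digit_pattern, c("abc", "a1c"), perl = TRUE)
print(has_digit)  # FALSE TRUE

# A hypothetical Windows-style path; the actual string is C:\temp (7 characters).
path <- "C:\\temp"
writeLines(path)  # C:\temp

# Four backslashes in R source become the regex \\, which matches one literal backslash.
has_backslash <- grepl("\\\\", path, perl = TRUE)
print(has_backslash)  # TRUE
```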

First Steps (Wild-Cards and Quantifiers)


The first thing you’re going to need to learn to become a Perl-Like regex master is how to use wild-cards and quantifiers. You may have run into wild-cards in any number of different places; Google used to support them, for example. Wild-cards are regex speak for “just give me anything”.

Unless you’ve worked with regex before, you’ve probably never worked closely with quantifiers. Quantifiers are regex speak for “give me some number of these”. You may have seen the wild-cards “*” and “?” in DOS, where “?” is a wild-card meaning any one character and “*” is a wild-card with an indefinite quantifier, meaning any number of any character. In regex the two jobs are split between separate symbols:

Character   Meaning
.           Match any 1 character
*           Match ZERO or more of the previous character
+           Match ONE or more of the previous character
?           Match ZERO or ONE of the previous character
{N,M}       Match at least N but no more than M of the previous character
            Leaving M empty ({N,}) means N or more

With these tools at your disposal you can begin your odyssey into the world of pattern matching. Let’s see some examples:

Regex               Sample Matches
"Al.*"              "Alfred", "Albert", "Alphonso", "Alex likes going to the movies"
"5? ?Apples"        "5 Apples", "5Apples", " Apples", "Apples"
"(Na )+Batman!"     "Na Na Na Na Na Na Na Na Batman!" but not "Batman!"

As you may have noticed, the special quantifiers (“*”, “+”, and “?”) are all just convenient shorthands for (“{0,}”, “{1,}”, and “{0,1}”). You’ll find you use the special quantifiers more frequently.
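
If you want to try these out yourself, here is a minimal sketch using grepl with perl = TRUE. The extra test strings ("Bert", "55 Apples", and the a/b/c strings) are my own additions, and I have wrapped the Apples pattern in "^" and "$" (start and end of string) so that it has to account for the whole string.

```r
# Wild-card plus indefinite quantifier.
al_hits <- grepl("Al.*", c("Alfred", "Alex likes going to the movies", "Bert"), perl = TRUE)
print(al_hits)  # TRUE TRUE FALSE

# Optional digit and optional space; anchored so the whole string must fit the pattern.
apples <- c("5 Apples", "5Apples", " Apples", "Apples", "55 Apples")
apple_hits <- grepl("^5? ?Apples$", apples, perl = TRUE)
print(apple_hits)  # TRUE TRUE TRUE TRUE FALSE

# One or more repetitions of the group "Na ".
batman_hits <- grepl("(Na )+Batman!", c("Na Na Na Na Na Na Na Na Batman!", "Batman!"), perl = TRUE)
print(batman_hits)  # TRUE FALSE

# The shorthands agree with their curly-brace equivalents.
abc <- c("ac", "abc", "abbc")
star_same <- identical(grepl("^ab*c$", abc, perl = TRUE), grepl("^ab{0,}c$", abc, perl = TRUE))
plus_same <- identical(grepl("^ab+c$", abc, perl = TRUE), grepl("^ab{1,}c$", abc, perl = TRUE))
opt_same  <- identical(grepl("^ab?c$", abc, perl = TRUE), grepl("^ab{0,1}c$", abc, perl = TRUE))
print(c(star_same, plus_same, opt_same))  # TRUE TRUE TRUE
```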

Brackets, Groups, and Alternation


I’ve hinted at this next group of behaviours in the previous set of examples. The “(Na )+” I used to match the batman theme song is an example of grouping. Using parentheses groups a series of characters such that they can be quantified like a single character. The above regex states that I want one or more repetitions of the sequence “Na ” (including its trailing space).

Suppose you weren’t satisfied that you had found all the ways people transcribe the batman theme; for example, you have a friend who uses the phoneme “Da” to represent the notes in the song. Never fear, bracketed character classes to the rescue. You can allow the regular expression to choose one of several characters. Bracketed character classes are denoted with square brackets “[]”.

Lastly, maybe the vowel is all wrong too. Maybe you think people might be using phonemes like “Duh” or “Nah”. You can specify these specific combinations within a group, and let the regex match any group with an alternation. Alternations are specified with “|”.

Regex                     Sample Matches
"(Na )+Batman!"           "Na Na Na Na Na Na Na Na Batman!"
"([ND]a )+Batman!"        "Na Na Na Na Na Na Na Na Batman!" and "Da Na Na Na Da Na Na Na Batman!"
"[ND]((a)|(uh)|(ah))"     "Duh", "Na", "Dah"

Now you should be able to get just about any set of characters, in any amount you could possibly want.
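
Here is a short sketch of groups, bracketed classes, and alternation working together. The "La La" song and the "Nuh" and "Do" syllables are my own test cases, and I have added "^" and "$" (start and end of string) so partial matches don't sneak through.

```r
# Bracketed class inside a quantified group: each note may be "Na " or "Da ".
songs <- c("Na Na Na Na Batman!", "Da Na Na Na Batman!", "La La La La Batman!")
song_hits <- grepl("^([ND]a )+Batman!$", songs, perl = TRUE)
print(song_hits)  # TRUE TRUE FALSE

# Alternation: N or D followed by one of "a", "uh", or "ah".
syllables <- c("Duh", "Na", "Dah", "Nuh", "Do")
syllable_hits <- grepl("^[ND]((a)|(uh)|(ah))$", syllables, perl = TRUE)
print(syllable_hits)  # TRUE TRUE TRUE TRUE FALSE
```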

Positioning and Quantifier Modification


Another useful tool is the ability to specify where in a string you’d like to match. Two handy positioning modifiers are allowed in a regular expression. You can specify the beginning of a string with “^” and the end of a string with “$”.

Regex          Sample Matches
"^Hi"          "Hi there reader" but not "You there reader, Hi"
"Goodbye$"     "Dear reader, Goodbye" but not "Goodbye dear reader"
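
The same sample matches, checked with grepl as a quick sketch:

```r
# "^" pins the match to the start of the string.
hi_hits <- grepl("^Hi", c("Hi there reader", "You there reader, Hi"), perl = TRUE)
print(hi_hits)  # TRUE FALSE

# "$" pins the match to the end of the string.
bye_hits <- grepl("Goodbye$", c("Dear reader, Goodbye", "Goodbye dear reader"), perl = TRUE)
print(bye_hits)  # TRUE FALSE
```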

Quantifier modification is a topic I’ve touched on in a previous post, where I demonstrated how to process HTML to extract tag values. The question mark has a double meaning depending on where it is found in a regular expression. Normally it is treated as a quantifier meaning {0,1}; however, when it appears immediately after another quantifier, it instructs that quantifier to behave non-greedily.

Greedy is a reference to a quantifier’s habit of trying to capture as many characters as possible. If given a choice between matching one or seven characters, it will choose seven. We can modify this behaviour to just take one.
Take for example the body of a very simple HTML document:

<body>
<p> Generic things on the <b>Internet</b> have grown <b> tiresome</b></p>
</body>

If I were curious about all the things the author thought worthy of bolding, I might first try the regex
"<b>.+</b>"

But I would get one match: <b>Internet</b> have grown <b> tiresome</b>
My indefinite quantifier would match all the way to the very last </b> in the document, obviously not what I wanted. So instead I should tell the “+” quantifier to not be so greedy:
"<b>.+?</b>"

Then it would correctly match "<b>Internet</b>" and "<b> tiresome</b>".
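
Here is a sketch of the greedy versus non-greedy difference in base R, assuming the paragraph line from the sample document with both bold tags closed by </b>. I am using gregexpr and regmatches to pull the matches out; any other extraction helper would show the same thing.

```r
# The paragraph line from the sample document, assuming both bold tags are closed.
html <- "<p> Generic things on the <b>Internet</b> have grown <b> tiresome</b></p>"

# Greedy: ".+" runs all the way to the last </b>, giving one long match.
greedy <- regmatches(html, gregexpr("<b>.+</b>", html, perl = TRUE))[[1]]
print(greedy)  # "<b>Internet</b> have grown <b> tiresome</b>"

# Non-greedy: ".+?" stops at the first </b> it can, giving one match per tag pair.
lazy <- regmatches(html, gregexpr("<b>.+?</b>", html, perl = TRUE))[[1]]
print(lazy)  # "<b>Internet</b>" "<b> tiresome</b>"
```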

Special Characters


The last type of metacharacter I’d like to show you is the special character class, useful in bracketed character classes, on its own, or combined with quantifiers. Some useful character class metacharacters are:

Character   Meaning
\w          any word character - all numbers, letters, and underscores
\W          non-word character - anything except the above
\d          any single digit
\s          any whitespace character (space, tab, newline, and so on)
And with that you should be able to match just about any pattern in a string that your heart desires. Here are a few interesting examples.
Stringr comes pre-loaded with a method to trim whitespace from a string. With regex we can do that by saying match some number of spaces at the beginning or end of the string, like so [2]:

gsub(pattern = "(^ +)|( +$)", replacement = "", string)
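
As a quick check of the trimming pattern, here is a sketch on a made-up padded string, followed by the "only if there are 3 or more spaces" variant from earlier in the post. The second pattern is my own guess at one way to write that rule, not the only one.

```r
# Hypothetical padded string.
padded <- "   gin and tonic  "

# Strip runs of spaces at the start or at the end of the string.
trimmed <- gsub(pattern = "(^ +)|( +$)", replacement = "", padded)
print(trimmed)  # "gin and tonic"

# Only trim trailing spaces when there are three or more of them.
trailing <- c("two  ", "three   ", "five     ")
trimmed_3plus <- gsub(" {3,}$", "", trailing, perl = TRUE)
print(trimmed_3plus)  # "two  " "three" "five"
```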

Or, as I mentioned earlier, maybe you want to find brand names in a set of ingredients. The ingredient set was comma delimited, and brands of note were tagged with &reg;, the HTML code for the registered trademark symbol (®). Brand names also have each word capitalized, tend to be between one and four words long, and may include hyphens. To pull out each brand in the ingredient list you could do this [3]:

matchPull(pattern = "([A-Z][\\w'-]*? ){0,3}[A-Z][\\w'-]+?&reg;")

The one thing you may not recognize is the bracketed character class "[A-Z]", which means exactly what it looks like: all the capitals from A through Z.
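
If you don't have matchPull handy, the same idea can be sketched with base R's gregexpr and regmatches. The ingredient string below is invented for illustration, and I have put the hyphen last inside the bracketed class, where it is always read as a literal hyphen.

```r
# Hypothetical comma-delimited ingredient list with &reg; marking brand names.
ingredients <- "2 oz Grey Goose&reg; vodka, 1 oz Cointreau&reg;, splash of lime juice"

# Up to three capitalized words followed by a space, then a final capitalized word and &reg;.
brand_pattern <- "([A-Z][\\w'-]*? ){0,3}[A-Z][\\w'-]+?&reg;"

# Search for the positions, then pull the matched text out.
brands <- regmatches(ingredients, gregexpr(brand_pattern, ingredients, perl = TRUE))[[1]]
print(brands)  # "Grey Goose&reg;" "Cointreau&reg;"
```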

Outro


I hope from this post you’ve learned a little bit about the application of regular expressions in R and can see some opportunities to use them in your own work. The ability to precisely and effectively specify exactly the string you’re interested in makes performing any text mining task orders of magnitude easier. Also, if you’re just interested in organizing and curating data sets, for thesis work for example, the ability to accurately edit information stored in strings is a massive advantage over doing manual edits in software such as Excel.

As always, thanks for reading!

-Chris

  1. I recently discovered the wonderful package stringr, another of Hadley Wickham’s packages. The syntax for specifying perl-like mode with stringr functions is different: you just wrap your regex string in a call to the perl function like so: stringr_foo("string", pattern = perl("pattern")). Stringr is a dependency of several of Wickham’s other packages so you likely already have it installed.
  2. The function gsub is R’s equivalent of find and replace all.
  3. The matchPull function is my convenience function for finding and extracting matched text from a string. It is similar in idea to str_extract from stringr, but I wrote matchPull before I knew about stringr, and I’m too lazy to learn someone else’s convenience function when I have my own. Plus, as a holdover from my days coding in Java, I think underscores in function names are yucky. Code for matchPull is available in my post here and also comes in my friendlyShiny package.