RE2: a principled approach to regular expression matching
Thursday, March 11, 2010 | 12:00 PM
Labels: re2, regular expressions
Regular expressions are one of computer science's shining examples of the benefits of good computer science theory. They were originally developed by theorists as a way to describe infinite sets, but Ken Thompson introduced them to programmers as a way to describe text patterns in his implementation of the text editor QED for CTSS. Dennis Ritchie followed suit in his own implementation of QED, for GE-TSS. Thompson and Ritchie would go on to create Unix, and they brought regular expressions with them. By the late 1970s, regular expressions were a key feature of the Unix landscape, in tools such as ed, sed, grep, egrep, awk, and lex. They remain a key feature of the open source landscape today, in those venerable Unix tools and at the core of new languages like Perl, Python, and JavaScript.
The feature-rich regular expression implementations of today are based on a backtracking search with a potential for exponential run time and unbounded stack usage. At Google, we use regular expressions as part of the interface to many external and internal systems, including Code Search, Sawzall, and Bigtable. Those systems process large amounts of data; exponential run time would be a serious problem. On a more practical note, these are multithreaded C++ programs with fixed-size stacks: the unbounded stack usage in typical regular expression implementations leads to stack overflows and server crashes. To solve both problems, we've built a new regular expression engine, called RE2, which is based on automata theory and guarantees that searches complete in linear time with respect to the size of the input and in a fixed amount of stack space.
Today, we released RE2 as an open source project. It's a mostly drop-in replacement for PCRE's C++ bindings
and is available under a BSD-style license. See the RE2 project page for details.

6 comments:
Nickname unavailable said...
This is incredible, well done!
March 12, 2010 3:41 AM
Damen said...
Awesome!
Thanks for making this research open source.
March 12, 2010 7:09 AM
None Type said...
I cannot tell you how many times I wanted a PRCE library for C++ but was too lazy to write one and instead resorted to specific parsing using the standard string library. Thank you, once again, Google.
March 12, 2010 7:18 AM
FistsOfTinsel said...
I built the code under Cygwin, but I get the following failure for one of the tests - it seems to be a legacy test (in re2_test.cc). Is it important?
TEST(RE2, DeepRecursion) {
// Test for deep stack recursion. This would fail with a
// segmentation violation due to stack overflow before pcre was
// patched.
// Again, a PCRE legacy test. RE2 doesn't recurse.
rlimit rl;
rl.rlim_max = rl.rlim_cur = 1 * 1024 * 1024;
CHECK_EQ(0, setrlimit(RLIMIT_STACK, &rl));
string comment("/*");
string a(16384, 'a');
comment += a;
comment += "*/";
RE2 re("((?:\\s|//.*\n|/[*](?:\n|.)*?[*]/)*)");
CHECK(RE2::FullMatch(comment, re));
}
March 12, 2010 2:35 PM
FistsOfTinsel said...
Are the compiled regexps safe to use in a multithreaded environment? Specifically, we are using the RogueWave SourcePro C++, and I've discovered that I can't make my regexp objects static and use them across multiple threads, as they apparently have member data they use during the search routines. I'd like to be able to declare unchanging expressions statically so as not to incur compilation costs every time I need to use them.
March 12, 2010 2:39 PM
hzmester said...
I have added re2 results to my regex comparison page.
Excellent work! RE2 is really fast. Clever state caching seems well worth the efforts.
March 20, 2010 5:16 AM
Post a Comment