Regex Not So Regular?

Posted on May 17, 2011 by Jeremy Jordan in Cross-Platform, VisualWorksCincom, Cincom Smalltalk, Cincom VisualWorks, Regex

Travis Griggs recently wrote on his blog:

There was a discussion about the Regex11 package. Regex11 was originally written by Vassili Bykov as a Regex library for Smalltalk. It’s actually been around a while and served the Smalltalk community pretty well, I think, in part due to Vassili’s excellent coding abilities.

I think it’s really cool that someone can sit down and write a complete Regex implementation in Smalltalk. This is part of the Smalltalk ethos—being able to invent your own future and write your own software in any area you like. On the other hand, as I read the comments, I found myself wondering, “It’s great that you can do this, but does that mean you should?” Of course, the answer varies. It depends on what your goals are. A certain part of me feels that I don’t really want to care this much about how the implementation works. Isn’t Regex one of those things that’s grown up enough now? Isn’t it something we can just use via off-the-shelf services provided by the host platforms? If we just reuse those, we might get some speed (benefits of thousands of code monkeys tuning and tweaking them over the years), and some standardization.

So I went on a little journey. Read on if what I discovered is of interest.

The Package

I put my work in a package called ExRegex. As in “External Regex,” using the External Interface features of Cincom® VisualWorks® to bind to an outside-of-smalltalk shared library. Plus I just liked the way ExRegex looked almost palindromic. And I made a test package called ExRegex-Tests. No surprises here, pretty standard operating procedure.

I only began to implement the substitution stuff; for now, it’s just good for matching. One thing I did do differently than Regex11 was employ a symmetric pair of binary selectors for doing Regex matches.

'(a|b)c' ?= 'ac' --> true
'Travis' =? '(Senior|Griggs)' --> false

The ? character always leans toward the side with the Regex pattern.

The Tests

I stole the tests from the Regex11-Testing package. I made my own SUnitToo copy of it. I soon found that I wasn’t satisfied with them. The original tests have a single test method, but said method fetches a clauseArray that is a series of inputs and expected outputs. There are three or four relatively involved methods between the loop in the single test method and actually executing a single Regex match.

This means that when you get a failure, you don’t know much. First you have to dig through the little framework of interpreting the clauseArray. When I’m trying to understand how someone’s Regex framework works, I don’t want to understand how their Regex test framework works as well. I just want to see very simple lines of code; the kind I would put in a workspace.

Secondly, as soon as you hit a failure, you’re done. The clauseArray has 137 entries in it. So if it fails for any reason (error or assertion failure) on test number 22, you have no idea how the others work. Is it just this one test? Or do others fail too? Is there a common pattern among the failures?

I think we forget sometimes that just as easy as it is to write a little DSL or interpretation framework, it’s just as easy to write a little bit of code that interprets that into real Smalltalk methods. It’s easy to generate code, it’s easy to compile it and it’s easy to install it. Here’s what I did for this particular case:

RegexTest new clauseArray keysAndValuesDo:
[:index :array |
array size > 2
ifTrue:
[ws := String new writeStream.
ws
nextPutAll: 'match';
print: index.
ws nextPutAll: ' <test> '.
ws nextPutAll: ' | regex | '.
ws nextPutAll: ' regex := ' , array first printString , ' exRegex. '.
2 to: array size
by: 3
do:
[:n |
ws nextPutAll: ' self '.
ws
    nextPutAll: ((array at: n + 1) ifTrue: [' assert: '] ifFalse: [' deny: ']).
ws
    nextPutAll: ' (regex match: ';
    print: (array at: n);
    nextPutAll: ').'].
RegexTest compile: (RBParser parseMethod: ws contents) formattedCode
classified: 'tests match']]

I didn’t even have to bother to format the code; I let the RB services do that for me. What I ended up with were nice, easy methods to read like this:

And when I run the tests from within the IDE, I get nice feedback like this:

This made it much easier to unravel things that, well, started to unravel after that.

So Much for Regularity

Regularity means (among other things) “conforming to a standard or pattern.” I’m not sure a more ironic name was ever chosen as a moniker for a “standard set” of pattern-matching. The C library interface is pretty standard. You have four basic functions:

regcomp() – Builds up a Regex structure based on a pattern and some flags.
regexec() – Compares an input source against the Regex structure for matches.
regerror() – Makes human-readable strings for your Regex structure when parse errors exist.
regfree() – Used to free the various chunks of memory associated with the regfree structure.

I knew Regexs varied a little from platform to platform. What surprised me was that the Regex11 did stuff the C library one couldn’t (e.g., \d as sugar for [0-9] and \s for separators). And vice versa, the C library variant will do [0-9]{1,3} (match 1 to 3 digits), but the Smalltalk Regex11 can’t do the {1,3} thing. You’d have to live with + (1 or more) or repeat it three times.

But the fun didn’t stop there. While both are part of the standard libc, the BSD variant that one finds on MacOSX is different than what one finds under GNU platforms, such as Linux, and not just in their behavior. While the 4 C functions are the same between the two, and the same symbolic defines for the flags exist, they actually map to different values. For example, the option RE_NOSUB (what you use when you just care if it matches or not, not where) is a 0×08 on a GNU platform and 0×04 on a BSD platform. And the regex_t structure, which you’re responsible to allocate the memory for, is entirely different between the two. For a C program using these libraries, you just use the defines and you’re fine. But for a Smalltalk or any other not-compiled-by-the-C-Compiler/Preprocessor language that more intimate knowledge of what the symbolic values mean becomes more important. Getting around these differences was interesting.

Regular? Anything but. This is kind of alarming actually. It’s not like these things give you an error when you try to use something like \d. It really just tries to match an escaped d (rather than a digit), and now it fails. The upshot is that you don’t have much portability with these things, and as you move from environment to environment, language to language, platform to platform, you’d better have a pretty good test suite around your Regexes to make sure they’re still matching the way they were when you developed your original application. On this aspect, this is one time where the “do it all in Smalltalk” actually has a compelling advantage, depending on your constraints.

There’s an excellent exhaustive table at this link that shows a number of common implementations and how they compare across a whole plethora of different options.

The Speed Factor

What about the speed issue? Again, it depends. For doing simple matches, Regex11 will come out faster; less overhead in going across the C boundaries. Let’s take a Regex for matching IP addresses:

'.*(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?).*'.

If we match this against simple strings like ‘192.168.3.128’ and ‘abc’, Regex11 is faster—as much as three times faster. But if we match against larger strings, such as the guts of the IPSocketAddress class>>allAboutHostAddresses method (an expected match) and the result of Object comment (shouldn’t match), then the tables turn quite a bit. The Smalltalk version is suddenly 210 times slower! This measures both the time to create the Regex object as well as do the match.

So, I guess it depends on what kind of Regex’s you’re doing, and what kinds of frequencies (remember, it should always be “Make it Work, Make it Right, Make it Fast”).

I don’t exactly know why the pattern above actually requires the leading and trailing* values to match the ip patterns mid source. I had to add it for Regex11, but the ExRegex interface didn’t need it.

Regex Not So Regular?

The Package

The Tests

So Much for Regularity

The Speed Factor

Related posts: