Saturday, 4 June 2011

Simple Regex #3½: Balancing Groups

Undo Revisited

I've been taken to task for failing to provide a Regex-based solution to the problem in the previous article of this series, namely, removal of backtracked elements from a path. One correspondent even left a comment containing the missing Regex solution! Another complained that my I/O path specific .Net workaround would be zero help to someone just happening upon my article, maybe while researching some other aspect of the underlying patterns.

It's a fair cop, and the only way to salvage even a wee bit of reputation is to provide a full explanation of the proposed Regex solution. In mitigation, the proposed mechanism (Balancing Group Definitions, described below) is also unique to the .Net implementation of Regex. But hey, I'm only the piano player...

This is one of those curious cases where the attempt to generalise the problem reveals a simplifying hidden structure. The first underlying pattern mentioned above is the Undo (or sometimes the Delete) pattern. Consider the original example:
The entire midsection "\client\forms\..\.." is redundant; we'd like to delete such missteps, obtaining a "normalised" answer such as
Consider the operating system's behaviour as it attempts to follow the original verbose path. It acts like a finite state machine. Each time it encounters a subdirectory name, like \client or \forms, it enters that subdirectory. Each time it encounters the parent token \.. it reverses back out of the most recently entered subdirectory.

But look what happens when we replace the phrases subdirectory name with letter, and parent token with backspace. Now the problem can be seen to be completely analogous with that of correcting a typed word.

A Typo Processor

Let's use the # symbol to represent the pressing of the backspace key. Then our new problem is to take an input comprising a sequence of letters and # symbols, and parsing it, recover the correctly typed word. For example, if I misspell yield as yeild but correct the mistake manually before completing the word, then the input string (the received sequence of keystrokes) might be yei##ield, and the required output is, of course, yield.

The pattern that our Typo Detector has to detect is: a number (one or more) of # symbols, immediately preceded by exactly that same number of letters. This then is the pattern - in the example case, ei## - that must be removed, in order to, erm, yield the corrected word.

Following the naive reasoning of the original article, we might try using something like \w# (one "word" character, followed by a backspace token) to match the typo pattern. And just as before, this fails in the second simplest test case (the ei## of the previous paragraph). What we need is \w\w## to match that case. But then we also need \w\w\w### to detect the case where three consecutive mistyped characters are corrected. And so on. In fact what we need is approximated by the pseudopattern

where n cannot be specified in advance, but
  • is greater than zero, and
  • represents the same number in both positions.
Actually, we want even more than that. What about cases where I make a mistake while correcting a previous typo? For example yei#i##ield. Ideally we'd also like permuted edit patterns like this one, \w\w#\w##, extracted as entire single units. Unbelievably, this functionality is readily provided by the Microsoft proprietary Regex extension known as...

The Balancing Group Definition

But first, let's revise the Group concept from the ground up. Groups are simply added to a Regex by surrounding any pattern subexpression in parentheses. For example, the pattern

not only matches any BT landline number in Glasgow, but also allows the central three-digit exchange code to be retrieved conveniently as a group for use elsewhere. These anonymous groups are retrieved by numerical index, which can quickly get a bit too fiddly for comfort.

The next level of usability is the named group. The prefix ? within a group introduces its name, which should be enclosed in either apostrophes or angle brackets. For example, the pattern

0141(?'exchange' \d{3})\d{4}
allows retrieval of the three-digit exchange group by name. And we are nearly there. Because it's an interesting implementation detail about groups, and in particular named groups, that they are pushed on to a stack as they are identified. This means that they can also be popped off. If we can push a group for each word character typed, pop one for each backspace token, and determine when the stack becomes empty, then we can detect all redundant groups in the input stream.

The balancing group definition will delete ("pop") a previous group, and replace it ("push") with the interval between the previously defined group and the current group. Syntactically, we append the previous group name to the current one, using a hyphen for separation. So it looks like this:

(?'curr-prev' subexpression)
The current group name curr is optional, since nothing says that all groups have to be named. The previous group name prev is mandatory; we do need some way of referring to it, since we're about to pop it.

Here is the magic pattern:

Since it detects an erroneous word character which is subsequently deleted, it begins naturally enough with \w( and ends with )#. Hey this is easy! Now look at the first interior pattern, west of the asterisk:

This is saying a couple of things:
  • If there's another word character \w then push a new group, which we call N.
  • Otherwise if there's a backspace token # then pop the most recent group named N.
The asterisk allows this process to happen any number of times, including zero. This corresponds to typing some characters, deleting a few, typing some more, and so on. Penultimately, east of the asterisk and just prior to the final #, we find another expression:

This is a trick! If there are still undeleted word character groups on the stack, i.e. an N that hasn't yet been popped, this construction forces a match against the pattern (?!) about which all we need know is that it will fail. Well okay, it's called an expressionless negative lookahead, if you really must know everything! Otherwise so long as the final # appears, we have found a wholly redundant segment.

Back To The Path

If in our magic pattern we now replace \w with a pattern for a subdirectory name, such as \\\w+, and if we further replace # with the parent directory pattern \\\.\. throughout, then we transform it into the path normalizer that was the subject of the original article. The details are horrendous, because filesystem subdirectory names can contain other than word characters, including dots. We have to exclude \.. from these, and \. requires even more special handling. Nevertheless we have solved the fundamental problem of applying Regex here, and as it so happens that all of my subdirectory names were well behaved and dotless in any case, this solution would have worked for me.

Further Generalisation

The Undo/Delete pattern itself represents a simple subset of the problems that these Regex Grouping Constructs are capable of solving. Refer to the MSDN Regex topic on the applicability of Balancing Group Definitions for a complete guide to applying these constructs in any given recursively nested construct parsing context, such as matching parentheses, opening and closing brackets, quotation marks, mathematical expressions, or lines of program code containing multiple nested calls (most of which cannot be parsed in Regex implementations other than the .Net one).

Though be warned, the expository technical language used over there assumes at least some familiarity with the domain of abstract formal languages. Perhaps a better, more gentle introduction to balancing groups can be found in either of these articles:
Update (20 June 2011): I was going to share my .NET Regex Tester online with you, before I noticed that Derek Slager's is, in his own words, better... so, here's a link to his:
Good luck!

No comments:

Post a Comment