Thursday 11 August 2011

Simple Regex #4: Assertions

That Old Saw

Another Regex question, another blog post! This time, Colleague A wants to highlight search terms by inserting some CSS spans into his HTML. It's the old problem of ignoring the content of tags during substitution. For example, suppose he highlights the search term school by applying the CSS class xred. This causes injection of spans like this into his output:
Alan went to class at <span class="xred">school</span> next day.
So far the rendering is as expected:
Alan went to class at school next day.
Trouble is, he next wants to highlight the search term class in xgreen:
Alan went to class at school next day.
Doing another simple replace operation would cause the new, "false positive" instances of the term class (occurring within the previously added HTML tags) to be incorrectly incorporated into new xgreen spans, resulting instead in invalid markup, and an onscreen train wreck.
Alan went to class at class="xred">school next day.
The following method of avoiding substitutions within tags is far from the best available solution, even among those offered by Regex. However, it does provide a conveniently simple example to let me introduce another Regex concept, namely Assertions.

Zero Width Assertions

In the context of regular expressions, an assertion is simply a statement about the context of the current match. The simplest examples are the "begins with", "ends with" and "exact match" assertions, which use the ^ and $ characters to match the beginning and end of the line, respectively. Suppose our input is 12345. Then the pattern 234 will match this successfully. The pattern ^234 will not, because the caret ^ constrains the matching to occur at the start of the input string, and ours does not start with 234 (the pattern ^123 will of course succeed). Similarly in the pattern 234$, the dollar $ constrains the matching to occur at the end of the input string, and so this will also fail to match our input, whereas 345$ will succeed. Finally, using both assertions, ^12345$ will match only the entire input string 12345, while any other sub-pattern such as ^1234$ fails.

These two meta characters are examples of atomic zero-width assertions, also known as anchors. Zero-width means that they do not cause the engine to advance through the string or consume characters; they just assert something about the current position in the input. There are several other anchors, but now I want to turn to more general assertions.

Assertion Types

Both of the above examples can be characterised as positive assertions, in that they require something to be true about the current position; namely that it is the start, or end, of the line. You could imagine another assertion type, where you want to match a particular sequence anywhere except either or both of those positions. And yes, such negative assertions are available, as we'll soon see.

Another characteristic of assertions is the direction they face. The caret ^ is termed a look-behind assertion; it states that no input exists before the current position. Similarly the dollar $ is termed look-ahead, as it states there's no further input available beyond the current position.

Finally, the new assertions I'm about to introduce, while still zero-width, are no longer atomic. Instead of referring to known fixed points, such as the beginning or end of the string, line, previous match, or word/non-word boundary, these new assertions contain their own independent subexpression patterns, which they assert do (or do not) occur immediately before (or after) the current position in the input. The parentheses here reflect the difference between positive and negative assertions, and between look-behind and look-ahead behaviour. So there have to be four types, right?

They all use syntax similar to grouping constructs. Positive assertions are indicated by =, negative by !. By default these are assumed to be look-ahead, unless prefixed by < to signify look-behind.
positive look-ahead: (?= subexpression)
negative look-ahead: (?! subexpression)
positive look-behind: (?<= subexpression) ◀ this is the one we are going to use
negative look-behind: (?<! subexpression)
Back To Highlighting

Refer to MSDN for examples of each of these. Now I just want to finish by showing how a zero-width positive look-behind assertion can be used to protect the content of those HTML tags.

How can we write a pattern to match search terms ignoring tag contents? If we're inside a tag, then there's a < somewhere to our left, as yet unmatched by a closing >, while if we're outside, then either there are no < on the left, or else every such < is matched by a corresponding > also on our left. In other words, we want everything on our left to consist of: start of line ^, followed by any number of (a) non-< characters, or else (b) < characters with eventual matching > characters:
^([^<]|<[^>]*>)*
Finally, wrap that as the subexpression in the third template above, and we're done:
(?<=^([^<]|<[^>]*>)*)
Code Sample
private static string Highlight(string input, string search, string cssClassName)

{

  const string assertion = "(?<=^([^<]|<[^>]*>)*)";

  var pattern = string.Format("{0}{1}", assertion, search);

  return Regex.Replace(

      input,

      pattern,

      match => string.Format("<span class=\"{0}\">{1}</span>", cssClassName, match.Value));

}



private static string Test()

{

  return Highlight(

      "<span class=\"ignorethisclass\">This</span> is the class.",

      "class",

      "highlight");

}



// Output: <span class="ignorethisclass">This</span> is the <span class="highlight">class</span>.

Disclaimer

Remember, this is only an illustrative example. As a practical HTML postprocessing solution, there's still a number of holes in it!

No comments:

Post a Comment