Tuesday 8 February 2011

Simple Regex #2: Validation

What's a Valid XML Element Name?

Sometimes we want to convert customer-entered data to XML. Sometimes we want to use it for an element name. Obviously it'll need some sanitising, so what should we escape? The XML specification is a wee bit twisty on this question, its grammar defining a Name roughly, i.e. ignoring so-called combining characters and extenders, like this:
NameStartChar ::= Letter | '_' | ':'
NameChar ::= NameStartChar | Digit | '.' | '-'
Name ::= NameStartChar (NameChar)*
That's similar to the definition of an identifier in many languages, but with the addition of a few specific punctuation marks. The twist comes when you consider namespaces. The XML specification notes that the Namespaces in XML recommendation assigns a meaning to names containing colon characters, and that authors should therefore not use the colon in XML names except for namespace purposes. Even though XML processors must still accept the colon as a valid name character, as per the above syntax, using it gives off the odour of a practice to avoid. So we go with this:
NameStartChar ::= Letter | '_'
NameChar ::= NameStartChar | Digit | '.' | '-'
Name ::= NameStartChar (NameChar)*
No Colons Then?

That's right. Our element names start with a letter or underscore, then continue with any number of these, possibly in combination with digits, periods, and hyphens. To put it another way (inexactly, but in practice acceptably, for my purpose): an element name is any nonempty sequence of word characters (letters, numbers, underscores), periods, and hyphens; and it must start with either a letter or an underscore.
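
The rule above also works as a validator, which is handy for checking a name before deciding whether to sanitise it. Here's a minimal sketch, with the caveat that the class and method names (ElementNameCheck, IsElementName) are made up for illustration, and that it inherits the same inexactness: \w admits a few characters the XML grammar doesn't.

```csharp
using System.Text.RegularExpressions;

public static class ElementNameCheck
{
    // Hypothetical helper, not part of the original post.
    // First character: a letter or an underscore.
    // Remaining characters: word characters, periods, or hyphens.
    // \z anchors at the very end of the string, whereas $ would also
    // match just before a trailing newline.
    private static readonly Regex NamePattern =
        new Regex(@"^[\p{L}_][-.\w]*\z");

    public static bool IsElementName(string candidate)
    {
        // A null candidate is simply reported as invalid.
        return candidate != null && NamePattern.IsMatch(candidate);
    }
}
```

So IsElementName("first_name") and IsElementName("_2nd-item") are true, while "2nd-item" (starts with a digit), "ns:tag" (contains a colon), "has space", and the empty string all come back false.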

In the interests of localization, rather than the parochial a-zA-Z_0-9, we should use the Regex word character class \w to represent, erm, word characters. That just leaves the period and hyphen to be mopped up in the main sequence. Similarly, when it comes to specifying the initial letter, rather than a-zA-Z, we should use the letter class \p{L} built for just this purpose:
// Requires: using System.Text; using System.Text.RegularExpressions;
private static string ToElementName(string input)
{
  // Replace all non-hyphen/period/word characters with underscores.
  var result = new StringBuilder(Regex.Replace(input, @"[^-.\w]", "_"));
  // If input doesn't start with a letter or underscore, prepend an underscore.
  if (!Regex.IsMatch(input, @"^[\p{L}_]"))
    result.Insert(0, '_');
  // Done.
  return result.ToString();
}
A point to note about the first pattern [^-.\w] is that neither the hyphen nor the period need be escaped. Within brackets, the period represents itself, rather than being a wildcard; and the hyphen is similarly literal (as opposed to indicating a range) when it appears as the first item in a set, which here means straight after the negating ^.
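
If the hyphen-first convention feels too subtle, there are other spellings of the same set. The sketch below (HyphenPlacement and ReplaceWithEach are illustrative names, not anything from the framework) runs three variants over the same input: hyphen first, hyphen last, and hyphen escaped.

```csharp
using System.Text.RegularExpressions;

public static class HyphenPlacement
{
    // Three spellings of the same negated set:
    // hyphen first, hyphen last, hyphen escaped.
    public static readonly string[] Patterns =
    {
        @"[^-.\w]",
        @"[^.\w-]",
        @"[^.\-\w]"
    };

    // Applies each pattern to the same input, replacing every
    // match with an underscore, and returns one result per pattern.
    public static string[] ReplaceWithEach(string input)
    {
        var results = new string[Patterns.Length];
        for (int i = 0; i < Patterns.Length; i++)
            results[i] = Regex.Replace(input, Patterns[i], "_");
        return results;
    }
}
```

Calling HyphenPlacement.ReplaceWithEach("a-b.c d") gives "a-b.c_d" three times over: the hyphen and period survive in every variant, and only the space is replaced. Which spelling to use is a matter of taste; the escaped form is arguably the hardest to misread.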

Other Useful Character Classes

Why yes, there are some others, I'm glad you asked. These two are probably the droids you're looking for: \p{Lu} for uppercase letters, and \p{Ll} for their lowercase comrades. For the full story about Character Classes in C#, go to http://msdn.microsoft.com/en-us/library/20bw873z.aspx.
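
To see those two classes earning their keep, here's a small sketch. The helper names (CaseClasses, CountUppercase and friends) are invented for this example. The sample text uses an accented capital, written as a \u escape, to show that the classes aren't limited to A-Z and a-z.

```csharp
using System.Text.RegularExpressions;

public static class CaseClasses
{
    // Counts characters in the Unicode "Letter, uppercase" category.
    public static int CountUppercase(string text)
    {
        return Regex.Matches(text, @"\p{Lu}").Count;
    }

    // Counts characters in the Unicode "Letter, lowercase" category.
    public static int CountLowercase(string text)
    {
        return Regex.Matches(text, @"\p{Ll}").Count;
    }

    // True when the very first character is an uppercase letter.
    public static bool StartsWithUppercase(string text)
    {
        return Regex.IsMatch(text, @"^\p{Lu}");
    }
}
```

For "\u00C9cole Normale" (that's "École Normale"), CountUppercase returns 2 and CountLowercase returns 10, and StartsWithUppercase is true even though the first letter is nowhere near A-Z.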
