Implementing Real-World Data Input Validation using Regular Expressions

Francis Norton shows how to use regular expressions to fulfil some real world data validation requirements, demonstrating techniques ranging from simple number format checks, to complex string validation that requires use of regex's powerful "lookahead" feature.

This article explains how to use .NET regular expressions to enforce the kind of logically complex input validation requirements that we sometimes confront in real specifications. This will allow us to start with basics and go on to exploit some fairly advanced features.

Because regular expressions are powerful and complex enough to be the subject of entire books, I’m going to stick strictly to their use in validation. I will entirely ignore otherwise interesting and valid topics like performance, comparison with non-.NET implementations, token extraction and replacement, in order to take you somewhere new on this topic while keeping some clarity and focus.

I will test the regexes using the PowerShell command line, which is a free download from Microsoft. Microsoft’s architectural plan is that you can access the same .NET regex library whatever you’re writing, from ASP.NET (dead easy) to SQL Server 2005 (slightly more difficult; a reference at the end of the article gives further details), so the regular expression skills you learn in one context are directly transferable to another.

Some real validation requirements

These all come from real specs; I’ve simply selected some examples and arranged them in order of increasing logical complexity.

  1. Num: Numbers only. Can be negative or positive, for example 1234 or -1234.
  2. Dec: May be fixed length. A numeric amount (positive or negative) including a maximum of 2 decimal places unless stated otherwise, for example 12345.5, 12345, 12345.50 or -12345.50 are all valid Dec 7 inputs.
  3. UK Bank Sort Code: Six digits, either xx-xx-xx or xxxxxx input format allowed.
  4. House: Alphanumeric. Must not include the strings ‘PO Box’, ‘P.O. Box’, ‘P.O.Box’, ‘P.O Box’ or ‘POBox’ (any case).

Basics: Implementing NUM using “^”…”$”, “[“…”]”, “?” and “+”

This section will illustrate some core regex concepts and syntax, so if you’re familiar with the use of the above symbols in patterns, feel free to skip forwards.

Let’s take another look at the Num requirement:

  • Numbers only. Can be negative or positive, for example 1234 or -1234.

I take this to mean that we’ll accept anything consisting of an optional minus sign followed by one or more digits.

We can specify the “one or more digits” part by using square brackets and a dash for character ranges, and the plus sign (“+”) for repetition. Let’s start with character ranges, in this case the range of characters from “0” to “9”:
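As a minimal sketch at the PowerShell prompt (the sample inputs and the variable name `$digit` are illustrative):

```powershell
# A character class with a range: matches any single character from 0 to 9.
$digit = '[0-9]'

[Regex]::IsMatch('7', $digit)   # True
[Regex]::IsMatch('x', $digit)   # False
```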

NOTE: If you’re new to PowerShell, you can read “[Regex]::IsMatch” as “use the static method ‘IsMatch’ of the .NET library ‘Regex’”. In fact we could use PowerShell’s “-cmatch” operator, which for a single input string gives the same True/False answer as a [Regex]::IsMatch() expression, but I like the clarity of using the .NET class directly.

The square bracket expression is a character class. In effect, it gives us a concise way of doing a character-level OR expression, so “[0-9]” can be understood as “does the input character equal 0, 1, 2…or 9?” The dash (“-”) acts as a range operator in this context, so “[0-9]” is exactly equivalent to “[0123456789]”.

At the moment we’re simply testing whether the test string contains a match for the regex, which would be fine for searches, but when we’re doing validation we want to ensure that the test string doesn’t also contain non-matching text. For example:
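A small illustration of the problem (the input 'x7y' is an illustrative value, not taken from the spec):

```powershell
# The pattern is found somewhere inside the input, so IsMatch reports success
# even though the input as a whole is clearly not a digit.
$unanchored = '[0-9]'

[Regex]::IsMatch('x7y', $unanchored)   # True
```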

We can stop that behaviour using the special characters “^” and “$” to specify that the regex pattern must match from the start to the end of the test string:
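A sketch of the anchored version, again with illustrative inputs:

```powershell
# ^ anchors the match to the start of the input and $ to the end,
# so the whole input must be exactly one digit.
$oneDigit = '^[0-9]$'

[Regex]::IsMatch('7', $oneDigit)     # True
[Regex]::IsMatch('x7y', $oneDigit)   # False
[Regex]::IsMatch('77', $oneDigit)    # False
```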

Now we’ll make the regex accept one or more digits by using the “+” modifier on the “[0-9]” character class. The “+” means, in general, “give me one or more matches for whatever I’ve been attached to”, so in this case means “give me one or more digits”.
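For instance (sample inputs are illustrative):

```powershell
# + repeats the character class: one or more digits, and nothing else.
$digits = '^[0-9]+$'

[Regex]::IsMatch('1234', $digits)   # True
[Regex]::IsMatch('', $digits)       # False
[Regex]::IsMatch('12a4', $digits)   # False
```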

That just leaves the optional minus sign. The good news and the bad news is that outside a character class (like “[0-9]”) the dash is just a literal character (good news because it means we won’t have to escape it; bad news because treating the same character as a literal in some parts of a pattern and a special character in others is a triumph of terseness over readability). We’ll make it optional with the “?” modifier, which can be read as “give me zero or one matches”.
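Putting this together gives one possible pattern for the Num requirement (the variable name and sample inputs are illustrative):

```powershell
# Optional leading minus sign, then one or more digits.
$num = '^-?[0-9]+$'

[Regex]::IsMatch('1234', $num)     # True
[Regex]::IsMatch('-1234', $num)    # True
[Regex]::IsMatch('--1234', $num)   # False
[Regex]::IsMatch('12-34', $num)    # False
```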

Using “{” … “}”, “(” … “)”, “\”, and “d” to implement Repetition

These “?” and “+” modifiers are very nice and convenient, but suppose we have a counting system that can express more than None, One, and Many? Let’s take another look at the DECIMAL format requirement:

  • Dec: May be fixed length. A numeric amount (positive or negative) including a maximum of 2 decimal places unless stated otherwise, for example 12345.5, 12345, 12345.50 or -12345.50 are all valid Dec 7 inputs

Ignoring the fixed length option for now, let’s look at the decimal section. It seems that we’re expected to accept numbers with a decimal point and one or two decimals, or with no decimal point and no decimals at all. Our first challenge is the decimal point. We want to use the “.” sign, but this gives us some strange behaviour:
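A short illustration, using the made-up inputs '1.5' and '1x5':

```powershell
# An unescaped dot in the pattern: intended as a decimal point.
$naiveDot = '^1.5$'

[Regex]::IsMatch('1.5', $naiveDot)   # True
[Regex]::IsMatch('1x5', $naiveDot)   # True, which is not what we wanted
```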

We’ve discovered that “.” is a special character in regular expressions – in fact it matches any character. We need to escape it with the “\” prefix to make it a literal:
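With the escape in place, the same illustrative inputs behave as intended:

```powershell
# \. matches only a literal full stop.
$literalDot = '^1\.5$'

[Regex]::IsMatch('1.5', $literalDot)   # True
[Regex]::IsMatch('1x5', $literalDot)   # False
```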

The next step is to use the braces modifier to specify that we want one to two digits following the decimal point – we can put the minimum and maximum number of matches (in our case 1 and 2, which we’ll test with zero to three) inside the “{” and “}” curly brackets:
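A sketch of that test, with zero to three digits after the point (the inputs are illustrative):

```powershell
# A literal decimal point followed by at least one and at most two digits.
$suffix = '^\.[0-9]{1,2}$'

[Regex]::IsMatch('.', $suffix)      # False (zero digits)
[Regex]::IsMatch('.5', $suffix)     # True  (one digit)
[Regex]::IsMatch('.50', $suffix)    # True  (two digits)
[Regex]::IsMatch('.500', $suffix)   # False (three digits)
```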

Now we can add the entire decimal suffix pattern, “\.[0-9]{1,2}”, to our existing number pattern, and test it:
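For example, using values from the Dec requirement (the variable name is illustrative):

```powershell
# Number pattern plus a mandatory decimal suffix.
$decRequired = '^-?[0-9]+\.[0-9]{1,2}$'

[Regex]::IsMatch('12345.5', $decRequired)     # True
[Regex]::IsMatch('-12345.50', $decRequired)   # True
[Regex]::IsMatch('12345', $decRequired)       # False
```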

Aha, we should still be accepting numbers with no decimal places, but we’re not. We know how to make a single character optional using the “?” modifier, but how can we do this to larger sub-patterns? The pleasantly obvious answer is to use parentheses to wrap the decimal suffix sub-pattern in “(” and “)”, and then apply the “?”.
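A sketch with the suffix grouped and made optional (sample inputs are illustrative apart from those quoted in the requirement):

```powershell
# The parentheses group the whole suffix so that ? applies to all of it.
$decOptional = '^-?[0-9]+(\.[0-9]{1,2})?$'

[Regex]::IsMatch('12345', $decOptional)       # True
[Regex]::IsMatch('12345.50', $decOptional)    # True
[Regex]::IsMatch('12345.', $decOptional)      # False
[Regex]::IsMatch('12345.500', $decOptional)   # False
```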

And before we leave this pattern, one more trick to make regular expressions more readable: we can replace “[0-9]” with “\d” (escape + d), which is pre-defined to mean “any digit”. Be aware that this is case-sensitive and “\D” means the opposite! Note also that in .NET “\d” matches any Unicode decimal digit, not just “0” to “9”, unless the ECMAScript option is set.
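The decimal pattern rewritten with the shorthand might look like this (variable names and sample inputs are illustrative):

```powershell
# Same decimal pattern, with \d in place of [0-9].
$dec = '^-?\d+(\.\d{1,2})?$'

[Regex]::IsMatch('-12345.50', $dec)   # True
[Regex]::IsMatch('12345', $dec)       # True
[Regex]::IsMatch('abc', $dec)         # False

# \D is the opposite of \d: any single character that is not a digit.
$nonDigit = '^\D$'

[Regex]::IsMatch('x', $nonDigit)      # True
[Regex]::IsMatch('5', $nonDigit)      # False
```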

Using “|” to implement a logical OR

We know how to use character classes, i.e. the “[” … “]” expressions, to accept alternative single characters, but the requirement for UK Bank Sort Codes requires us to accept input strings that fall into one of two different patterns. Let’s take another look at the requirement:

  • UK Bank Sort Code: Six digits, either xx-xx-xx or xxxxxx input format allowed.

Accepting either one of these on its own is straightforward (remembering that “-” is just a literal character outside character classes):
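A sketch of the two separate patterns. The “{6}” form, which means “exactly six”, is an assumption about how the unhyphenated pattern might be written; the sample sort codes are made up:

```powershell
# Hyphenated form: three pairs of digits separated by literal dashes.
$hyphenated = '^\d\d-\d\d-\d\d$'
# Plain form: exactly six digits.
$plain = '^\d{6}$'

[Regex]::IsMatch('12-34-56', $hyphenated)   # True
[Regex]::IsMatch('123456', $hyphenated)     # False
[Regex]::IsMatch('123456', $plain)          # True
[Regex]::IsMatch('12-34-56', $plain)        # False
```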

We can match one pattern or the other using the “|” (or) operator. We’re going to have to use parentheses too, as we’ll discover when we start testing.
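A first attempt without parentheses might look like the following. The order of the two alternatives and the sample inputs are assumptions made for illustration:

```powershell
# Naive alternation: no parentheses around the two alternatives.
$naiveSortCode = '^\d\d-\d\d-\d\d|\d{6}$'

[Regex]::IsMatch('12-34-56', $naiveSortCode)      # True
[Regex]::IsMatch('12-34-56-78', $naiveSortCode)   # True, although it is not a valid sort code
```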

What happened when we matched that second value? The “$” sign at the end of the pattern was intended to reject input with text following the sort code itself, but the “|” meant that it was only applied to the right-hand sub-pattern. (Try working out how to get a sort code with leading junk accepted by the pattern above.) We can fix this by using parentheses again:
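A sketch of the parenthesised version, with illustrative inputs:

```powershell
# The parentheses confine | to the two alternatives, so ^ and $ apply to both.
$sortCode = '^(\d\d-\d\d-\d\d|\d{6})$'

[Regex]::IsMatch('12-34-56', $sortCode)      # True
[Regex]::IsMatch('123456', $sortCode)        # True
[Regex]::IsMatch('12-34-56-78', $sortCode)   # False
[Regex]::IsMatch('12-3456', $sortCode)       # False
```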

Using “(?=” … “)” to implement a logical AND

You may have noticed that we have some unfinished business with the Decimal requirement, specifically that sentence “May be fixed length”. It’s clear from the examples that the fixed length refers to the number of digits, not the number of characters (which could include minus signs and decimal points).

We could adapt our existing decimal pattern, with its optional minus sign and decimal point, to restrict input to just seven digits, but this is inadvisable. It would be better to keep our existing pattern, which is relatively simple and well-tested, and apply a second regular expression to count the number of digits, each optionally preceded by a non-digit character.

Remembering that “\d” means “any digit” and “\D” means “any non-digit”, we can do this to restrict the input to, say, no more than seven digits:
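One possible reading of that description is the pattern below; it is a reconstruction for illustration, not necessarily the pattern originally used, and the lower bound of one digit is an assumption:

```powershell
# Between one and seven digits, each optionally preceded by one non-digit.
$maxSevenDigits = '^(\D?\d){1,7}$'

[Regex]::IsMatch('12345.50', $maxSevenDigits)    # True  (7 digits)
[Regex]::IsMatch('-12345.50', $maxSevenDigits)   # True  (7 digits)
[Regex]::IsMatch('123456.50', $maxSevenDigits)   # False (8 digits)
```

Note that this pattern only counts digits. On its own it would accept input such as 'a1b2', which is why it is used alongside the decimal pattern rather than instead of it.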

This is fine if we’re in a position to validate a single input with multiple regular expressions, but sometimes we’re going to need to do it all in one regex. This raises a problem: both of our expressions necessarily start at the beginning of the input string and work their way, character by character, to the end. If we are going to do “logical and” patterns as opposed to simply “and then” patterns, we need a way of applying multiple sub-patterns to the same input.

Fortunately .NET regular expressions support the obscurely named, but very powerful, “lookahead” feature which allows us to do just that. Using this feature we can, from our current position in the input string, test a pattern over the rest of the string (all the way to the end if necessary), then resume testing from where we were. A lookahead sub-pattern is wrapped in “(?=” … “)” and here’s how we can use it to implement the requirement “up to seven digits AND a valid decimal number” by combining our two existing patterns:
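A sketch of the combined pattern, assuming a digit-count sub-pattern of the form “(\D?\d){1,7}$” (a reconstruction, not necessarily the original) placed in a lookahead in front of the decimal pattern:

```powershell
# The lookahead checks the digit count over the whole input without consuming it;
# the decimal pattern then validates the same input from the start.
$dec7 = '^(?=(\D?\d){1,7}$)-?\d+(\.\d{1,2})?$'

[Regex]::IsMatch('-12345.50', $dec7)   # True
[Regex]::IsMatch('12345', $dec7)       # True
[Regex]::IsMatch('1234.567', $dec7)    # False (seven digits, but three decimals)
[Regex]::IsMatch('123456.50', $dec7)   # False (valid decimal, but eight digits)
```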

And this completes our implementation of the decimal requirement.

Using “(?!” … “)” to implement AND NOT

Our final input validation requirement was for address lines, to exclude any that used a PO Box instead of a real (residential) address.

As usual, let’s revisit the friendly spec:

  • House: Alphanumeric. Must not include the strings ‘PO Box’, ‘P.O. Box’, ‘P.O.Box’, ‘P.O Box’ or ‘POBox’ (any case)

Let’s first implement the rule that the string must be alphanumeric. This means that the string can contain alphabetic and numeric characters, spaces, dashes, full stops (period), commas or slashes. We can implement this rule quite easily, remembering that the space character is a literal, not a separator:
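A sketch of such a pattern; the exact character list follows the description above, and the sample addresses are made up:

```powershell
# Letters, digits, space, dash (escaped inside the class), full stop, comma and slash.
$alnum = '^[a-zA-Z0-9 \-.,/]+$'

[Regex]::IsMatch('12 High Street', $alnum)            # True
[Regex]::IsMatch('Flat 3/4, Rose-Hill Rd.', $alnum)   # True
[Regex]::IsMatch('12 High Street!', $alnum)           # False
```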

Now let’s write a pattern that will find any obvious variation of “P O Box” anywhere after the start of the input, which is where we test it. Remember from earlier that the space character is a literal, and that the “.” is a special character unless we escape it, “\.”.
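One way to write such a pattern, covering the five variants listed in the spec (the pattern is an illustrative reconstruction, and the sample addresses are made up):

```powershell
# Any text, then P, optional full stop, O, optional full stop, optional space, Box.
$poBox = '^.*P\.?O\.? ?Box'

# Each variant from the spec should be found.
$variants = 'PO Box', 'P.O. Box', 'P.O.Box', 'P.O Box', 'POBox'
foreach ($v in $variants) { [Regex]::IsMatch("Unit 4, $v 12", $poBox) }   # True, five times

[Regex]::IsMatch('12 High Street', $poBox)   # False
[Regex]::IsMatch('po box 12', $poBox)        # False (still case-sensitive)
```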

Next, we’ll reverse the result by asking for the pattern not to be found, and combine it with our alphanumeric pattern, both done using “(?!” … “)” notation:
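A sketch of the combination, assuming a PO Box sub-pattern of the form “P\.?O\.? ?Box” and an alphanumeric class as described earlier (both illustrative reconstructions):

```powershell
# The negative lookahead fails the match if a PO Box variant appears anywhere;
# otherwise the alphanumeric pattern validates the whole input.
$house = '^(?!.*P\.?O\.? ?Box)[a-zA-Z0-9 \-.,/]+$'

[Regex]::IsMatch('12 High Street', $house)        # True
[Regex]::IsMatch('PO Box 12', $house)             # False
[Regex]::IsMatch('Unit 4, P.O. Box 12', $house)   # False
[Regex]::IsMatch('po box 12', $house)             # True (case-sensitive, so not yet caught)
```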

Finally, we’ll make the PO Box rule case-insensitive. This can be done by setting a mode at the start of the expression that will apply to everything that follows it. We can specify “case insensitive mode” with the notation “(?i)”. Notice that since we’re going to be case-insensitive anyway, I’ve also simplified the alpha bit of the alphanumeric pattern.
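A sketch of the finished House pattern under the same assumptions as before (illustrative sub-patterns and made-up addresses):

```powershell
# (?i) switches on case-insensitive matching for the rest of the pattern,
# so [a-z] also accepts upper-case letters and "po box" is caught.
$houseAnyCase = '(?i)^(?!.*P\.?O\.? ?Box)[a-z0-9 \-.,/]+$'

[Regex]::IsMatch('12 HIGH Street', $houseAnyCase)   # True
[Regex]::IsMatch('po box 12', $houseAnyCase)        # False
[Regex]::IsMatch('P.o.box 12', $houseAnyCase)       # False
```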

Conclusion

Like any good tool, regular expressions can be used or abused. The purpose of this article is to help you write regular expressions that are fit for the purpose of validating inputs against typical business validation rules.

In order to do this we’ve covered writing straightforward patterns using literals, special characters and character classes, and applying them to the whole input using “^” … “$”. We’ve also seen how to combine simple patterns to implement logical OR, AND and NOT rules.

References

Using regular expressions in SQL Server 2005:
http://msdn.microsoft.com/msdnmag/issues/07/02/SQLRegex/default.aspx

Regular expression options in the .NET library:
http://msdn2.microsoft.com/en-us/library/yd1hzczs(VS.80).aspx

A concise summary of all special characters recognised by .NET regular expressions:
http://regexlib.com/CheatSheet.aspx