Introduction to Regular Expressions in Oracle

Regular expressions – RegEx or RegExp to their friends – were fashionably late in coming to the Oracle party. The seeds of what we know today as regular expressions were formulated in the period immediately following the Second World War in fields as diverse as formal language theory and neurophysiology; but it wasn’t until the PL/SQL Web toolkit arrived for… Continue Reading →

Regular expressions – RegEx or RegExp to their friends – were fashionably late in coming to the Oracle party. The seeds of what we know today as regular expressions were formulated in the period immediately following the Second World War in fields as diverse as formal language theory and neurophysiology; but it wasn’t until the PL/SQL Web toolkit arrived for 8i and 9i that they popped their heads over the Oracle parapet. By this time they were already rock stars in the worlds of Unix and Perl.

But I’m being rude here; I should introduce you. Regular expressions are a codified means to accomplish flexible pattern-matching in strings of text.

And, if you’re rereading that definition and thinking ‘huh?!’, that’s a feeling you might want to get used to, because, while regular expressions are powerful and definitely useful, they can seem very much like hieroglyphics to the uninitiated. But don’t worry, I’ll hold your hand.

String pattern matching has always been possible in Oracle sql and pl/sql, of course. With the LIKE condition and the simple metacharacters of “%” and “_”, character patterns could be described and integrated into queries. However, the moment you required a little complexity these basic wildcards cowered in fear. An example: my name is David, but friends tend to call me Dave when they have a beer or two in them. To match either spelling of my name, not using regular expressions, you would have to query the table for ‘Dav%’. But then your results might be flooded with Davidsons and Davises and Davinas. And I’m definitely not pretty enough to be called Davina!

Even more impossible in a regular-expression-less world is something that sounds like the simplest of tasks – coming up with a pattern to match an email address.
Oracle’s implementation of the regular expression functionality is wrapped up in the following functions: REGEXP_LIKE, REGEXP_INSTR, REGEXP_SUBSTR, REGEXP_REPLACE and REGEXP_COUNT, and they’re like the superhero alter egos of the more mundane LIKE, INSTR, SUBSTR, REPLACE and COUNT functions.


REGEXP_LIKE REGEXP_LIKE(<source_string>,<pattern>,
<match_pattern>)
REGEXP_INSTR REGEXP_INSTR(<source_string>, <pattern>[[,
<start_position>][, <occurrence>][, <return_option>][,
<match_parameter>][, <sub_expression>]])
REGEXP_SUBSTR REGEXP_SUBSTR(source_string, pattern
[, position [, occurrence
[, match_parameter]]])
REGEXP_REPLACE REGEXP_REPLACE(<source_string>, <pattern>,
<replace_string>, <position>, <occurrence>,
<match_parameter>)
REGEXP_COUNT REGEXP_COUNT(<source_string>, <pattern>[[,
<start_position>], [<match_parameter>]])

[The possible match parameters are as follows: ‘c’, which stands for ‘case sensitive’; ‘i’, which stands for case insensitive; ‘m’, which means treat the source string as multiple lines; ‘n’ which means allow a period (.) to match new line; and ‘x’, which means ignore whitespace.]
The true power of regular expressions, however, lies not in these functions but in the amazing versatility of the available metacharacters.


. Matches any single character
x? Matches ‘x’ 0 or 1 times
x* Matches ‘x’ 0 or more times
x+ Matches ‘x’ 1 or more times
x|y Matches either ‘x’ or ‘y’
x{n} Matches ‘x’ exactly n
times.
x{n,m} Matches ‘x’ at least n
times, but no more than m times
x{n,} Matches ‘x’ at least n
times
[xyz] Matches ‘x’,’y’ or ‘z’
[^xyz] Matches any single character other than ‘x’,’y’ and ‘z’
(xyz) Takes ‘xyz’ as a group, rather than individual characters
a-z Matches any character from ‘a’ to ‘z’

The list is much longer – this introductory article will not teach you all you need to know about regular expressions – and, as with every new skill, adeptness will only come with practice, through encountering real-world problems and surmounting them. Practically every string pattern can be matched using regular expressions.
An example: imagine you’re looking for a drink. However, you want to return every complexion of the word: drink, drank, drunk, drinking and drunkard. A regular expression is perfect for this, using the rules above:

SELECT col1
FROM table1
WHERE REGEXP_LIKE (col1,  ‘dr(ink(ing)?|(unk)|(ank)|(unkard))’,’i’);

The simple pattern above uses the following rules: it uses numerous pairs of parentheses to group characters, the ‘?’ to indicate that the grouping immediately before it – ‘(ing)’ – may appear 0 or 1 times, and the ‘|’ to indicate that the pattern should match any of the groupings. And the match parameter ‘i’ indicates that we want our pattern to be case insensitive.

However, before you run off and re-engineer your whole database and every application you have ever built, peppering them with regular expressions, it is worth mentioning that they are not the solution to every problem and they do have their drawbacks.

Firstly, there is their complexity. Believe me, no one is completely fluent in regular expressions; they will always be harder to read than simpler matching functions. So if LIKE, INSTR, SUBSTR and REPLACE meet your needs, you should consider sticking with them.

Also, be aware that like the functions they replace, Oracle’s regular expression functions can result in full-table scans that lead to horribly inefficient queries. Ordinary indexes cannot handle their awesomeness. Fortunately, it is possible to use regular expressions in function-based indexes, speeding up queries that use regular expressions – and even those that do not – considerably.

Oh, and by the way, the following regular expression will match those email addresses for you.

REGEXP_LIKE (email_address, '[a-zA-Z0-9._%-]+@[a-zA-Z0-9._%-]+\.[a-zA-Z]{2,4}');

And, how rude of me, I don’t think I properly introduced myself. My name is:

REGEXP_LIKE(first_name, 'Dav(e|id)');