In this article, I will cover a bit about the LIKE
operator, including how it works, and a bit of history about why it is like it is. After establishing this, I will discuss a bit about how you can (and should) use the LIKE
operator in your CHECK
constraints to strengthen your data integrity.
The LIKE Operator
The LIKE
operator has been in SQL since the very beginning. It’s a very weak form of pattern matching, that has only two simple wildcards. The wildcards are ‘_
‘ (underscore) and ‘%
‘ (percent). They respectively stand for any single character and for zero or more characters in a substring. The syntax is pretty straightforward:
1 2 |
<target string> [NOT] LIKE <pattern string> [ESCAPE < character>] |
Both parameters have to be expressions of type CHAR(n)
or VARCHAR(n)
. People who are new to SQL will sometimes try to match the pattern to a temporal or numeric data type. If you really must do this for some reason, please remember that you first need to CAST()
them to a character data type. This is a habit probably left over from COBOL or other languages that don’t have a rich set of data types.
People who are new to databases and SQL also forget that we have functions UPPER()
and LOWER()
to change the case of letters. You can use them in the INSERT INTO
statements guarantee that your data follows “Brent’s Rule” which states that you should “store data the way you use it and use it the way it’s been stored!”
If the <pattern string>
matches the <target string>
, the result of this expression is TRUE
, otherwise it’s FALSE
. But if either the <pattern string>
or <target string>
is NULL
, then the result of this expression is UNKNOWN
. Please remember that a NULL
is not the same as an empty string. An empty pattern and the empty target defaults to TRUE
. Trivially, every string has a minimum substring of length zero.
Another confusion is that the LIKE operator is not equality. In SQL, when you compare two strings for equality, you pad the shorter string with trailing blanks, so that 'abc'
is not LIKE ‘abc '
, but they are =
(equal).
Note:Not all programming languages do this! For example, the Xbase family truncates the longer string. Some languages are case insensitive, so ’abc’is equal to ‘ABC’ in them. This is one reason why ISO and ANSI standard encoding schemes specify a fixed length, and if letters are used, the case and position of those letters is also specified (usually uppercase). Another reason is so you know how to layout display screens and paper forms that could be messed up by varying length strings.
The percent sign as a wildcard works in a straightforward manner. It doesn’t usually occur in the middle of a word or text string; it’s got a clear and easy to see shape and it is not part of the shared punctuation marks used in all the alphabets that have Unicode support.
The use of underscore as a wildcard is unfortunate. It’s not used in the other regular expression tools. Back when we had impact printers and green bar paper (links provided if you were unaware of such a thing in the past!), it was easy to tell four underscores from five because the printer alignment was that bad. But then we got good printers and laser technology.
They all became a single line and it was hard to try and count them. To demonstrate my point, consider “_______” versus “____”; pretty hard to see a different difference, isn’t it? And pretty much impossible to count them without clicking a key over and over and saying 1, 2, 3, …
Add to the problem, the underscore character_is allowed in data element names in SQL and other programming languages. In fact, it’s recommended to separate multiple words in a single data element name in several ISO standards. That means we like writing things like "employee_name
“; after a few centuries of writing on lined paper, we expect Latin letters to fall on the baseline and it’s measurably easier for us to read. Other conventions like mixed case, hyphens, and CamelCase actually make your I jump, so you get errors and slow down the reading of the text.
We inherited these two tokens from IBM when we are working on the standard but I honestly do not know their history. If anyone can comment on it, I’d like to hear.
The optional ESCAPE
clause is seldom used, but when you needed., it is very handy. The character that follows it is given a special meaning that says whatever appears after the escape token in the target is to be taken as a literal. In practice, the most common characters used are \ and /, they are easy to see, and are not much used in other strings. This is how your search string can include actual percent and underscore symbols without treating them as meta-symbols.
Enforcing Constraints Versus Searching for a Pattern
As I have shown in the previous section, these two tokens actually do a pretty good job, if you are searching for a pattern in a character column or expression. You can accept false positives in many searches. For example, consider this simple search:
1 2 3 |
SELECT emp_first_name, emp_dept FROM Personnel WHERE emp_first_name LIKE 'Jo%'; |
It will return all the employees like 'Joe'
, 'John'
, and ‘Joan'
. But it will also return weird stuff like 'Jo Jo The Jungle King'
! Unless you happen to be a fan of Golden Age jungle comics, you probably didn’t want this particular fictional character as one of your employees.
You can use a LIKE
pattern in a check constraint to keep some data out. Remember there’s a difference between validation and verification; any five digits is a valid ZIP Code, but not all five digits numbers have been issued as ZIP Codes. you must verify your ZIP Codes by doing table lookup on a database from the post office.
Unfortunately, the original two meta characters are not very good when you try to validate a column. They are both weak and too general. To fix this, the SQL 2016 Standard borrowed something from grep(). The syntax is:
1 |
[<list of characters>] | ^[<list of characters>] |
Any one of the characters enclosed in square brackets is to be matched against a character in the target. It’s a generalization of the original underscore. The ^
is like the NOT
in other SQL constructs; it means that none of the characters in the target is in the list. Remember that a blank is a character, so your list may or may not want to include a blank. The usual conventions for all apply here too.
There is a shorthand for a list of characters to save you typing. 'a-z'
means all the lowercase Latin letters. 'A-Z'
is shorthand for all the uppercase letters. Finally, '0-9'
is the shorthand for the 10 basic digits. You can do some other things with the embedded-indicate a range but be careful because coalitions and Unicode character sets can affect things. (For example, in a typical SQL Server database, the DBA has likely created the database to be case insensitive by default. So ‘A-Z’ will include the lowercase as well.)
Again, write, clean portable code and do not try to use the fancy features in one release of one SQL product. Even when you have no plans to ever change platforms, some day those plans may be made for you and then you have to deal with the differences.
Consider the simple problem of making sure that a ZIP Code is exactly 5 digits.
1 2 3 4 5 6 |
CREATE TABLE Personnel (… emp_zip_code CHAR(5) CHAR(5) NOT NULL CONSTRAINT valid_zip CHECK (emp_zip_code LIKE ‘[0-9][0-9][0-9][0-9][0-9]’), ..); |
Put this simple CHECK()
constraint in your DDL, so you do not have to put it in the application code (or at least you can be 100% sure that the data is always 4 numeric values even if it is done in the UI!). You just protected your SQL today, the SQL you are writing in the future. Not a bad payoff for a few lines of typing!
There are a lot of places where this is useful. For example, if you want to make sure a string has at least one character in it: column_name like '_%'
will handle this need. Any time you can keep your data clean, the better.
A bit more history
In an outdated version of the SQL Standard, there was the keyword SIMILAR TO
, which introduced a pattern match in the sense of POSIX conventions. The reason that it showed up was that POSIX was under consideration as a federal standard. Basically, this version of pattern matching used “[:< keyword>:]” where the <keyword> could be the names of various common regular expressions such as DIGIT
, ALPHA
or ALPHANUMERIC
. This got around problems with character sets but was not so good for portability.
It can get even worse. Some implementations offer other pattern matching in this sense, but use arbitrary keywords like REGEXP
, REGEXP_MATCHES
, REGEXP_LIKE
or operators like '~'
to activate the feature. You could also activate XQuery semantics with the keyword LIKE_REGEX
to XQuery 1.0 / XPath 2.0 function fn:matches()
.
Editor note: SQL Server in Azure and presumable on prem will support REGEXP_LIKE and others in the future. More info here. PostgreSQL already does, and includes SIMILAR TO. More info here.
More complex searching
The SQL standard is complemented by ISO/IEC 13249 SQL Multimedia and Application Packages. Part 2 Full-Text defines routines and methods for information retrieval in text documents. It deals with objects like word, phrase, sentence, and paragraph. There are special constructs; to search for words or phrases, search words within a definable distance, thesaurus features like broader terms of a word, soundex
-similarity, ranking of results, and much more.
The central method of these features is CONTAINS
. Here is one example:
1 2 3 4 5 6 |
-- Retrieve rows with 'word_1' and 'word_2' -- (in this order) within a sequence of 10 words. SELECT * FROM Foobar WHERE CONTAINS(' "word_1" NEAR "word_2" WITHIN 10 WORDS IN ORDER ') = 1; |
Such features are far beyond regular SQL pattern matching. To perform efficiently, they will require special indexing as found in text bases. SQL is simply the wrong tool for this job! We never intended for SQL to be a text base or general document tool.
Summary
In this article, I have shared a lot about comparisons that are based on part of string values, but to bring it back to the main topic, it was always intended for strings to be checked and validated. Please don’t disappoint those of us that worked so hard on this and use LIKE
expressions as often as they are useful to make sure strings meet your desired needs.
Load comments