{"id":102729,"date":"2024-08-19T03:39:18","date_gmt":"2024-08-19T03:39:18","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=102729"},"modified":"2024-08-09T16:10:19","modified_gmt":"2024-08-09T16:10:19","slug":"like-constraints","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/theory-and-design\/like-constraints\/","title":{"rendered":"LIKE Constraints"},"content":{"rendered":"<p>In this article, I will cover a bit about the <code>LIKE<\/code> operator, including how it works, and a bit of history about why it is like it is. After establishing this, I will discuss a bit about how you can (and should) use the <code>LIKE<\/code> operator in your <code>CHECK<\/code> constraints to strengthen your data integrity.<\/p>\n<h2>The LIKE Operator<\/h2>\n<p>The <code>LIKE<\/code> operator has been in SQL since the very beginning. It\u2019s a very weak form of pattern matching, that has only two simple wildcards. The wildcards are &#8216;<code>_<\/code>&#8216; (underscore) and &#8216;<code>%<\/code>&#8216; (percent). They respectively stand for any single character and for zero or more characters in a substring. The syntax is pretty straightforward:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\"> &lt;target string&gt; [NOT] LIKE &lt;pattern string&gt; \n                             [ESCAPE &lt; character&gt;]<\/pre>\n<p>Both parameters have to be expressions of type <code>CHAR(n)<\/code> or <code>VARCHAR(n)<\/code>. People who are new to SQL will sometimes try to match the pattern to a temporal or numeric data type. If you really must do this for some reason, please remember that you first need to <code>CAST()<\/code> them to a character data type. This is a habit probably left over from COBOL or other languages that don\u2019t have a rich set of data types.<\/p>\n<p>People who are new to databases and SQL also forget that we have functions <code>UPPER()<\/code> and <code>LOWER()<\/code> to change the case of letters. You can use them in the <code>INSERT INTO<\/code> statements guarantee that your data follows \u201cBrent\u2019s Rule\u201d which states that you should \u201cstore data the way you use it and use it the way it\u2019s been stored!\u201d<\/p>\n<p>If the <code>&lt;pattern string&gt;<\/code> matches the <code>&lt;target string&gt;<\/code>, the result of this expression is <code>TRUE<\/code>, otherwise it\u2019s <code>FALSE<\/code>. But if either the <code>&lt;pattern string&gt;<\/code> or <code>&lt;target string&gt;<\/code> is <code>NULL<\/code>, then the result of this expression is <code>UNKNOWN<\/code>. Please remember that a <code>NULL<\/code> is <em>not<\/em> the same as an <em>empty string<\/em>. An empty pattern and the empty target defaults to <code>TRUE<\/code>. Trivially, every string has a minimum substring of length zero.<\/p>\n<p>Another confusion is that the LIKE operator is not equality. In SQL, when you compare two strings for equality, you pad the shorter string with trailing blanks, so that <code>'abc'<\/code> is not LIKE &#8216;<code>abc   '<\/code>, but they are <code>=<\/code> (equal).<\/p>\n<p><em>Note:Not all programming languages do this! For example, the Xbase family truncates the longer string. Some languages are case insensitive, so \u2019abc\u2019is equal to \u2018ABC\u2019 in them. This is one reason why ISO and ANSI standard encoding schemes specify a fixed length, and if letters are used, the case and position of those letters is also specified (usually uppercase). Another reason is so you know how to layout display screens and paper forms that could be messed up by varying length strings.<\/em><\/p>\n<p>The percent sign as a wildcard works in a straightforward manner. It doesn\u2019t usually occur in the middle of a word or text string; it\u2019s got a clear and easy to see shape and it is not part of the shared punctuation marks used in all the alphabets that have Unicode support.<\/p>\n<p>The use of underscore as a wildcard is unfortunate. It\u2019s not used in the other regular expression tools. Back when we had <a href=\"https:\/\/www.techopedia.com\/definition\/3602\/impact-printer\">impact printers<\/a> and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Line_printer\">green bar paper<\/a> (links provided if you were unaware of such a thing in the past!), it was easy to tell four underscores from five because the printer alignment was that bad. But then we got good printers and laser technology.<\/p>\n<p>They all became a single line and it was hard to try and count them. To demonstrate my point, consider \u201c_______\u201d versus \u201c____\u201d; pretty hard to see a different difference, isn\u2019t it? And pretty much impossible to count them without clicking a key over and over and saying 1, 2, 3, \u2026<\/p>\n<p>Add to the problem, the underscore character_is allowed in data element names in SQL and other programming languages. In fact, it\u2019s recommended to separate multiple words in a single data element name in several ISO standards. That means we like writing things like <code>\"employee_name<\/code>&#8220;; after a few centuries of writing on lined paper, we expect Latin letters to fall on the baseline and it\u2019s measurably easier for us to read. Other conventions like mixed case, hyphens, and CamelCase actually make your I jump, so you get errors and slow down the reading of the text.<\/p>\n<p>We inherited these two tokens from IBM when we are working on the standard but I honestly do not know their history. If anyone can comment on it, I\u2019d like to hear.<\/p>\n<p>The optional <code>ESCAPE<\/code> clause is seldom used, but when you needed., it is very handy. The character that follows it is given a special meaning that says whatever appears after the escape token in the target is to be taken as a literal. In practice, the most common characters used are \\ and \/, they are easy to see, and are not much used in other strings. This is how your search string can include actual percent and underscore symbols without treating them as meta-symbols.<\/p>\n<h2>Enforcing Constraints Versus Searching for a Pattern<\/h2>\n<p>As I have shown in the previous section, these two tokens actually do a pretty good job, if you are searching for a pattern in a character column or expression. You can accept false positives in many searches. For example, consider this simple search:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT emp_first_name, emp_dept\nFROM Personnel\nWHERE emp_first_name LIKE 'Jo%';<\/pre>\n<p>It will return all the employees like <code>'Joe'<\/code>, <code>'John'<\/code>, and <code>\u2018Joan'<\/code> . But it will also return weird stuff like <code>'Jo Jo The Jungle King'<\/code>! Unless you happen to be a fan of Golden Age jungle comics, you probably didn\u2019t want this particular fictional character as one of your employees.<\/p>\n<p>You can use a <code>LIKE<\/code> pattern in a check constraint to keep some data out. Remember there\u2019s a difference between validation and verification; any five digits is a valid ZIP Code, but not all five digits numbers have been issued as ZIP Codes. you must verify your ZIP Codes by doing table lookup on a database from the post office.<\/p>\n<p>Unfortunately, the original two meta characters are not very good when you try to validate a column. They are both weak and too general. To fix this, the SQL 2016 Standard borrowed something from grep(). The syntax is:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">[&lt;list of characters&gt;]  |  ^[&lt;list of characters&gt;]<\/pre>\n<p>Any one of the characters enclosed in square brackets is to be matched against a character in the target. It\u2019s a generalization of the original underscore. The <code>^<\/code> is like the <code>NOT<\/code> in other SQL constructs; it means that none of the characters in the target is in the list. Remember that a blank is a character, so your list may or may not want to include a blank. The usual conventions for all apply here too.<\/p>\n<p>There is a shorthand for a list of characters to save you typing. <a id=\"post-102729-_Hlk169537231\"><\/a><code>'a-z'<\/code> means all the lowercase Latin letters. <code>'A-Z'<\/code> is shorthand for all the uppercase letters. Finally, <code>'0-9'<\/code> is the shorthand for the 10 basic digits. You can do some other things with the embedded-indicate a range but be careful because coalitions and Unicode character sets can affect things. (For example, in a typical SQL Server database, the DBA has likely created the database to be case insensitive by default. So &#8216;A-Z&#8217; will include the lowercase as well.)<\/p>\n<p>Again, write, clean portable code and do not try to use the fancy features in one release of one SQL product. Even when you have no plans to ever change platforms, some day those plans may be made for you and then you have to deal with the differences.<\/p>\n<p>Consider the simple problem of making sure that a ZIP Code is exactly 5 digits.<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">CREATE TABLE Personnel\n(\u2026\n emp_zip_code CHAR(5) CHAR(5) NOT NULL\n CONSTRAINT valid_zip\n    CHECK (emp_zip_code LIKE \u2018[0-9][0-9][0-9][0-9][0-9]\u2019),\n ..);<\/pre>\n<p>Put this simple <code>CHECK()<\/code> constraint in your DDL, so you do not have to put it in the application code (or at least you can be 100% sure that the data is always 4 numeric values even if it is done in the UI!). You just protected your SQL today, the SQL you are writing in the future. Not a bad payoff for a few lines of typing!<\/p>\n<p>There are a lot of places where this is useful. For example, if you want to make sure a string has at least one character in it: <code>column_name like '_%'<\/code> will handle this need. Any time you can keep your data clean, the better.<\/p>\n<h2>A bit more history<\/h2>\n<p>In an outdated version of the SQL Standard, there was the keyword <code>SIMILAR TO<\/code>, which introduced a pattern match in the sense of POSIX conventions. The reason that it showed up was that POSIX was under consideration as a federal standard. Basically, this version of pattern matching used \u201c[:&lt; keyword&gt;:]\u201d where the &lt;keyword&gt; could be the names of various common regular expressions such as <code>DIGIT<\/code>, <code>ALPHA<\/code> or <code>ALPHANUMERIC<\/code>. This got around problems with character sets but was not so good for portability.<\/p>\n<p>It can get even worse. Some implementations offer other pattern matching in this sense, but use arbitrary keywords like <code>REGEXP<\/code>, <code>REGEXP_MATCHES<\/code>, <code>REGEXP_LIKE<\/code> or operators like <code>'~'<\/code> to activate the feature. You could also activate XQuery semantics with the keyword <code>LIKE_REGEX<\/code> to XQuery 1.0 \/ XPath 2.0 function <code>fn:matches()<\/code>.<\/p>\n<p><em>Editor note: SQL Server in Azure and presumable on prem will support REGEXP_LIKE and others in the future. More info <\/em><a href=\"https:\/\/devblogs.microsoft.com\/azure-sql\/introducing-regular-expression-regex-support-in-azure-sql-db\/\"><em>here<\/em><\/a><em>. PostgreSQL already does, and includes SIMILAR TO. More info <\/em><a href=\"https:\/\/www.postgresql.org\/docs\/current\/functions-matching.html\"><em>here<\/em><\/a><em>.<\/em><\/p>\n<h2>More complex searching<\/h2>\n<p>The SQL standard is complemented by ISO\/IEC 13249 SQL Multimedia and Application Packages. Part 2 Full-Text defines routines and methods for information retrieval in text documents. It deals with objects like word, phrase, sentence, and paragraph. There are special constructs; to search for words or phrases, search words within a definable distance, thesaurus features like broader terms of a word, <code>soundex<\/code>-similarity, ranking of results, and much more.<\/p>\n<p>The central method of these features is <code>CONTAINS<\/code>. Here is one example:<code><\/code><\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">-- Retrieve rows with 'word_1' and 'word_2' \n-- (in this order) within a sequence of 10 words.\nSELECT *\nFROM Foobar\nWHERE CONTAINS(' \"word_1\" NEAR \"word_2\"\nWITHIN 10 WORDS IN ORDER ') = 1;<\/pre>\n<p>Such features are far beyond regular SQL pattern matching. To perform efficiently, they will require special indexing as found in text bases. SQL is simply the wrong tool for this job! We never intended for SQL to be a text base or general document tool.<\/p>\n<h2>Summary<\/h2>\n<p>In this article, I have shared a lot about comparisons that are based on part of string values, but to bring it back to the main topic, it was always intended for strings to be checked and validated. Please don\u2019t disappoint those of us that worked so hard on this and use <code>LIKE<\/code> expressions as often as they are useful to make sure strings meet your desired needs.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this article, I will cover a bit about the LIKE operator, including how it works, and a bit of history about why it is like it is. After establishing this, I will discuss a bit about how you can (and should) use the LIKE operator in your CHECK constraints to strengthen your data integrity&#8230;.&hellip;<\/p>\n","protected":false},"author":96214,"featured_media":103458,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[53,143539],"tags":[159072],"coauthors":[6781],"class_list":["post-102729","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-featured","category-theory-and-design","tag-database-theory"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/102729","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/96214"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=102729"}],"version-history":[{"count":3,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/102729\/revisions"}],"predecessor-version":[{"id":103459,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/102729\/revisions\/103459"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media\/103458"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=102729"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=102729"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=102729"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=102729"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}