29 March 2024

1816 views

7 0

SQL Logic

In this article, I will discuss the history and thinking behind several types of logic that are typically associated with writing relational database code. Because of how relational databases implement NULLs, it definitely can make some processes more complex than expected by may programmers (inexperienced and experienced alike!)

Three Valued Logic

Missing or UNKNOWN values have been a problem in data modeling and statistics long before there was SQL. One of the best solutions has been to create a category for “miscellaneous”, “UNKNOWN”, “N/A (Not applicable), “other”, “TBD (to be determined)” and so forth. When Dr. Codd was designing the relational model, he decided on using the marker NULL for all the possible ways that something could be UNKNOWN or missing. Later, however, he decided to add markers for “values that are missing because the attribute is missing.” and “values that are missing because of our ignorance of an existing attribute”, a distinction that was not implemented in ANSI/ISO Standard SQL.

Other programming languages are usually based on a simple Boolean algebra. Logical values are TRUE and FALSE, and the three operators are AND, OR and NOT. Unfortunately, SQL is not quite that simple because of the intricacies of representing the data that user needs to store. We had a three valued logic (3VL) that adds the logical value UNKNOWN and slightly different rules.

As Dave McGoveran (one of the pioneers of relational databases) pointed out, technically this is not a logic system. To be a logic system, we would need inference rules. This is perhaps something only a math major really cares about, but we have what are called a <search condition> in the BNF that defines Standard SQL. While mathematicians may worry about this sort of thing, programmers don’t have to. SQL is going to give you enough problems. The purpose of logical operators is to test the logical value of a condition and return a logical value that can be TRUE, FALSE, or UNKNOWN.

An inference rule in two valued logic uses the symbol ⇒, which is usually read as “<left expression> Implies <right expression>” or as “if A is TRUE, then B is TRUE.” In English, or perhaps I should say, in more traditional English, “a TRUE premise cannot imply a FALSE conclusion” when discussing logic. The transformation in two valued, Boolean logic called the Smisteru Rule says that (A ⇒ B) Is equivalent to (¬A ∧ B). Unfortunately, a direct translation from two valued logic to SQL three valued logic does not work. Remember that in the DDL treats UNKNOWN and FALSE similar, while the DML treats UNKNOWN and TRUE as being similar.

The UNKNOWN value is the result of using NULLs in expressions and in some relational operations, such as OUTER JOIN. It can be important to remember that UNKNOWN (logical value) and NULL (property of an attribute) are not the same thing, but they look a lot alike when you are coding.

Another quote from Dave McGoveran is that a computer language committee never met a feature that it didn’t like. This is also all too TRUE. The number of features in a language tend to increase, not shrink.

Here is a quick table of SQL s basic three valued logic operators.

Op1	Op2	Op1 AND Op2	Op1 OR Op2
TRUE	TRUE	TRUE	TRUE
TRUE	FALSE	FALSE	TRUE
TRUE	UNKNOWN	UNKNOWN	TRUE
FALSE	TRUE	FALSE	TRUE
FALSE	FALSE	FALSE	FALSE
FALSE	UNKNOWN	FALSE	UNKNOWN
UNKNOWN	TRUE	UNKNOWN	TRUE
UNKNOWN	FALSE	FALSE	UNKNOWN
UNKNOWN	UNKNOWN	UNKNOWN	UNKNOWN

AND and OR operators are symmetric: you can swap Op1 and Op2 without changing the operator outcome.

Op1	NOT Op1
TRUE	FALSE
FALSE	TRUE
UNKNOWN	UNKNOWN

The final classic 2VL operator is negation or NOT<expression>. In 2VL, it switches TRUE to FALSE and FALSE to TRUE; no surprises there. However, it doesn’t change UNKNOWN in 3VL. We didn’t find out if the expression was TRUE or FALSE, so it stays UNKNOWN.

There are some other urinary operators and ANSI/ISO Standard SQL might not want to use them often, but it’s good to know they exist:

<search condition> IS [NOT] {TRUE | FALSE | UNKNOWN}

This is a test for logical value. It is not the same as the more familiar <search condition> IS [NOT] NULL that we’ve had in SQL since the beginning. Starting to appreciate how messy 3VL can be?

Generalized Conjunction and Disjunction

Conjunction and Disjunction are mathematician’s words for what a programmer would call ANDs and Ors. Most programming languages define them as a binary operation, but we can generalize them to a more set oriented parameter in SQL.

Before modern logic and the use of truth values, classic Aristotelian syllogisms used the forms “All X are Y”, “Some X are Y”, and “No X are Y”; find a good write up this in any non-mathematical philosophy book. We have a version of them in SQL using what we call quantifiers. The syntax for the existential quantifier is pretty straightforward;

1	<value expression> [SOME \| ANY] <comparison operator> <table expression>

The keywords SOME and ANY are synonyms, but you might want to pick the one that makes the most sense is an English sentence you write your code. This is essentially a shorthand for a list of OR-ed predicates built from the expansion of of code

(<value expression> <comparison operator> <row-1>

OR <value expression> <comparison operator> <row-2>

...

OR <value expression> <comparison operator> <row-n>)

You don’t think about it, but if you started with Pascal as your first programming language, you might remember the <exp> IN (expression list>) predicate. Yes, this is where SQL stole it from, but we define it as <value expression> = ANY VALUES (exp-1,exp-2,..exp-n>). This is the general disjunction.

Since this is really shorthand for OR-ed <search condition>s, a single TRUE result will make the whole predicate TRUE. But more than that, if the expression list is all NULLs or empty, then you can’t get a TRUE or FALSE result?

Likewise, we can build a generalized conjunction by just changing the OR to AND in the expansion of the code. This is the SQL version of the universal quantifier.

(<value expression> <comparison operator> <row-1>

AND <value expression> <comparison operator> <row-2>

...

AND <value expression> <comparison operator> <row-n>)

Everyone uses the IN() predicate, but the ANY and ALL are very handy as well. They can make your SQL more English like an easier to read. Since they are a single construct, your SQL product may have optimizations for it.

BETWEEN

The original version of the BETWEEN search condition was a very simple shorthand; <value-1> BETWEEN <value-2> AND <value-3>

This expression is equivalent to the search condition:

<value1 >= <value-2> AND <value-1> <= <value-3>

Like most SQL operators, you also had a negation of the BETWEEN predicate:

<value-1> [NOT] BETWEEN <value-2> AND <value-3>

That is equivalent to the search condition:

NOT(<value->1 BETWEEN <value-2> AND <value-3>)

This concept has some interesting history. Between-ness was left out of Euclidean geometry. It later shows up we get to the calculus and must define convergence series and limits. It’s also one of the few comparison operators that has three parameters.

This operator has an ordering from left to right and we know (<value-1> <= <value-3>). The ISO model temporal value is based on half open intervals. That means we know where the interval starts, but the end point is not included in the end range. This is just like how one define a day. It starts at midnight of one day (0:00) and ends on midnight of the next. The starting time of 0:00 is included in the interval, but the ending point time is defined as 0:00, but it actually is 23:59 59:59. The end time is not in the interval to make it easier to specify in a manner that is round off proof. (for a more in depth discussion, see the “ISO Half Open Interval Model section in my Modeling Time article.)

The advantage of half-open intervals is that they can be concatenated, subsseted and defined in calculus as limits.

Decades ago, when the SQL standards were being worked on in ANSI X3H2, we defined what is now the SYMMETRIC BETWEEN predicate one meeting. Microsoft even put it in their Access database product. At the next meeting, the committee backed out this change. Currently, however, we reverse ourselves again and use a new optional keyword, SYMMETRIC. This is implemented in PostgreSQL, but not currently in SQL Server.

<value-1>[ BETWEEN SYMMETRIC<value-2> AND <value-3>

as a shorthand for:

((<value-1>BETWEEN <value-2> AND <value-3>)
OR <value-1> BETWEEN <value-3> AND <value-2>))

Notice that we still don’t have the support for open intervals with this operator.

[NOT] EXIST <subquery>

The simple EXISTS() also, as an interesting history. Technically, it takes a subquery. This means were supposed to have a SELECT FROM statement parameter. But the SELECT clause is redundant.

The reason that it’s in syntax is that originally the first attempts at SQL would expand the subquery out completely to test for the existence qualified data. But if you think about it, the minute you get just one row back, even if it includes NULLs (yes, NULLs exists, even if they don’t have any value), this operator will return TRUE in fact, it’s one of the few logical operators in SQL that cannot return an UNKNOWN.

Pattern Matching

The syntax for the original pattern matching predicate is < match value> [NOT] LIKE <pattern expression> [ESCAPE <escape character>]. SQL got this version of a general regular expression parser from IBM. Since we never intended the language to be a text handling tool, we picked a simple basic version of grep (general regular expression parser) from UNIX. Later, SQL added the SIMILAR TO regular expression tool.

There are several wildcards that can be a part of the pattern expression:

The percent sign % represents zero, one, or multiple pattern characters.
The underscore sign _ represents one, single character. This was really a bad choice are part because a string of underscores is hard to count. Modern word processors and printers tend to make them all into one single line with the resolution that we have had since we stopped using old unit record line printer equipment.
Square brackets [ and ] allow you to specify a range of characters like ‘[0-9]’. So, any single character that appear in the character set you are using that are between 0 and 9 would match. In this case, any of 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9.
The ESCAPE clause declares a pattern character with the property that what follows it is to be treated as a literal, not as a pattern. This lets you put underscores and percent signs in strings. This feature does not get used a lot.
As you would expect, a NULL in the predicate always returns an UNKNOWN result. However, if both pattern and match values are strings of length zero, then the result is TRUE.

The LIKE operator is not used as much as it should be. You can put it in a CHECK() constraint in the DDL and save yourself the trouble of writing code to do a quick validation every time you touch a particular column. My favorite is simply having CHECK (foobar_zipcode LIKE ‘[0-9][0-9][0-9][0-9][0-9]’) on US mailing addresses because it’s cheap and easy.

The SQL–99 standard added the SIMILAR TO operator, this is a form of grep() based on POSIX, as found in ISO/IEC 9945 standards. The US government was interested in these standards at the time, so we added it to SQL for portability.

Pattern symbols are little more extensive in this regular expression tool. The following are some additional patterns that are part of the standards but not necessarily completely implemented in the RDBMS you are using.

| Means alternation (batch to either of the two alternatives separated by the bar)
*means repetition of the previous character zero or more times
+ means repetition of the previous character one or more times.
Parentheses can be used to group substrings into a single unit.

There’s also a set of abbreviations that look like this.

[: ALPHA:] match any alphabetic character.
[: UPPER] managing the uppercase alphabetic character.
[: LOWER:] match any lowercase alphabetic character.
[: DIGIT:] match any digit.
[:ALNUM:] match any alpha or numeric character.

Vendors will have even more extensions in their products, so you can really write pretty elaborate pattern matches, which frankly is usually a bad idea. SQL was never meant to be a string handler. It’s probably better if you do this in the input tier of your system, using the tool designed for this. It’s also better if you keep the patterns as simple as possible. Think about the five digits of the US ZIP Code system and then compare it to the elaborate pattern recognition that the British Postal Code uses. The British system was based on abbreviations for post offices that no longer exist, so it’s a rather cryptic long string can be expressed as a monstrous, long regular expression. The Royal Mail is gradually abandoning the old system in favor of an all-numeric, fixed length system like ZIP Codes for bulk mail users.

Conclusion

Originally, SQL evolved from an IBM product called Sequel. The name stood for “Structured English–like Query Language” which defines the goal the language. Procedural languages tend to have a relatively small set primitive operators. When you want to do something, you code a procedure from them.

However, there is a tendency in declarative languages to create another language feature and hide procedural part inside them. The idea is to abstract operations to a higher level and let the machine do the heavy lifting. For example, if I need to sort a file in a procedural language, I must pick my algorithm. In a higher-level declarative language, I simply tell the language to SORT (< file name>) and the software picks the algorithm, based on an analysis of the file.

In SQL, we want to use the logic operations to define what we want and not be concerned with the how we get it. We have a pretty good toolbox for doing this too!

SQL Logic

Three Valued Logic

Generalized Conjunction and Disjunction

BETWEEN

[NOT] EXIST <subquery>

Pattern Matching

Conclusion

Subscribe for more articles

Rate this article

Joe Celko

Related articles

Department of Redundancy Department

Database Subsetting and Data Extraction

Tags