Nesting Levels in SQL

The 'Structured' part of SQL denotes the fact that queries can be nested inside each other in such a way that, wherever you can use a table, you can use a table expression. Such derived tables can provide powerful magic, to which is added CTEs and Lateral Tables. Joe Celko explains.

The ancestor of SQL was an IBM project named Sequel, which was a shorthand for “Structured English Query Language”; the term “Structured” in the name came from two sources. The first was the “Structured Revolution” at the time. Dijkstra, Youron, DeMacro and the other early pioneers of modern Software Engineering had just given us “Structured Programming” and “Structured Analysis” so everything had to be Structured. Younger people will remember when everything had to be “Object Oriented”, then “Big Data” then “in the Cloud” and whatever the current fad is at this reading.

And for the record, the ANSI/ISO Standard language is called “Ess Que El” and not “SEQUEL” in spite of the fact that we all screw up. The informal rules in the Standards world is that an ISO Standard made up of letters is spelled out, but a US Federal standard is pronounced as a word, no matter how weird. his is why a Physician’s Standards Review Organization is called a “Piss Row” in Medicate/Medicaid terminology. Another rule is that the French will fight to keep the initials of a standard in French order, not English. Did you know that “ISO” is actually named “International Organization for Standardization“?

The second source of the “structured” term was the ability to nest queries inside each other via joins, set operations and other table level operations. This is actually a mathematical property called Orthogonality. In English, it is why operations done on numbers produce numbers, so you can use parentheses to write mathematical expressions. Any place in which you can use a number, you can use a numeric expression. Any place in which you can use a table, you can use a table expression.

In SQL there is a hierarchy of data in which the outer level is the schema or database. The database is made up of tables (which can be base or virtual tables). Each table is made up of a set of rows. These rows have no ordering, but all have the same structure, so it is proper set. Each row is made up of columns. The columns are scalar values drawn from a domain; a domain is set of values of one data type that has rules of its own. These rules include the type of scale upon which it is built and reasonable operations with it.

I tell people to use set diagrams (improperly called Venn Diagram; Euler invented them) when they want to use a doodle to help them think of a query. But most programmers doodle flow diagrams because they grew up with flowcharts, DFDs and similar mind tools. Let me show you what I mean.

Derived Tables

A derived table is a table expression embedded in a containing statement. It has to be placed inside parentheses and it can optionally be given a correlation name and its columns can also optionally given names.

The derived table will act as if it is materialized during the duration of the statement that uses it. Notice the phrase “act as if” in that last sentence. The optimizer is free to re-arrange the statement anyway that it wished so long as the results are the same as the original statement. If there is no (<derived column list>), then the derived table exposes the tables and their columns that created the derived table.

Once the query expression becomes a derived table with a correlation name, the internal structure is hidden. If no (<derived column list>) is given, then it inherits the column names from the component tables. Watch out; if the expression has two or more tables that have common column name, you have to to use <table name>.<column name> syntax to avoid ambiguity.

Materialization is not an easy choice. If one statement is using a derived table, it might be better to integrate it into that statement like a text macro. But if many statements are using the same derived table, it might be better to materialize it once, put it in primary or secondary storage and share it. This is the same decisions the SQL engine had to make with VIEWs. But the derived tables are not in the schema level where the optimizer can find them and keep statistics about them. It takes a pretty smart optimizer to filter them out for materialization.

This is why it is better to put a derived table definition into a VIEW when it is re-used often.

Column Naming Rules

Derived tables should follow the same ISO-11179 Standard naming rules. A table is a table. The keyword “AS” is not required, but it is a good programming practice and so is naming the columns. If you do not provide names, then the SQL engine will attempt to do it for you. The table name will not be accessible to you since it will be a temporary internal reference in the schema information table. The SQL engine will use scoping rules to qualify the references in the statement — and what you said might not be what you meant. Likewise, columns in a derived table inherit their names from the defining table expression. But only if the defining table expression creates such names. For example, the columns in a UNION, EXCEPT or INTERSECT statement have no names unless you use the AS clause.

When you have multiple copies of the same table expression in a statement, you need to tell them apart with different correlation names. For example, given a table of sports players, we want to show a team captain and team co-captain.

1930-img9B.jpg

I have found that using a short abbreviation and a sequence of integers for correlation names works very well. This also illustrates another naming rule. The player’s last name is used in two different roles in this query, so we need to rename the column to the role name (if it stands by itself without qualification) or use the role name as a prefix (i.e. use “boss_emp_id” and “worker_emp_id” to qualify each employee’s role in this table).

Scoping Rules

A derived table can be complete in itself and without a scoping problem at all. For example, consider this query:

1930-img98.jpg

The derived table “B” has no outer references and it can be retrieved immediately while another parallel processor works on the rest of the query. Another form of this kind of derived table is a simple scalar subquery:

The scalar subquery is computed; the one row, one column result table is converted into a unique scalar value and the WHERE clause is tested. If the scalar subquery returns an empty result set, it is converted into a NULL. Watch out for that last case, since NULLs have a data type in SQL and in some weird situations you can get casting errors.

When a table expression references correlation names in which they are contained, you have to be careful. The rules are not that much different from any block structured programming language. You work your way from the inside out.

This is easier to explain with an example. We have a table of pilots and the planes they can fly (dividend); we have a table of planes in the hangar (divisor); we want the names of the pilots who can fly every plane (quotient) in the hangar. This is Chris Date’s version of Relational Division; the idea is that a divisor table is used to partition a dividend table and produce a quotient or results table. The quotient table is made up of those values of one column for which a second column had all of the values in the divisor.

1930-wp8t5nbz.jpg  

The quickest way to explain what is happening in this query is to imagine a World War II movie where a cocky pilot has just walked into the hangar, looked over the fleet, and announced, “There ain’t no plane in this hangar that I can’t fly!”, which is bad English but good logic.

Notice that PilotSkills appears twice in the query, as PS1 and as PS2. Go to the innermost "SELECT.. FROM..” construct. We have a local copy of PilotSkills as PS2 and outer references to tables H and PS1. We find that H is a copy of the Hangar table one level above us. We find that PS1 is a copy of the PilotSkills table two levels above us.

If we had written “WHERE pilot_name = PS2.pilot_name” in the innermost SELECT, the scoping rules would have looked for a local reference first and found it. The search condition would be the equivalent of “WHERE PS2.pilot_name = PS2.pilot_name“, which is always TRUE since we cannot have a NULL pilot name. Oops, not what we meant!

It is a good idea to always qualify the column references with a correlation name. Hangar did not actually need a correlation name since it appears only once in the statement. But do it anyway. It makes the code a little easier to understand for the people that have to maintain it — consistent style is always good. It protects your code from changes in the tables. Imagine several levels of nesting in which an intermediate table gets a column that had previously been an outer reference.

Exposed Table Names

The nesting in SQL has the concept of an “exposed name” within a level. An exposed name is a correlation name, a table name that is not followed by a correlation name, or a view name that is not followed by a correlation name. The exposed names must be unique.

Here are some examples to demonstrate scoping rules.

1930-img8A.jpg

 Tables A and B can be referenced in the outer WHERE clause. These are both exposed names.

1930-img8B.jpg

But only Table X can be referenced in the outer WHERE clause. The correlation name X is now an exposed name.

1930-imgA3.jpg

Table C is not exposed to any other SELECT statement.

Common Table Expressions (CTE)

ANSI/ISO Standard SQL-99 added the Common Table Expression or CTE. It is also a query expression that is given a name, just like a derived table. The difference is that they appear before the SELECT statement in which they are used.

They are just a very useful bit of syntactic sugar. The only thing to remember is that each CTE is exposed in the order they appear in the WITH clause. That means the n-th CTE in the list can reference the first thru (n-1)-th CTEs. 

LATERAL Tables

If you have worked with blocked structured procedural languages, you will understand the concept of a “Forward Reference” in many of them. The idea is that you cannot use something before it is created in the module unless you signal the compiler. The most common example is a set of co-routines in which Routine A calls Routine B, then Routine B calls Routine A and so forth. If Routine A is declared first, then calls to B have to have an additional declaration that tells the compiler Routine B will be declared later in the module.

=”font-size:12.0pt;color:black”>

=”font-size:12.0pt;color:black”>=”font-size:12.0pt;color:black”>

Notice that the Departments table appears as D1 and D2 at two levels — D1 is at level one and D2 is a level three.

The following example is not valid because the reference to D.dept_nbr in the WHERE clause of the nested table expression references the Personnel table via P.dept_nbr that is in the same FROM clause.

To make the query valid, we need to add a LATERAL clause in front of the subquery. Notice the order of Personnel and Departments. This exposes the department number so that a join can be done with it in the derived table labeled E.

 1930-img9C.jpg

+

=”mso-bidi-“>

Let me finish with some heuristics.

  1. Use aliases that tell the guy maintaining this code what tables are referenced. Do not simply use  A,B,C, .. as names. This comes from the 1950’s when files were referenced by the name of the tape drive that held them.
  2. Petty print” the SQL so that the indentation tells the reader what the query structure is. This is a simple task for any of many tools.
  3. Never use “SELECT *” in production code; it will eventually bite you when a table changes. You have a text editor and it is easy to cut & paste a list of column names.
  4. When you have three or more levels of nesting, think about a CTE. The code will be more readable.
  5. When you have two or more references to a query expression think about a CTE. The code will be more readable, but in many SQLs it will also help the optimizer make a decision about materialization (T-SQL is still behind on this).

=”mso-bidi-“>