The post When an update doesn’t update appeared first on Simple Talk.
I changed data from my application, but when I checked the database, I couldn’t see the change!
I’ve seen this. Loads. It can be quite perplexing for folks because they expect to see an error message if the insert, update, or delete failed. I put this post together to provide some things you can investigate if this happens to you – you are sure that you updated the data, but when you check using SQL Server Management Studio (SSMS), your change isn’t there. For the remainder of the post, I’m going to use the word “update” to mean any change to the data, even MERGE
{shudder}.
The causes behind this issue can usually be lumped into three categories:
Most commonly, the problem is that the app is updating a table in a different location than the one you’re checking. Some examples of things you should verify:

- LOCALHOST and LOCALHOST\SQLEXPRESS are not the same instance, and if the app is running on a different machine or a different domain, it may not be safe to trust that DNS resolves identically.
- Watch out for features like AttachDbFileName, since even if all the other connection string attributes are the same, you are definitely looking at two independent copies of the database.
- If you are validating a large string, check with DATALENGTH that all of the data is there, or with RIGHT to verify that the end of the string is intact. See Validate the contents of large dynamic SQL strings for more information, and have a generous look at the maximum characters settings in Tools > Options > Query Results > SQL Server > Results to Grid (max 64K) and Results to Text (max 8K).

…but, sometimes, it’s caching. Check if any of the following could be true:
- You are reading with NOLOCK and are seeing an earlier (or perhaps even invalid) version of the row.

Sometimes the issue is that we assume the statement succeeded simply because we didn’t see an error message. This is not always a safe assumption! Not all errors bubble up to the caller, for example:
- There could be an INSTEAD OF trigger that simply didn’t end up performing the update, in which case the application wouldn’t even have an exception to ignore.
- There could be TRY/CATCH or other error handling / rollback mechanisms in your code (in the query or in the application). A CATCH block could easily be ignoring the exception or raising a generic exception of a lower severity, or a rollback could be happening without raising any exceptions.

There are many reasons why it may seem like an update succeeded when validation suggests it didn’t. Usually either you weren’t checking the right place, you checked too quickly, or there was a failure. Hopefully the above gives you a healthy set of things to check if you are ever in this scenario.
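The “swallowed error” cases above are easy to reproduce in any client language. As a minimal sketch (using Python and SQLite purely as stand-ins for an application tier and a database), here is a catch block that eats a constraint violation, leaving the caller convinced the update worked:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT NOT NULL)")
conn.execute("INSERT INTO t (id, val) VALUES (1, 'original')")
conn.commit()

def update_val(new_val):
    # Anti-pattern: the catch block swallows the error, so the caller
    # has no idea the update never happened.
    try:
        conn.execute("UPDATE t SET val = ? WHERE id = 1", (new_val,))
        conn.commit()
    except sqlite3.Error:
        pass  # silently ignored -- no error ever bubbles up

update_val(None)  # violates NOT NULL; fails silently
row = conn.execute("SELECT val FROM t WHERE id = 1").fetchone()
print(row[0])  # still 'original' -- the "update" didn't update
```

The same pattern appears in T-SQL as an empty CATCH block, or in application code that logs nothing and rethrows nothing.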
The post The NTILE Function appeared first on Simple Talk.
NTILE() is a window function that allows you to break a table into a specified number of approximately equal groups, or <bucket count>. For each row in a grouping, the NTILE() function assigns a bucket number representing the group to which the row belongs, starting at one.
The syntax of the NTILE()
function is:
NTILE(<bucket count>) OVER (PARTITION BY <expression list> ORDER BY <sort parameter list>)
The <bucket count> is a literal positive integer or an expression that evaluates to a positive integer. Most of the time it is an integer constant, but the option of using an expression can be a handy trick.
The PARTITION BY
clause divides the result set returned from the FROM
clause into partitions to which the NTILE()
function is applied. This is the way this subclause is used inside the OVER ()
clause in other windowed functions.
The ORDER BY clause specifies the order of rows in each partition to which the NTILE() is applied. Each parameter in the list can have an optional sort order attached to it with the [ASC | DESC] postfix. This is the way this subclause is used inside the OVER () clause in other windowed functions. This ordering is how the rows of the table are scanned to pick up groups.
If the number of rows in the result set is divisible by <bucket count>, the rows are divided evenly among the groups. But if the number of rows is not divisible by <bucket count>, the NTILE() function results in groups of two sizes. The larger groups always come before the smaller groups in the order specified by the ORDER BY clause.
Let’s create some sample data and see how this works. For this example, we’re going to use a pretty small table, but it’s worth mentioning that when you’re doing an NTILE problem in the real world, you generally want a large population. Small sample sizes lead to badly sized groups.
The following statement creates a new table named Population that stores the 10 integers from one to ten:
BEGIN
CREATE TABLE Population
(pop_id INTEGER NOT NULL PRIMARY KEY);

INSERT INTO Population (pop_id)
VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
END;
As a simple example, let’s use two for the bucket count:
SELECT pop_id,
       NTILE(2) OVER (ORDER BY pop_id) AS grouped_pop
FROM Population;
This returns two groups of data as you can see in the grouped_pop
column.
pop_id grouped_pop
----------- --------------------
1 1
2 1
3 1
4 1
5 1
6 2
7 2
8 2
9 2
10 2
The first thing we need to get out of the way is to discuss what this function is not. Too often, programmers who use it for the first time attribute properties to it that it doesn’t have.
The NTILE() function attempts to make all the buckets exactly the same size, but that’s not always possible with a given population. People often mistake this function for a histogram, in which we might want buckets of different sizes: we know we have only a few billionaires in our population, a big middle class, and a relatively small number of truly poor people.
Another problem is that the attribute upon which you’re basing these buckets might not be very distinct. At the extreme, imagine that every employee has exactly the same salary amount. None of the NTILE()
groups based on salary can be expected to be better than any other group.
SQL does not have a built-in median aggregate function. Decades ago there were articles in the trade magazines Database Programming & Design and DBMS on how to write one in standard SQL (and back in 2009, I wrote this article on the subject here on Simple-Talk). We had various clever solutions, and it was a fun problem that bounced back and forth between the two magazines.
It’s very tempting to use the NTILE() function to compute the median by finding the highest value in the first partition and the lowest value in the second partition, adding them, and dividing by two. With the first 10-row example, we get (5 + 6) / 2.0 = 5.5 for the median using this method. I’ll leave it to the reader to write this algorithm in one SQL statement.
Unfortunately, this just does not work. There is no guarantee the first and second partitions will be the same size: if the population isn’t evenly divisible by two, the extra value is thrown into the first partition. It gets even worse if you try dividing the population into three groups; the tops and bottoms are not the same size. The first group has four rows while the other groups have three rows. Ten does not divide evenly by three without a remainder, so NTILE(3) would return the following in our previous query.
pop_id grouped_pop
----------- --------------------
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
10 3
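You can verify these uneven bucket sizes without a SQL Server instance. As a sketch, here is the same experiment in Python against SQLite, which also implements NTILE() (SQLite 3.25 and later); the table and column names mirror the example above:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Population (pop_id INTEGER NOT NULL PRIMARY KEY)")
conn.executemany("INSERT INTO Population (pop_id) VALUES (?)",
                 [(i,) for i in range(1, 11)])

rows = conn.execute("""
    SELECT pop_id, NTILE(3) OVER (ORDER BY pop_id) AS grouped_pop
    FROM Population
""").fetchall()

# 10 rows into 3 buckets -> sizes 4, 3, 3, with the larger bucket first.
sizes = Counter(grouped_pop for _, grouped_pop in rows)
print(sizes[1], sizes[2], sizes[3])  # 4 3 3
```

This also makes the median trap concrete: the "first partition" of NTILE(2) over an odd-sized population is one row bigger than the second, so the boundary values no longer straddle the true median.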
A set is partitioned when all of the subsets in the partitioning union back into the original set, and the intersection of those subsets is empty. The first thing we need to do is to guarantee the values we’re putting into these groups are unique. That’s easy enough to do with a SELECT DISTINCT operation. Then add a group number to those unique rows. Finally, find the minimum and maximum value within each group. This will give you a partitioning of the original data, but remember that you’ve taken the samples based on the bucket size; there are a lot of assumptions being made here.
SELECT salary_grp,
       MIN(salary_amt) AS range_start,
       MAX(salary_amt) AS range_finish
FROM (SELECT X.salary_amt,
             NTILE(5) OVER (ORDER BY X.salary_amt) AS salary_grp
      FROM (SELECT DISTINCT salary_amt
            FROM Personnel) AS X(salary_amt)) AS Grouped
GROUP BY salary_grp;
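As a sketch of how this range query behaves, here is the same pattern run from Python against SQLite with an invented set of salary amounts (the Personnel table and its values are hypothetical, made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel (salary_amt INTEGER NOT NULL)")
# Hypothetical salary data; note the duplicate 30 that SELECT DISTINCT removes.
conn.executemany("INSERT INTO Personnel (salary_amt) VALUES (?)",
                 [(s,) for s in (30, 30, 40, 50, 60, 70, 80,
                                 90, 100, 110, 120, 150)])

# Same shape as the T-SQL query: distinct values, NTILE(5) in a derived
# table, then MIN/MAX per bucket to get the range boundaries.
rows = conn.execute("""
    SELECT salary_grp,
           MIN(salary_amt) AS range_start,
           MAX(salary_amt) AS range_finish
    FROM (SELECT salary_amt,
                 NTILE(5) OVER (ORDER BY salary_amt) AS salary_grp
          FROM (SELECT DISTINCT salary_amt FROM Personnel))
    GROUP BY salary_grp
    ORDER BY salary_grp
""").fetchall()

# 11 distinct values into 5 buckets: the first (larger) bucket gets 3.
for grp, lo, hi in rows:
    print(grp, lo, hi)
```

The bucket boundaries come out as (30–50), (60–70), (80–90), (100–110), (120–150): a partitioning of the distinct values, but with the first range covering more values than the rest.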
The windowing functions in SQL can pick a particular column or expression to reset their calculations. In effect, they repeat the actions of the rest of the function on each of the subsets formed by the partition expression. This is one of those things that, again, is easier to show than to say.
Let’s imagine you have a table of candidates for Mensa, the high IQ society. To get membership, you have to have an IQ in the top 2% of the population. There is a list of acceptable IQ tests that you can submit.
Candidates only need to score high on one test to get membership, but to play it safe, candidates will submit more than one test score. Let’s say that our table looks like this (it references a table named IQ_Tests
that we will not create):
CREATE TABLE Mensa_Candidates
(candidate_name CHAR(25) NOT NULL,
 test_name CHAR(25) NOT NULL,
 -- not including related tables in this example:
 -- REFERENCES IQ_Tests (test_name)
 test_score INTEGER NOT NULL
   CHECK (test_score >= 0),
 PRIMARY KEY (candidate_name, test_name));
Then I will create a bit of data to demo with:
INSERT INTO dbo.Mensa_Candidates
  (candidate_name, test_name, test_score)
VALUES
  ('Person 1','Test A', 100), ('Person 9','Test A', 300),
  ('Person 1','Test B', 100), ('Person 9','Test B', 130),
  ('Person 2','Test A', 130), ('Person 10','Test A', 600),
  ('Person 2','Test B', 200), ('Person 10','Test B', 200),
  ('Person 3','Test A', 300), ('Person 11','Test A', 300),
  ('Person 3','Test B', 120), ('Person 12','Test A', 440),
  ('Person 4','Test B', 133), ('Person 13','Test B', 150),
  ('Person 5','Test A', 400), ('Person 14','Test A', 320),
  ('Person 5','Test B', 100), ('Person 14','Test B', 400),
  ('Person 6','Test A', 300), ('Person 15','Test A', 300),
  ('Person 7','Test A', 130), ('Person 15','Test B', 500),
  ('Person 8','Test B', 104), ('Person 16','Test A', 600);
You could analyze candidates’ test scores by using NTILE() in the following manner. Note that the NTILE() value has to be computed in a derived table, because a window function cannot be referenced directly in the WHERE clause:

SELECT candidate_name, test_name, test_score, test_ranking
FROM (SELECT candidate_name, test_name, test_score,
             NTILE(10) OVER (PARTITION BY test_name
                             ORDER BY test_score) AS test_ranking
      FROM Mensa_Candidates) AS Ranked
WHERE test_ranking = 10;

You can see the top bucket for each test in the output:
candidate_name test_name test_score test_ranking
--------------- ----------- ----------- --------------
Person 16 Test A 600 10
Person 15        Test B      500         10
This says we’re going to partition our data by the test names, in increasing order by test score, so we will get the best 10 percent of candidates for each test.
If a candidate was in more than one upper decile, then their name will appear more than once in the result set, once for each of the tests they took. We could then group this table and find out how many people qualified on more than one test. You can see the same idea at the other end in the test data for Person 1, who has the lowest score on each test. Change the test_ranking criteria to 1, and you will see the bottom 10 percent:
candidate_name test_name test_score test_ranking
--------------- ----------- ----------- --------------
Person 1 Test A 100 1
Person 2 Test A 130 1
Person 5 Test B 100 1
Person 8 Test B 104 1
While the NTILE()
function is not a complete statistical package in itself, you can quickly use it to explore your data without leaving SQL. The other window functions can also be quite useful. And it’s definitely worth taking a few days with some simple sample data to play with them and learn how to use them.
The post First Normal Form Gets No Respect appeared first on Simple Talk.
I was looking at some postings on a SQL newsgroup. The original poster wanted help with a problem he had storing International Classification of Diseases (ICD) values. Or, to put it more accurately, he needed help fixing the problems he had created for himself. The non-table he was creating had a single text column that concatenated exactly four ICD codes.
Without going into much detail,
these are codes used by hospitals, insurance companies, and other places where we need a “Dewey decimal of death and despair” for reporting. They are a mixture of alpha and numeric characters, with minimum punctuation. It should come as no surprise because that is the standard format for many ISO standards. Thanks to Unicode, this limited set of symbols can be used with any alphabet or symbol system allowed in Unicode. The following regex pattern would match 2021 ICD-10-CM codes:
^(?i:[A-TV-Z][0-9][0-9AB](?:\.[0-9A-KXZ](?:[0-9A-EXYZ](?:[0-9A-HX][0-59A-HJKMNP-S]?)?)?)?|U07(?:\.[01])?)$
In English, ICD-10-CM codes consist of 3 to 7 alpha-numeric characters (case-insensitive), and codes longer than three characters have a decimal point between the third and the fourth character [1], e.g., “E11.9” represents Type 2 diabetes mellitus without complications. A list of ICD-10-CM codes can be found in the file icd10cm_order_YEAR.txt
available at the CDC ICD-10-CM web page (https://www.cdc.gov/nchs/icd/Comprehensive-Listing-of-ICD-10-CM-Files.htm). This file’s 7th to 13th characters (the 2nd column) correspond to the 1st to 7th characters of ICD-10-CM codes.
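As a sketch, you can exercise that pattern with any regex engine. Here it is in Python; the standalone (?i:…) scoped flag is replaced by the re.IGNORECASE option, which is equivalent here and works on Python versions that predate scoped inline flags:

```python
import re

# The 2021 ICD-10-CM pattern from the text, with the (?i:...) wrapper
# expressed as a plain group plus re.IGNORECASE.
ICD10CM = re.compile(
    r"^(?:[A-TV-Z][0-9][0-9AB]"
    r"(?:\.[0-9A-KXZ](?:[0-9A-EXYZ](?:[0-9A-HX][0-59A-HJKMNP-S]?)?)?)?"
    r"|U07(?:\.[01])?)$",
    re.IGNORECASE)

print(bool(ICD10CM.match("E11.9")))     # Type 2 diabetes without complications
print(bool(ICD10CM.match("U07.1")))     # the special-cased U07 branch
print(bool(ICD10CM.match("V97.33XD")))  # sucked into jet engine, subsequent
print(bool(ICD10CM.match("E9")))        # too short -- rejected
```

A CHECK constraint doing the equivalent validation would be every bit as intimidating, which was the point.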
You do not need a complete understanding of regular expressions or ICD codes to follow this article, so don’t worry too much about it. The reason for posting the simplified regular expression was to scare you. My point was that this regular expression would be a pretty impressive CHECK
constraint on this column. Shall we be honest? Despite the fact that we know the best programming practice is to detect an error as soon as possible, do you believe that the original poster wrote such a constraint for the concatenated list of ICD codes?
I’m willing to bet that any such validation is being done in an input tier by some poor lonely program, in an application language. Even more likely, it’s not being done at all.
First Normal Form (1NF) says that this concatenated string is a repeated group, and we need to replace it with a proper relational construct. However, instead of getting some help on how to do it right, people posted various bits of code to slice up the original string! This is akin to the old engineering joke: “Don’t force it! Get a bigger hammer!” This attitude is wrong in so many ways.
I’ve already hit on the first problem of Non-First Normal Form (NFNF) data; it is simply complicated. In the SQL standards, a “field” is defined as a part of a column that has some meaning by itself but is not a complete attribute. The classic example is that a date breaks down into (year, month, day) fields, and a timestamp breaks down into (year, month, day, hour, minute, seconds). You get the fields out of temporal datatypes with a statement like EXTRACT(<field name> FROM <temporal expression>
), or in SQL Server DATEPART
. Unfortunately, if you’re going to split up a NFNF column, you have to do all the work yourself.
SQL and the relational model are based on sets and not sequences. This means that (A, B, C) should be the same as (C, A, B) or any of the other permutations of three elements. In the original problem, the poster wanted to require four ICD codes crammed into the column. That means we have 4! = 24 permutations, and suddenly a simple equi-join is growing combinatorially. One way around this would be to sort the fields within each string; this adds a little extra overhead and requires the sort to be redone on any change to the string. Think about an algorithm for finding all patients with a particular diagnosis. It might not be their primary diagnosis, so we must do a little scanning. This is sliding us back into the world of COBOL string-oriented data processing.
This poster made things even worse. He wanted the most severe diagnosis to appear first in the list. This means we are dealing with ordered sequences, not just sets; hence (A, B, C) <> (C, A, B), which makes comparisons and ordering a good bit harder. Now we need a process for ranking the diagnoses. Unfortunately, it’s going to be rather individualized. We can probably all agree that being sucked through a jet engine is a severe medical problem. After that, is mild diabetes more of a problem than high blood pressure? It depends on the patient and the associated medical history. Since we were supposed to come up with four diagnoses, what if the only thing wrong with the patient was being sucked through a jet engine? Do we make up three more conditions? Create a dummy diagnosis? Repeat the jet engine diagnosis three times?
Programming languages have had a formal basis, such as FORTRAN being based on Algebra, LISP on list processing, etc. Data and databases did not get “academic legitimacy” until Dr. Codd invented his relational algebra. It had everything academics love—a set of math symbols, including new ones that would drive the typesetters crazy. However, it also had axioms, thanks to Dr. Armstrong. (https://en.wikipedia.org/wiki/Armstrong%27s_axioms)
The immediate result was a burst of papers using Dr. Codd’s relational algebra. However, the next step for a modern academic is to change or drop one of the axioms to see that you can still have a consistent formal system. In geometry, change the parallel axiom (parallel lines never meet) to something else. For example, the replacement axiom is that two parallel lines (great circles) meet at two points on the surface of a sphere. Spheres are real. Furthermore, we could test the new geometry with a real-world model.
Since 1NF is the basis for RDBMS, it was the one academics played with first. And we have real multi-valued databases to see if it works. Most of the academic work was done by Jaeschke and Schek at IBM and Roth, Korth, and Silberschatz at the University of Texas, Austin. They added new operators to the relational algebra and calculus to handle “nested relations” while keeping the relational model’s abstract set-oriented nature. 1NF is inconvenient for handling data with complex internal structures, such as computer-aided design and manufacturing (CAD/CAM). These applications have to handle structured entities, while the 1NF table only allows atomic values for attributes.
Non-first normal form (NFNF) databases allow a column in a table to hold nested relations, breaking the rule that a column may only contain scalar values drawn from a known domain. In addition to NFNF, these databases are also called NF², NF2, and ¬NF in the literature. Since they are not part of the ANSI/ISO standards, you will find different proprietary implementations and academic notations for their operations.
Consider a simple example of employees and their children. On a normalized schema, the employees would be in one table, and their children would be in a second table that references the parent:
CREATE TABLE Personnel
(emp_name VARCHAR(20) NOT NULL PRIMARY KEY,
 ..);

CREATE TABLE Dependents
(dependent_name VARCHAR(20) NOT NULL PRIMARY KEY,
 emp_name VARCHAR(20) NOT NULL
   REFERENCES Personnel (emp_name)
   -- DRI actions
   ON UPDATE CASCADE
   ON DELETE CASCADE,
 ..);
But in an NFNF schema, the dependents would be in a column with a table type, perhaps something like this:
CREATE NF TABLE Personnel
(emp_name VARCHAR(20) NOT NULL PRIMARY KEY,
 dependents TABLE
   (dependent_name VARCHAR(20) NOT NULL PRIMARY KEY,
    emp_name VARCHAR(20) NOT NULL,
    ..),
 ..);
Which would make for a set of data that you might diagram as:
We can naturally extend the basic set operators UNION, INTERSECTION, and DIFFERENCE, along with subset tests. Extending the relational operations is also relatively easy for PROJECTION and SELECTION. The JOIN operators are a bit harder, but limiting your algebra to the natural join or equijoin makes life easier. The important characteristic is that when these extended relational operations are used on flat tables, they behave like the original relational operations.
To transform this NFNF table back into a 1NF schema, you would use an UNNEST
operator. The unnesting, in this case, would make Dependents into its own table and remove it from Personnel. Although UNNEST
is the mathematical inverse to NEST
, the operator NEST
is not always the mathematical inverse of UNNEST
operations. Let’s start with a simple, abstract nested table:
The UNNEST(<subtable>)
operation will “flatten” a sub-table up one level in the nesting:
The NEST
operation requires a new table name and its columns as a parameter. This is the extended SQL declaration:
NEST (G1, G2(F3, F4))
It fails when we try to “re-nest” this step back to the original table.
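There is no SQL product at hand to run this algebra, but the failure can be sketched in a few lines of Python, modeling a nested table as rows whose last component is a sub-table (the names G1, G2, F1…F4 follow the text; the data values are made up):

```python
from collections import defaultdict

def unnest(table):
    # table: set of (f1, frozenset of (f3, f4) sub-rows) -- flatten one level.
    return {(f1, f3, f4) for f1, sub in table for f3, f4 in sub}

def nest(flat):
    # Group the (f3, f4) columns back into a sub-table per atomic value f1.
    groups = defaultdict(set)
    for f1, f3, f4 in flat:
        groups[f1].add((f3, f4))
    return {(f1, frozenset(sub)) for f1, sub in groups.items()}

# Two rows sharing the atomic value F1 = 1 but with different sub-tables:
original = {(1, frozenset({("a", 10)})),
            (1, frozenset({("b", 20)}))}

round_trip = nest(unnest(original))
print(round_trip == original)  # False: the two rows merged into one
print(len(round_trip))         # 1
```

UNNEST followed by NEST collapses the two original rows into a single row with a merged sub-table, which is exactly why NEST is not always the inverse of UNNEST; partitioned normal form rules out inputs like this one.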
There is also the question of how to order the nesting. We put the dependents inside the Personnel table in the first example. Children are weak entities; they have to have a parent (a strong entity) to exist. However, we could have nested the parents inside the dependents. The problem is that NEST()
does not commute. An operation is commutative when (A ☉ B) = (B ☉ A)
, if you forgot your high school algebra. Let’s start with a simple flat table:
Now, we do two nestings to create sub-tables G2, which holds the F3 column, and G3, which holds the F4 column. First in this order:
NEST(NEST (G1, G2(F3)), G3(F4))
Now in the opposite order:
NEST(NEST (G1, G3(F4)), G2(F3))
The next question is how to handle missing data. What if Herb’s daughter Mary is lactose-intolerant and has no favorite ice cream flavor in the Personnel
table example? The usual NFNF model will require explicit markers instead of a generic missing value.
Another constraint required is for the operators to be bijective, which is covered by partitioned normal form (PNF). This normal form cannot have empty sub-tables, and operations have to be reversible. A relation is in PNF when its atomic attributes are a super-key of the relation and any non-atomic component of a tuple of the relation is also in PNF.
OCCURS
is literally the reason this occurs. COBOL has a clause in its DATA DIVISION
with that keyword which indicates a sub-record. I will assume most of the readers do not know COBOL. The super-quick explanation is that the COBOL DATA DIVISION
is like the DDL in SQL, but the data is kept in strings that have a picture (PIC) clause that shows their display format. In COBOL, display and storage formats are the same. The records and fields in COBOL are very physical, unlike the rows and columns of SQL, which are abstract and virtual.
Records are made of a hierarchy of fields, and the nesting level is shown as an integer at the start of each declaration (numbers increase with depth; the convention is to step by fives). Suppose you wanted to store your monthly sales figures for the year. You could define 12 fields, one for each month, like this:
05 MONTHLY-SALES-1 PIC S9(5)V99.
05 MONTHLY-SALES-2 PIC S9(5)V99.
05 MONTHLY-SALES-3 PIC S9(5)V99.
...
05 MONTHLY-SALES-12 PIC S9(5)V99.
The dash is like an SQL underscore, a period is like a semicolon in SQL, and the picture tells us that each sales amount has a sign, up to five digits for dollars and two digits for cents. You can specify the field once and declare that it repeats 12 times with the simple OCCURS clause, like this:
05 MONTHLY-SALES OCCURS 12 TIMES PIC S9(5)V99.
The individual fields are referenced in COBOL by using subscripts, such as MONTHLY-SALES(1)
. The OCCURS
can also be at the group level, and this is its most useful application. For example, all 25-line items on an invoice (75 fields) could be held in this group:
05 LINE-ITEMS OCCURS 25 TIMES.
   10 ITEM-QUANTITY PIC 9999.
   10 ITEM-DESCRIPTION PIC X(30).
   10 UNIT-PRICE PIC S9(5)V99.
Notice the OCCURS
is listed at the group level, so the entire group occurs 25 times.
There can be nested OCCURS
. Suppose we stock ten products and we want to keep a record of the monthly sales of each product for the past 12 months:
01 INVENTORY-RECORD.
   05 INVENTORY-ITEM OCCURS 10 TIMES.
      10 MONTHLY-SALES OCCURS 12 TIMES PIC 999.
In this case, INVENTORY-ITEM
is a group composed only of MONTHLY-SALES
, which occurs 12 times for each occurrence of an inventory item. This gives an array of 10 × 12 fields. The only information in this record is the 120 monthly sales figures—12 months for each of 10 items.
Notice that OCCURS
defines an array of known size. However, because COBOL is a file system language, it reads fields in records from left to right. Since there is no NULL
, inserting future values that are not yet known requires some coding tricks. The language has the OCCURS DEPENDING ON
option. The computer reads an integer control field and then expects to find that many occurrences of a sub-record following at runtime. Yes, this can get messy and complicated, but look at this simple patient medical treatment history record to get an idea of the possibilities:
01 PATIENT-TREATMENTS.
   05 PATIENT-NAME PIC X(30).
   05 PATIENT-NUMBER PIC 9(9).
   05 TREATMENT-COUNT PIC 99 COMP-3.
   05 TREATMENT-HISTORY OCCURS 0 TO 50 TIMES
        DEPENDING ON TREATMENT-COUNT
        INDEXED BY TREATMENT-POINTER.
      10 TREATMENT-DATE.
         15 TREATMENT-YEAR PIC 9999.
         15 TREATMENT-MONTH PIC 99.
         15 TREATMENT-DAY PIC 99.
      10 TREATING-PHYSICIAN-NAME PIC X(30).
      10 TREATMENT-CODE PIC 999.
The TREATMENT-COUNT has to be handled in the applications to correctly describe the TREATMENT-HISTORY
subrecords. I will not explain COMP-3
(a data type for computations) or the INDEXED BY
clause (array index), since they are not important to my point.
My point is that we had been thinking of data in arrays of nested structures before the relational model. We just had not separated data from computations and presentation layers, nor were we looking for an abstract model of computing yet.
When I described ICD codes as the Dewey decimal of death and despair, I probably should’ve also mentioned that they can be absurd. Here’s a list of the 20 strangest codes in the system. The phrase “subsequent encounter” means it has happened more than once to the same patient. I am not sure how you can be sucked into a jet engine more than once, but we have a code for it.
1. W22.02XD: Walked into lamppost, subsequent encounter
2. W61.33: Pecked by a chicken
3. W61.62XD: Struck by duck, subsequent encounter
4. W55.41XA: Bitten by pig, initial encounter
5. W59.22XA: Struck by turtle
6. R46.1: Bizarre personal appearance
7. Z63.1: Problems in relationship with in-laws
8. V97.33XD: Sucked into jet engine, subsequent encounter
9. R15.2: Fecal urgency
10. Y92.253: Opera house as the place of occurrence of the external cause
11. V91.35XA: Hit or struck by falling object due to accident to canoe or kayak
12. X52: Prolonged stay in weightless environment
13. V94.810: Civilian watercraft involved in water transport accident with military watercraft
14. Y92.241: Hurt at the library
15. Y92.146: Swimming-pool of prison as the place of occurrence of the external cause
16. Y93.D1: Stabbed while crocheting
17. S10.87XA: Other superficial bite of other specified part of neck, initial encounter
18. V91.07XD: Burn due to water-skis on fire, subsequent encounter
19. V00.01XD: Pedestrian on foot injured in collision with roller-skater, subsequent encounter
20. V95.43XS: Spacecraft collision injuring occupant
The post T-SQL Fundamentals: Controlling Duplicates appeared first on Simple Talk.
A great example of a fundamental concept in T-SQL that seems simple but actually involves many subtleties is duplicates. The concept of duplicates is at the heart of relational database design and T-SQL, and understanding what duplicates are is consequential to understanding many of its features. It’s also important to understand the differences between how T-SQL handles duplicates and how relational theory does.
In this article I focus on duplicates and how to control them. In order to understand duplicates correctly, you need to understand the concepts of distinctness and equality, which I’ll cover as well.
In my examples I will use the sample database TSQLV6. You can download the script file to create and populate this database and its ER diagram in a .ZIP file here. (You can find all of my downloads on my website at the following address: https://itziktsql.com/r-downloads)
T-SQL is a dialect of standard SQL, so the obvious place to look for the definition of duplicates is in the standard’s text, assuming you have access to it (you can purchase a copy using the link provided) and the stomach for it. If you do look for the definition of duplicates in the standard’s text, you’ll quickly start wondering how deep the rabbit hole goes.
Here’s the definition of duplicates in the standard:
Duplicates
Two or more members of a multiset that are not distinct
So now you need to figure out what multiset and distinct mean.
Let’s start with multiset. This can get a bit confusing since there’s both a mathematical concept called multiset and a specific feature in the SQL standard called MULTISET
(one of two kinds of collection types: ARRAY
and MULTISET
). T-SQL doesn’t support the standard collection types ARRAY
and MULTISET
. However, the mathematical concept of a multiset is quite important to understand for T-SQL practitioners.
Here’s the description of a multiset from Wikipedia:
Multiset
In mathematics, a multiset (or bag, or mset) is a modification of the concept of a set that, unlike a set, allows for multiple instances for each of its elements.
In other words, whereas a set is a collection of distinct elements, a multiset allows duplicates. Oops…
I’m not sure why they chose in the standard to define duplicates via multisets. To me, it should be the other way around. I think that it’s sufficient to understand the concept of duplicates via distinctness alone, and later talk about what multisets are.
So, back to the definitions section in the SQL standard, here’s the definition of distinctness:
Distinct (of a pair of comparable values)
Capable of being distinguished within a given context.
Hmm… not very helpful. But wait, there’s more! There’s NOTE 8:
NOTE 8 — Informally, two values are distinct if neither is null and the values are not equal. A null value and a nonnull value are distinct. Two null values are not distinct. See Subclause 4.1.5, “Properties of distinct”, and the General Rules of Subclause 8.15, “<distinct predicate>”.
Let’s start with the meaning of a given context. This has to do with the contexts in SQL where a pair of values can be compared. It could be a query filter, a join predicate, a set operator, the DISTINCT
set quantifier in the SELECT
list or an aggregate function, grouping, uniqueness, and so on.
Next, let’s talk about the meaning of a pair of comparable values. The concept of distinctness is relevant to a pair of values that can be compared. Later the standard contrasts this with values of a user-defined type that has no comparison type, in which case distinctness is undefined.
I also need to mention here that when the standard uses the term value, it actually includes both non-NULL
values and NULLs
. To some, myself included, a NULL
is a marker for a missing value, and therefore the term NULL value is actually incorrect. But what could be an inclusive term for both NULL
s and non-NULL
values? I don’t know that there’s a common industry term that is simple, intuitive and accurate. If you know one, let me know! Since our focus is things that can be compared, maybe we’ll just use the term comparands. And by the way, in SQL, the concepts of duplicates and distinctness are of course relevant not just to scalar comparands, but also to row comparands.
What’s left to understand is what capable of being distinguished means. Note 8 goes into details trying to explain distinctness via equality, with exceptions when NULL
s are involved. And as usual, it sends you elsewhere to read about properties of distinct, and the distinct predicate.
What’s crucial to understand here is that there’s an important difference between equality-based comparison (or inequality) and distinctness-based comparison in SQL. You understand the difference using predicate logic. Given comparands c1
and c2
, here are the truth values for the inequality-based predicate c1 <> c2
versus the distinctness-based counterpart c1 is distinct from c2
:
| c1           | c2           | c1 <> c2 | c1 is distinct from c2 |
|--------------|--------------|----------|------------------------|
| non-NULL X   | non-NULL X   | false    | false                  |
| non-NULL X   | non-NULL Y   | true     | true                   |
| any non-NULL | NULL         | unknown  | true                   |
| NULL         | any non-NULL | unknown  | true                   |
| NULL         | NULL         | unknown  | false                  |
As you can see, with inequality, when both comparands are non-NULL
, if they are the same the comparison evaluates to false and if they are different it evaluates to true. If any of the comparands is NULL
, including both, the comparison evaluates to the truth value unknown.
With distinctness, when both comparands are non-NULL
, the behavior is the same as with inequality. When you have NULL
s involved, they are basically treated just like non-NULL
values. Meaning that when both comparands are NULL
, IS DISTINCT FROM
evaluates to false, otherwise to true.
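The contrast between the two predicates is easy to model outside of SQL. Here is a minimal Python sketch of the semantics only (not how SQL Server implements them), with None standing in for NULL and the string 'unknown' standing in for the truth value unknown:

```python
def neq3(c1, c2):
    """Three-valued <>: the result is unknown if either comparand is NULL (None)."""
    if c1 is None or c2 is None:
        return "unknown"
    return c1 != c2

def is_distinct_from(c1, c2):
    """IS DISTINCT FROM: NULLs are treated just like ordinary values."""
    if c1 is None and c2 is None:
        return False  # two NULLs are not distinct
    if c1 is None or c2 is None:
        return True   # a NULL and a non-NULL value are distinct
    return c1 != c2
```

Running the pairs from the table above through these two functions reproduces its rows; for instance, neq3(None, None) returns 'unknown' while is_distinct_from(None, None) returns False.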
Similarly, given comparands c1
and c2
, here are the truth values for the equality-based predicate c1 = c2
versus the distinctness-based counterpart c1
is not distinct from c2
:
| c1           | c2           | c1 = c2 | c1 is not distinct from c2 |
|--------------|--------------|---------|----------------------------|
| non-NULL X   | non-NULL X   | true    | true                       |
| non-NULL X   | non-NULL Y   | false   | false                      |
| any non-NULL | NULL         | unknown | false                      |
| NULL         | any non-NULL | unknown | false                      |
| NULL         | NULL         | unknown | true                       |
As you can see, with equality, when both comparands are non-NULL
, if they are the same the comparison evaluates to true and if they are different it evaluates to false. If any of the comparands is NULL
, including both, the comparison evaluates to the truth value unknown.
With non-distinctness, when both comparands are non-NULL
, the behavior is the same as with equality. When you have NULL
s involved, they are basically treated just like non-NULL
values. Meaning that when both comparands are NULL
, IS NOT DISTINCT FROM
evaluates to true, otherwise to false.
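The same kind of sketch works for the equality pair. Again, this is just an illustration in Python of the semantics, with None for NULL:

```python
def eq3(c1, c2):
    """Three-valued =: the result is unknown if either comparand is NULL (None)."""
    if c1 is None or c2 is None:
        return "unknown"
    return c1 == c2

def is_not_distinct_from(c1, c2):
    """IS NOT DISTINCT FROM: two NULLs match; a NULL never matches a non-NULL."""
    if c1 is None and c2 is None:
        return True
    if c1 is None or c2 is None:
        return False
    return c1 == c2
```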
In case you’re not already aware of this, starting with SQL Server 2022, T-SQL supports the explicit form of the standard distinct predicate, using the syntax <comparand 1> IS [NOT] DISTINCT FROM <comparand 2>
. You can find the details here.
As you can gather, trying to understand things from the standard can be quite an adventure.
So let’s try to simplify our understanding of duplicates.
First, you need to understand the concept of distinctness. This is done using predicate logic by understanding the difference between equality-based comparison and distinctness-based comparison, and familiarity with the distinct predicate and its rules.
These concepts are not trivial to digest for the uninitiated, but they are critical for a correct understanding of the concept of duplicates.
Assuming you have the concept of distinctness figured out, you can then understand duplicates.
Duplicates
Comparands c1 and c2 are duplicates if c1
is not distinct from c2
. That is, if the predicate c1 IS NOT DISTINCT FROM c2
evaluates to true.
The comparands c1
and c2
can be scalar comparands, in contexts like filter and join predicates, or row comparands, in contexts like the DISTINCT
quantifier in the SELECT
list and set operators.
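For row comparands the definition applies column by column: two rows are duplicates when every pair of corresponding values is not distinct. A hedged Python sketch of that rule (None again stands in for NULL):

```python
def is_not_distinct_from(v1, v2):
    # Two NULLs (None) match; a NULL never matches a non-NULL value.
    if v1 is None and v2 is None:
        return True
    if v1 is None or v2 is None:
        return False
    return v1 == v2

def rows_are_duplicates(row1, row2):
    """Rows are duplicates if all corresponding values are pairwise not distinct."""
    return len(row1) == len(row2) and all(
        is_not_distinct_from(a, b) for a, b in zip(row1, row2))
```

Under this rule, ('UK', None, 'London') is a duplicate of ('UK', None, 'London'), even though an equality-based comparison of the NULL region values would evaluate to unknown.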
Even though standard SQL and the T-SQL dialect, which is based on it, are founded in relational theory, they deviate from it in a number of ways, one of which is the handling of duplicates. The main structure in relational theory is a relation. Relational expressions operate on relations as inputs and emit a relation as output.
The counterpart to a relation is a table in SQL. Similar to relational expressions, table expressions, such as ones defined by queries, operate on tables as inputs and emit a table as output.
A relation has a heading and a body. The heading of a relation is a set of attributes. Similarly, the heading of a table is a set of columns. There are interesting differences between the two, but that’s not the focus of this article.
The body of a relation is a set of tuples. Recall that a set has no duplicates. This means that by definition, a relation must have at least one candidate key.
Unlike the body of a relation, the body of a table is a multiset of rows. Recall, a multiset is similar to a set, only it does allow duplicates. Indeed, you don’t have to define a key in a table (no primary key or unique constraint), and if you don’t, the table can have duplicate rows.
Furthermore, even if you do have a key defined in a table, a table expression that is based on that base table, unlike a relational expression, does not by default eliminate duplicates from the result table that it emits.
Suppose that you need to project the countries where you have employees. In relational theory you formulate a relational expression doing so, and you don’t need to be explicit about the fact that you don’t want duplicate countries in the result. This is implied from the fact that the outcome of a relational expression is a relation. Suppose you try achieving the same in T-SQL using the following table expression:
SELECT country FROM HR.Employees
As an aside, you might be wondering now why I didn’t terminate this expression, despite the fact that I keep telling people how important it is to terminate all statements in T-SQL as a best practice. Well, you terminate statements in T-SQL. Statements do something. My focus now is the table expression returning a table with the countries where you have employees. The expression is the query part without the terminator, and is the part that can be nested in more elaborate table expressions.
Back to our discussion about duplicates. Despite the fact that the underlying Employees table has a key and hence no duplicates, this table expression, which is based on Employees, does have duplicates in its result.
Run the following code:
USE TSQLV6;

SELECT country
FROM HR.Employees;
You get the following output:
Of course, T-SQL does give you tools to eliminate duplicates in a table expression if you want to, it’s just that in some cases it doesn’t do so by default. In the above example, as you know well, you can use the DISTINCT
set quantifier for this purpose. People often learn a dialect of SQL like T-SQL without learning the theory behind it. Many are so used to the fact that returning duplicates is the default behavior, that they don’t realize that that’s not really normal in the underlying theory.
If I tried covering all aspects of controlling duplicates in T-SQL, I’d probably easily end up with dozens of pages. To make this article more approachable, I’ll focus on features that involve using quantifiers to allow or restrict duplicates.
T-SQL allows you to apply a set quantifier ALL
| DISTINCT
to the SELECT
list of a query. The default is ALL
if you don’t specify a quantifier.
The earlier query returning the countries where you have employees, without removal of duplicates, is equivalent to the following:
SELECT ALL country
FROM HR.Employees;
As you know, you need to explicitly use the DISTINCT quantifier to remove duplicates, like so:
SELECT DISTINCT country
FROM HR.Employees;
This code returns the following output:
Similar to the SELECT
list’s quantifier, you can apply a set quantifier ALL
| DISTINCT
to the input of an aggregate function. Also here the ALL
quantifier is the default if you don’t specify one explicitly. With the ALL
quantifier—whether explicit or implied—redundant duplicates are retained. The following two queries are logically equivalent:
SELECT COUNT(country) AS cnt
FROM HR.Employees;

SELECT COUNT(ALL country) AS cnt
FROM HR.Employees;
Both queries return a count of 9 since there are nine rows where country is not NULL
.
Again, you can use the DISTINCT
quantifier if you want to remove redundant duplicates, like so:
SELECT COUNT(DISTINCT country) AS cnt
FROM HR.Employees;
This query returns 2 since there are two distinct countries where you have employees.
Note that at the time of writing, T-SQL supports the DISTINCT
quantifier with grouped aggregate functions, but not with windowed aggregate functions. There are workarounds, but they are far from being trivial.
Set operators allow you to combine data from two input table expressions. The SQL standard supports three set operators UNION
, INTERSECT
and EXCEPT
, each with two possible quantifiers ALL
| DISTINCT
. With set operators the DISTINCT
quantifier is the default if you don’t specify one explicitly.
At the time of writing, T-SQL supports only a subset of the standard set operators. It supports UNION
(implied DISTINCT
), UNION
ALL
, INTERSECT
(implied DISTINCT
) and EXCEPT
(implied DISTINCT
). It doesn’t allow you to be explicit with the DISTINCT
quantifier, although that’s the behavior that you get by default, and it supports the ALL
option only with the UNION ALL
operator. It currently does not support the INTERSECT ALL
and EXCEPT ALL
operators. I’ll explain all standard variants, and for the missing ones in T-SQL, I’ll provide workarounds.
You apply a set operator to two input table expressions, which I’ll refer to as TE1 and TE2:
TE1 <set operator> TE2
The above represents a table expression.
A statement based on a table expression with a set operator can have an optional ORDER BY
clause applied to the result, using the following syntax:
TE1 <set operator> TE2 [ORDER BY <order by list>];
You’re probably familiar with the set operators that T-SQL supports. Still, let me briefly explain what each operator does:
- UNION: Returns distinct rows that appear in TE1, TE2, or both. That is, if row R appears in TE1, TE2, or both, irrespective of the number of occurrences, it appears exactly once in the result.
- UNION ALL: Returns all rows that appear in TE1, TE2, or both. That is, if row R appears m times in TE1 and n times in TE2, it appears m + n times in the result.
- INTERSECT: Returns distinct rows that are common to both TE1 and TE2. That is, if row R appears at least once in TE1, and at least once in TE2, it appears exactly once in the result.
- INTERSECT ALL: Returns all rows that are common to both TE1 and TE2. That is, if row R appears m times in TE1 and n times in TE2, it appears minimum(m, n) times in the result. For example, if R appears 5 times in TE1 and 3 times in TE2, it appears 3 times in the result.
- EXCEPT: Returns distinct rows that appear in TE1 but not in TE2. That is, if a row R appears in TE1, irrespective of the number of occurrences, and does not appear in TE2, it appears exactly once in the output.
- EXCEPT ALL: Returns all rows that appear in TE1 but don’t have an occurrence match in TE2. That is, if a row R appears m times in TE1, and n times in TE2, it appears maximum((m - n), 0) times in the result. For example, if R appears 5 times in TE1 and 3 times in TE2, it appears 2 times in the result. If R appears 3 times in TE1 and 5 times in TE2, it doesn’t appear in the result.

An interesting question is why you would use a set operator to handle a given task as opposed to alternative tools that combine data from multiple tables, such as joins and subqueries. To me, one of the main benefits is the fact that when set operators compare rows, they implicitly use distinctness-based comparison and not equality-based comparison. Recall that distinctness-based comparison handles NULLs and non-NULL values the same way, essentially using two-valued logic instead of three-valued logic. That’s often the desired behavior, and with set operators it simplifies the code a great deal. For example, the following code identifies distinct locations that are both customer locations and employee locations:
SELECT country, region, city
FROM Sales.Customers

INTERSECT

SELECT country, region, city
FROM HR.Employees;
This code generates the following output:
Since set operators implicitly use the distinct predicate to compare rows (not to be confused with the fact that without an explicit quantifier they use the DISTINCT
quantifier by default), you didn’t need to do anything special to get a match when comparing two NULL
s and a nonmatch when comparing a NULL
with a non-NULL
value. The location UK, NULL
, London is part of the result since it appears in both inputs.
Also, with no explicit quantifier specified, a set operator uses an implicit DISTINCT
quantifier by default. Remember that with INTERSECT
, as long as at least one occurrence of a row appears in both sides, INTERSECT
returns one occurrence of the row in the result.
As an exercise, I urge you to write a logically equivalent solution to the above code, using either joins or subqueries. Of course it’s doable, but not this concisely.
As mentioned, the standard also supports an ALL
version of INTERSECT
. For example, the following standard query returns all occurrences of locations that are both customer locations and employee locations (don’t run it against SQL Server since it’s not supported in T-SQL):
SELECT country, region, city
FROM Sales.Customers

INTERSECT ALL

SELECT country, region, city
FROM HR.Employees;
If you want to use a solution that is supported in T-SQL, the trick is to compute row numbers in each input table expression to number the duplicates, and then apply the operation to the inputs including the row numbers. You can then exclude the row numbers from the result by using a named table expression like a CTE. Here’s the complete code:
WITH C AS
(
  SELECT country, region, city,
         ROW_NUMBER() OVER(PARTITION BY country, region, city
                           ORDER BY (SELECT NULL)) AS rownum
  FROM Sales.Customers

  INTERSECT

  SELECT country, region, city,
         ROW_NUMBER() OVER(PARTITION BY country, region, city
                           ORDER BY (SELECT NULL)) AS rownum
  FROM HR.Employees
)
SELECT country, region, city
FROM C;
This code generates the following output:
The location UK
, NULL
, London
appears 6 times in the first input table expression and 4 times in the second, therefore 4 occurrences intersect.
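The occurrence arithmetic of INTERSECT ALL can be sketched with Python’s collections.Counter, which models a multiset; its & operator takes the minimum of the two counts, which is exactly the minimum(m, n) rule. This is an illustration of the logic only, not of how SQL Server implements it:

```python
from collections import Counter

# Occurrence counts for the location ('UK', NULL, 'London') in each input.
customers = Counter({('UK', None, 'London'): 6})
employees = Counter({('UK', None, 'London'): 4})

# INTERSECT ALL keeps minimum(m, n) occurrences of each row.
intersect_all = customers & employees
print(intersect_all[('UK', None, 'London')])  # 4
```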
You can handle an except need very similarly. If you’re interested in an except distinct operation, you use the EXCEPT
(implied DISTINCT
) in T-SQL. For example, the following code returns distinct employee locations that are not customer locations:
SELECT country, region, city
FROM HR.Employees

EXCEPT

SELECT country, region, city
FROM Sales.Customers;
This code generates the following output:
If you wanted all employee locations that per occurrence don’t have a matching customer location, the standard code for this looks like so:
SELECT country, region, city
FROM HR.Employees

EXCEPT ALL

SELECT country, region, city
FROM Sales.Customers;
However, T-SQL doesn’t support the ALL
quantifier with the EXCEPT
operator. You can use a similar trick to the one you used to achieve the equivalent of INTERSECT ALL
. You can achieve the equivalent of EXCEPT ALL
by applying EXCEPT
(implied DISTINCT
) to inputs that include row numbers that number the duplicates, like so:
WITH C AS
(
  SELECT country, region, city,
         ROW_NUMBER() OVER(PARTITION BY country, region, city
                           ORDER BY (SELECT NULL)) AS rownum
  FROM HR.Employees

  EXCEPT

  SELECT country, region, city,
         ROW_NUMBER() OVER(PARTITION BY country, region, city
                           ORDER BY (SELECT NULL)) AS rownum
  FROM Sales.Customers
)
SELECT country, region, city
FROM C;
This code generates the following output:
Curiously, Seattle appears once in this result of except all but didn’t appear at all in the result of the except distinct version. You might initially think that there’s a bug in the code, but think carefully about what could explain this. There are two employees from Seattle and one customer from Seattle. The except distinct operation isn’t supposed to return any occurrences in the result, yet the except all operation is indeed supposed to return one occurrence.
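You can verify the Seattle arithmetic with the same Counter model of a multiset; Counter subtraction floors at zero, matching the maximum(m - n, 0) rule of EXCEPT ALL. Again, this is just a sketch of the logic:

```python
from collections import Counter

# Two employees from Seattle, one customer from Seattle.
employees = Counter({('USA', 'WA', 'Seattle'): 2})
customers = Counter({('USA', 'WA', 'Seattle'): 1})

# EXCEPT ALL keeps maximum(m - n, 0) occurrences of each row.
except_all = employees - customers  # Counter subtraction drops non-positive counts

# EXCEPT (implied DISTINCT) keeps a row only if it appears in TE1
# and not at all in TE2.
except_distinct = set(employees) - set(customers)

print(except_all[('USA', 'WA', 'Seattle')])  # 1
print(except_distinct)                       # set()
```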
Without proper understanding of the foundations of T-SQL—primarily relational theory and its own roots—it’s hard to truly understand what you’re dealing with. In this article I focused on duplicates, a concept that to most seems trivial and intuitive. However, as it turns out, a true understanding of duplicates in T-SQL is far from trivial.
You need to understand the differences between how relational theory treats duplicates versus SQL/T-SQL. A relation’s body doesn’t have duplicates whereas a table’s body can have those.
It’s very important to understand the difference between distinctness-based comparison, such as when using the distinct predicate explicitly or implicitly, versus equality-based comparison. Then you realize that in T-SQL, two comparands are duplicates when one is not distinct from the other.
You also need to understand the nuances of how T-SQL handles duplicates, the cases where it retains redundant ones versus cases where it removes those. You also need to understand the tools that you have to change the default behavior using quantifiers such as DISTINCT
and ALL
.
I discussed controlling duplicates in a query’s SELECT
list, aggregate functions and set operators. But there are other language elements where you might need to control them, such as handling ties with the TOP
filter, window functions, and others.
What I hope that you take away from this article is the significance of investing time and energy in learning the fundamentals. And if you haven’t had enough of T-SQL Fundamentals, and I’m allowed a shameless plug, check out my new book T-SQL Fundamentals 4th Edition.
May the 4th be with you!
The post T-SQL Fundamentals: Controlling Duplicates appeared first on Simple Talk.
The post The GROUP BY Clause appeared first on Simple Talk.

When you’re learning SQL DML, the most complicated clause is typically the GROUP BY. It’s a fairly simple grouping based on values from the FROM clause in a SELECT statement. It’s what a mathematician would call an equivalence relation. This means that each grouping, or equivalence class, has the same value for a given characteristic function and the rows are all partitioned into disjoint subsets (groups) that have the same value under that function. The results table can also have columns that give group characteristics; I will get into those in a minute.
In the case of the usual GROUP BY
, the function is equality. (More or less. I’ll get to that in a bit, too, but if you know anything about relational programming you might suspect NULL
to be involved). Another possible function is modulus arithmetic. Taking MOD (<integer expression>, 2)
splits the results into odd and even groups, for example.
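That modulus characteristic function is easy to picture in Python. The sketch below (an illustration, not SQL) partitions values into disjoint equivalence classes keyed by the function’s result:

```python
def partition_by(values, key):
    """Partition values into disjoint equivalence classes under a characteristic function."""
    groups = {}
    for v in values:
        groups.setdefault(key(v), []).append(v)
    return groups

# MOD(x, 2) splits the values into an odd group and an even group.
print(partition_by([1, 2, 3, 4, 5], lambda x: x % 2))
# {1: [1, 3, 5], 0: [2, 4]}
```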
Because SQL is an orthogonal language, we can actually do some fancy tricks with the GROUP BY
clause. The term orthogonality refers to the property of a computer language which allows you to use any expression that returns a valid result anywhere in the language. Today, you take this property for granted in modern programming languages, but this was not always the case. The original FORTRAN
allowed only certain expressions to be used as array indexes (it has been too many decades, but some of the ones allowed were <integer constant>
, <integer variable>
, and <integer constant>*<integer variable>
); this was due to the operations allowed by the hardware registers of the early IBM machines upon which FORTRAN
was implemented.
NULL
values have always been problematic. One of the basic rules in SQL is that a NULL
value does not equal anything including another NULL
. This implies each row should either be excluded or form its own singleton group when you use equality as your characteristic function.
We discussed this problem in the original ANSI X3H2 Database Standards Committee. There was one member’s company SQL which grouped using strict equality. They ran into a problem with a customer’s database involving traffic tickets. If an automobile did not have a tag, then obviously the correct data model would have been to use a NULL
value. Unfortunately, in the real world, this meant every missing tag became its own group. This is not too workable in the state of California. A simple weekly report quickly became insanely long and actually hid information.
When an automobile was missing a tag, the convention had been put in something a human being could read and they picked “none” as the dummy value. Then along came somebody who got a prestige tag that read “NONE” to be cute. The system cheerfully dumped thousands of traffic tickets on to him as soon as his new tag got into the system. Other members had similar stories.
This led us to the equivalence relationship which I will call grouping. It acts just like equality for non-NULL
values, but it treats NULL
values as if they are all equal (the IS [NOT] DISTINCT FROM infixed comparison operator did not exist at the time).
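The traffic-ticket story maps neatly onto a Python dictionary, where None is a perfectly good key: all missing tags collapse into one group, while a real tag that happens to read 'NONE' stays separate. This is a sketch of grouping semantics only, with None standing in for NULL:

```python
tickets = ['ABC123', None, 'NONE', None, 'ABC123']

# Group the tickets by tag using grouping semantics:
# every None lands in the same group (unlike strict equality),
# and the literal string 'NONE' remains its own, real tag value.
groups = {}
for tag in tickets:
    groups.setdefault(tag, []).append(tag)

print(len(groups))          # 3 groups: 'ABC123', None, 'NONE'
print(len(groups[None]))    # 2 -- both missing tags grouped together
print(len(groups['NONE']))  # 1 -- the prestige tag stands alone
```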
The skeleton syntax of a simple SELECT
statement with a GROUP BY
clause is
SELECT <group column expression list>
FROM <table expression>
[WHERE <row search condition>]
GROUP BY <column expression list>
[HAVING <group search condition>];
Here are the basic characteristics of this construct:
- The GROUP BY clause can only be used with a SQL SELECT statement.
- The GROUP BY clause must come after the WHERE clause (if the query has one; if the WHERE clause is absent, then the whole table is treated as if it’s one group).
- The GROUP BY clause must come before the ORDER BY clause (if the query has one).
- To filter the GROUP BY results, you must use the HAVING clause after the GROUP BY.
- The GROUP BY clause is often used in conjunction with aggregate functions.
- Any non-aggregated column in the SELECT clause should also appear in the GROUP BY clause, whether you have an aggregate function or not.

Note that using a GROUP BY clause is meaningless if there are no duplicates in the columns you are grouping by.
The original GROUP BY
clause came with a set of simple descriptive statistics; the COUNT
, AVG
, MIN
, MAX
, and SUM
functions. Technically, they all have the option of a DISTINCT
or ALL
parameter qualifier. One of the weirdnesses of SQL is that various constructs can have parentheses for parameters or lists, and these lists can have SQL keywords inside of the parameters. Let’s look at these basic functions one at a time.
COUNT([ALL | DISTINCT] <expression>) or COUNT(*)
Returns an integer between zero and the count of the rows in this group. If the expression returns NULL, it is ignored in the count. The ALL
option is redundant, and I’ve never seen anybody use this in the real world. The DISTINCT
option removes redundant duplicates before applying the function. The *
(asterisk) as the parameter applies only to the COUNT()
function for obvious reasons. It is sort of a wildcard that stands for “generic row”, without regard to the columns that make up that row.
SUM([ALL | DISTINCT] <expression>)
This is a summation of a numeric value after the NULL
values have been removed from the set. Obviously, this applies only to numeric data. As you probably know, SQL has a lot of different kinds of numeric data. I strongly suggest taking a little time to see how your particular SQL product returns the result. The scale and precision may vary from product to product.
AVG()
This function is a version of the simple arithmetic mean. Obviously, this function would only apply to columns with numeric values. It sums all the non-NULL
values in a set, then divides that number by the number of non-NULL
values in that set to return the average value. It’s also wise to be careful about NULL
values; consider a situation where you have a table that models employee compensation. This compensation includes a base salary and bonuses. But only certain employees are eligible for bonuses. Unqualified employees have rows that show the bonus_amt
column value as NULL
. This lets us maintain the difference between an employee who is not qualified and somebody who just didn’t get any bonus ($0.00) in this paycheck. The query should look like this skeleton:
SELECT emp_id,
       SUM(salary_amount + COALESCE(bonus_amount, 0.00)) AS total_compensation_amount
FROM Paychecks
GROUP BY emp_id;
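The reason for the COALESCE becomes clear if you model the two averaging behaviors directly. This Python sketch (an illustration of the semantics only) contrasts AVG’s NULL-ignoring rule with treating a missing bonus as zero dollars:

```python
bonuses = [100.0, None, 50.0, None]  # None = employee not eligible for a bonus

# AVG(bonus_amount): NULLs are removed before the mean is computed.
non_null = [b for b in bonuses if b is not None]
avg_ignoring_nulls = sum(non_null) / len(non_null)

# AVG(COALESCE(bonus_amount, 0.00)): a missing bonus counts as $0.00.
avg_with_coalesce = sum(b if b is not None else 0.0 for b in bonuses) / len(bonuses)

print(avg_ignoring_nulls)  # 75.0
print(avg_with_coalesce)   # 37.5
```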
MIN() and MAX()
These functions are called extrema functions in mathematics. Since numeric, string and temporal functions have ordering to them, they can be used by these two functions. Technically, you can put in the ALL
and DISTINCT
in the parameter list; which is somewhat absurd.
The reason for picking the small set of descriptive statistics was the ease of implementation. They are still quite powerful and there’s a lot of cute tricks you can try with them. For example, if (MAX(x) = MIN(x)
) then we know the (x
) column has one and only one value in it. Likewise, (COUNT(*) = COUNT(x)
) tells us that column x
does not have any NULL
values. But another reason for this selection of aggregate functions is that some of these statistics are already being collected by cost-based optimizers. We essentially got them for free.
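Both tricks are one-liners when a column is modeled as a Python list with None for NULL (a sketch of the logic, not of the optimizer):

```python
x = [7, 7, None, 7]  # a column with one repeated value and one NULL

count_star = len(x)                           # COUNT(*): counts all rows
count_x = sum(1 for v in x if v is not None)  # COUNT(x): ignores NULLs
non_null = [v for v in x if v is not None]

print(max(non_null) == min(non_null))  # True  -> exactly one distinct non-NULL value
print(count_star == count_x)           # False -> column x has NULLs
```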
The SQL:2006 Standard added some more descriptive aggregate functions to the language. I’ve actually never seen anybody use them, but they’re officially there. Various SQL products have added other functions on top of these, but let’s go ahead and make a short list of some of what’s officially there.
VAR_POP
This is also written as simply VAR()
or VARP()
. It’s a statistic called the variance, which is defined as the sum of squares of the difference of <value expression>
and the <independent variable expression>
, divided by the number of rows.
It is a statistical measurement of the spread between numbers in a data set. More specifically, variance measures how far each number in the set is from the mean, and thus from every other number in the set. We write it as σ² in mathematics. Variance tells you the degree of spread in your data set. The more spread the data, the larger the variance is in relation to the mean. You can see why this might be useful for an optimizer. We also added VAR_SAMP() to compute the sample variance.
STDEV_POP
This is the population standard deviation. The standard deviation (or σ) is a measure of how dispersed the data is in relation to the mean. It is the square root of the variance, so you can think of it as another way of getting the same information as the variance, but expressed in the same units as the data.
A standard deviation close to zero indicates that data points are close to the mean, whereas a high standard deviation indicates that data points are spread out over a wide range of values.
We also have another function defined for a sample population, STDEV_SAMP(). There are no surprises here; as you can see, the “_SAMP” postfix is consistent.
There are other functions that were added to the standard that deal with regression and correlation. They are even more obscure and are rarely, if ever, actually used in the real world.
Interestingly enough, the standard did not build in the mode (most frequently occurring value in a set) or the median (middle value of a set). The mode has a problem: there can be several of them. If a population has an equal number of several different values, you get a multi-modal distribution, and SQL functions do not like to return non-scalar values. The median has a similar sort of problem. If the set of values has an odd number of elements in it, then the median is pretty well defined as the value which sits dead center when you order the values. However, if you have an even number of values, then it gets a little harder. You have to take the average of the values in the middle of the ordering. Let’s say you have a column with values {1, 2, 2, 3, 3, 3}. The values in the middle of this list are {2, 3}, which average out to 2.5. But if I weight those two middle values by how many times each occurs, I get (2×2 + 3×3)/5 = 13/5 = 2.6 instead. This second weighted average is actually more accurate because it shows the slight skew toward 3.
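The arithmetic checks out in Python: statistics.median gives the simple middle average, while weighting the two middle values by their frequencies gives the skewed figure (a quick verification sketch):

```python
from statistics import median

values = [1, 2, 2, 3, 3, 3]

simple = median(values)  # average of the two middle values {2, 3}

# Weight the two middle values by their frequencies: 2 occurs twice, 3 occurs three times.
weighted = (2 * 2 + 3 * 3) / 5

print(simple)    # 2.5
print(weighted)  # 2.6
```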
OLAP, or “Online Analytical Processing”, became a fad in the mid-1990s. Several products devoted to this sort of statistical work hit the market, and SQL decided to add extensions that would allow it to be used in place of a dedicated package. This added more descriptive statistics to SQL.
Note: I would be very careful about using SQL as a statistical reporting language. We never intended it for this purpose, so it’s hard to guarantee that the corrections for floating-point rounding errors and other things that need to go into a good statistical package are going to be found in your SQL product.
GROUPING SETS
The grouping sets construct is the basis for CUBE
and ROLLUP
. You can think of it as shorthand for a series of UNION
queries that are common in reports. But since a SELECT
statement must return a table, we have to do some padding with generated NULL
values to keep the same number of columns in each row. For example:
SELECT dept_name, job_title, COUNT(*)
FROM Personnel
GROUP BY GROUPING SETS (dept_name, job_title);
This will give us a count for each department as a whole and a count for each job title as a whole. You can think of it as a shorthand for
SELECT dept_name, CAST(NULL AS VARCHAR(30)) AS job_title, COUNT(*)
FROM Personnel
GROUP BY dept_name

UNION ALL

SELECT CAST(NULL AS VARCHAR(30)) AS dept_name, job_title, COUNT(*)  -- assuming VARCHAR(30) name columns
FROM Personnel
GROUP BY job_title;
If you think about it for a minute, you’ll see there’s a little problem here. I don’t know if the NULL values that I’ve created with my CAST() function calls were in the original data or not. That’s why we have a GROUPING(<grouping column name>) function to test for it. It returns zero if the NULL was in the original data and one if it was generated, and therefore belongs to a subgroup.
For example,
SELECT CASE GROUPING(dept_name)
         WHEN 1 THEN 'Department Total'
         ELSE dept_name
       END AS dept_name,
       CASE GROUPING(job_title)
         WHEN 1 THEN 'Job Total'
         ELSE job_title
       END AS job_title,
       COUNT(*)
FROM Personnel
GROUP BY GROUPING SETS (dept_name, job_title);
I’m a little ashamed of this example because it shows me using SQL display formatting on a result. This violates the principle of a tiered architecture.
ROLLUP
The ROLLUP
subclause can be defined with the GROUPING SET
construct, which is why I introduced GROUPING SET
first. In reality, the ROLLUP
is our good old hierarchical report in a new suit. We used to call these breakpoint, or control-break, reports in the old days of sequential file processing. The report program set up a bunch of registers to keep running aggregates (usually totals). Every time you passed a control point in the input sequence of records, you dumped the registers, reset them, and began again. Because of sequential processing, the lowest level in the hierarchy would print out first, then the next level of aggregation would appear, and so on until you got the grand totals.
Consider GROUP BY ROLLUP (state_code, county_name, emp_id)
as shorthand for
GROUP BY GROUPING SETS
  ((state_code, county_name, emp_id),
   (state_code, county_name),
   (state_code),
   ())  -- Entire table
Please notice that the order of those columns in the ROLLUP
clause is important. This will give you an aggregate for every employee within each county of each state, an aggregate for every county within each state, that same aggregate for each state and finally a grand aggregate for the entire data set (that is what the empty parentheses mean).
CUBE
The cube supergroup is another SQL-99 extension which is really an old friend with a new name. We used to call it “cross tabs”, which is short for cross tabulation. In short, it creates unique groups for all possible combinations of the columns you specify. For example, if you use GROUP BY CUBE (column1, column2) on your table, SQL returns groups of the forms (column1, column2), (NULL, column2), (column1, NULL) and (NULL, NULL).
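Where ROLLUP expands to prefixes, CUBE expands to every subset of the grouping columns, 2^n grouping sets in all; in Python:

```python
from itertools import combinations

def cube(columns):
    """CUBE(c1, ..., cn) expands to all 2^n subsets of the grouping columns."""
    return [subset
            for size in range(len(columns), -1, -1)
            for subset in combinations(columns, size)]

print(cube(['column1', 'column2']))
# [('column1', 'column2'), ('column1',), ('column2',), ()]
```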
This is just a skim over the options available in the GROUP BY
clause. Anytime you have a query that works on the data with an equivalence relation, there is a pretty good chance you will be able to do it using a GROUP BY.
As a quick programming exercise, I recently saw a post on one of the SQL forums by a less-experienced programmer. He wanted to UPDATE
a flag column (Yes, flags are a bad idea in SQL) from 0 to 1, if any row in the group had a one. The code got fairly elaborate because he had to destroy data as he overwrote existing rows.
Can you write a simple piece of SQL that will give us this information, using a GROUP BY? It definitely is possible.
The GROUP BY Clause
When you’re learning SQL DML, the most complicated clause is typically the GROUP BY
. It’s a fairly simple grouping based on values from the FROM
clause in a SELECT
statement. It’s what a mathematician would call an equivalence relation. This means that each grouping, or equivalence class, has the same value for a given characteristic function, and the rows are all partitioned into disjoint subsets (groups) that have the same value under that function. The results table can also have columns that give group characteristics; I will get into those in a minute.
In the case of the usual GROUP BY
, the function is equality. (More or less. I’ll get to that in a bit, too). Another possible function is modulus arithmetic. Taking MOD (<integer expression>, 2)
splits the results into odd and even groups, for example.
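Here is a quick sketch of that modulus trick, run through SQLite from Python (the Numbers table and its contents are invented for the demonstration; SQLite spells MOD as the % operator):

```python
import sqlite3

# A sketch of the modulus characteristic function; the Numbers table and
# its contents are hypothetical. SQLite spells MOD(n, 2) as n % 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Numbers (n INTEGER NOT NULL)")
conn.executemany("INSERT INTO Numbers (n) VALUES (?)",
                 [(i,) for i in range(1, 8)])

# The GROUP BY partitions the rows into two equivalence classes.
rows = conn.execute(
    "SELECT n % 2 AS parity, COUNT(*) AS cnt"
    "  FROM Numbers GROUP BY n % 2 ORDER BY parity"
).fetchall()
print(rows)  # [(0, 3), (1, 4)] -- three evens (2, 4, 6), four odds (1, 3, 5, 7)
```

The parity expression partitions the seven rows into exactly two disjoint groups, which is all an equivalence relation ever does.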
Because SQL is an orthogonal language, we can actually do some fancy tricks with the GROUP BY
clause. The term orthogonality refers to the property of a computer language which allows you to use any expression that returns a valid result anywhere in the language. Today, you take this property for granted in modern programming languages, but this was not always the case. The original FORTRAN
allowed only certain expressions to be used as array indexes (it has been too many decades, but some of the ones allowed were <integer constant>
, <integer variable>
, and <integer constant>*<integer variable>
). This was due to the operations allowed by the hardware registers of the early IBM machines upon which FORTRAN
was implemented.
NULL
values have always been problematic. One of the basic rules in SQL is that a NULL
value does not equal anything including another NULL
. This implies each row should either be excluded or form its own singleton group when you use equality as your characteristic function.
We discussed this problem in the original ANSI X3H2 Database Standards Committee. One member’s company had an SQL that grouped using strict equality. They ran into a problem with a customer’s database involving traffic tickets. If an automobile did not have a tag, then obviously the correct data model would have been to use a NULL
value. Unfortunately, in the real world, this meant every missing tag became its own group. This is not too workable in the state of California. A simple weekly report quickly became insanely long and actually hid information.
When an automobile was missing a tag, the convention had been to put in something a human being could read, and they picked “none” as the dummy value. Then along came somebody who got a prestige tag that read “NONE” to be cute. The system cheerfully dumped thousands of traffic tickets onto him as soon as his new tag got into the system. Other members had similar stories.
This led us to the equivalence relationship which I will call grouping. It acts just like equality for non-NULL
values, but it treats NULL
values as if they are all equal (the IS [NOT] DISTINCT FROM
infixed comparison operator did not exist at the time).
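A quick sketch of this grouping behavior, using SQLite from Python with an invented tickets table:

```python
import sqlite3

# A sketch of the "grouping" equivalence relation: GROUP BY treats all
# NULLs as one group, even though NULL = NULL is not TRUE. The table and
# tag values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tickets (tag_nbr VARCHAR(10))")
conn.executemany("INSERT INTO Tickets (tag_nbr) VALUES (?)",
                 [("ABC123",), ("ABC123",), (None,), (None,), (None,)])

groups = conn.execute(
    "SELECT tag_nbr, COUNT(*) FROM Tickets GROUP BY tag_nbr ORDER BY tag_nbr"
).fetchall()
print(groups)  # [(None, 3), ('ABC123', 2)] -- the missing tags form one group
```

Under strict equality the three NULL rows would have been three singleton groups; under grouping they collapse into one, which is exactly the behavior the committee settled on.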
The skeleton syntax of a simple SELECT
statement with a GROUP BY
clause is
SELECT <group column expression list>
  FROM <table expression>
 [WHERE <row search condition>]
 GROUP BY <column expression list>
 [HAVING <group search condition>];
Here are the basic characteristics of this construct:

- The GROUP BY clause can only be used with a SQL SELECT statement.
- The GROUP BY clause must come after the WHERE clause, if the query has one. (With no GROUP BY clause at all, an aggregate query treats the whole table as if it’s one group.)
- The GROUP BY clause must come before the ORDER BY clause, if the query has one.
- To filter the GROUP BY results, you must use the HAVING clause after the GROUP BY.
- The GROUP BY clause is often used in conjunction with aggregate functions.
- Every non-aggregated column in the SELECT clause should also appear in the GROUP BY clause, whether you have an aggregate function or not.

Note that using a GROUP BY
clause is meaningless if there are no duplicates in the columns you are grouping by.
The original GROUP BY
clause came with a set of simple descriptive statistics: the COUNT
, AVG
, MIN
, MAX
, and SUM
functions. Technically, they all have the option of a DISTINCT
or ALL
parameter qualifier. One of the oddities of SQL is that various constructs can have parentheses for parameters or lists, and these lists can have SQL keywords inside the parameters. Let’s look at these basic functions one at a time.
COUNT([ALL | DISTINCT] <expression>) | COUNT(*)
Returns an integer between zero and whatever the count of the rows in this group is. If the expression returns NULL
, it is ignored in the count. The ALL
option is redundant, and I’ve never seen anybody use it in the real world. The DISTINCT
option removes redundant duplicates before applying the function. The *
(asterisk) as the parameter applies only to the COUNT()
function, for obvious reasons. It is sort of a wildcard that stands for “generic row”, without regard to the columns that make up that row.
SUM([ALL | DISTINCT] <expression>)
This is a summation of a numeric value after the NULL
values have been removed from the set. Obviously, this applies only to numeric data. If you have looked at SQL, you probably know it has a lot of different kinds of numeric data. I strongly suggest taking a little time to see how your particular SQL product returns the result. The scale and precision may vary from product to product.
AVG()
This function is a version of the simple arithmetic mean. Obviously, this function would only apply to columns with numeric values. It sums all the non-NULL
values in a set, then divides that number by the number of non-NULL
values in that set to return the average value. It’s also wise to be careful about NULL
values; consider a situation where you have a table that models employee compensation. This compensation includes a base salary and bonuses, but only certain employees are eligible for bonuses. Unqualified employees have rows that show the bonus_amount
column value as NULL
. This lets us maintain the difference between an employee who is not qualified and somebody who just didn’t get any bonus ($0.00) in this paycheck. The query should look like this skeleton:
SELECT emp_id,
       SUM(salary_amount + COALESCE(bonus_amount, 0.00))
         AS total_compensation_amount
  FROM Paychecks
 GROUP BY emp_id;
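Here is a runnable version of that skeleton, using SQLite from Python with a few made-up paycheck rows:

```python
import sqlite3

# A runnable version of the compensation skeleton; the paycheck rows are
# invented. The COALESCE() keeps a NULL bonus from nulling out the sum.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Paychecks
                (emp_id INTEGER NOT NULL,
                 salary_amount DECIMAL(10,2) NOT NULL,
                 bonus_amount DECIMAL(10,2))""")
conn.executemany("INSERT INTO Paychecks VALUES (?, ?, ?)",
                 [(1, 1000.00, 250.00),   # qualified, got a bonus
                  (1, 1000.00, 0.00),     # qualified, no bonus this paycheck
                  (2, 1200.00, None)])    # not qualified: bonus_amount is NULL
totals = conn.execute(
    """SELECT emp_id,
              SUM(salary_amount + COALESCE(bonus_amount, 0.00))
                AS total_compensation_amount
         FROM Paychecks
        GROUP BY emp_id
        ORDER BY emp_id"""
).fetchall()
print(totals)  # [(1, 2250.0), (2, 1200.0)]
```

Without the COALESCE(), employee 2’s NULL bonus would simply vanish from the SUM(), and the distinction between “not qualified” and “got $0.00” would be invisible in the total.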
MIN() and MAX()
These functions are called extrema functions in mathematics. Since numeric, string, and temporal data types have an ordering to them, they can all be used with these two functions. Technically, you can put ALL
or DISTINCT
in the parameter list, which is somewhat absurd since neither changes the result.
The reason for picking this small set of descriptive statistics was the ease of implementation. They are still quite powerful, and there are a lot of cute tricks you can try with them. For example, if (MAX(x) = MIN(x)
) then we know the (x
) column has one and only one value in it. Likewise, (COUNT(*) = COUNT(x)
) tells us that column x
does not have any NULL
values. But another reason for this selection of aggregate functions is that some of these things are already being collected by cost-based optimizers. We essentially got them for free.
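Both tricks can be demonstrated in a few lines, again with SQLite from Python and invented data:

```python
import sqlite3

# The two tricks on a tiny invented table: x holds {5, 5, NULL}.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (x INTEGER)")
conn.executemany("INSERT INTO T (x) VALUES (?)", [(5,), (5,), (None,)])

one_value, no_nulls = conn.execute(
    "SELECT MAX(x) = MIN(x), COUNT(*) = COUNT(x) FROM T"
).fetchone()
print(one_value, no_nulls)  # 1 0 -- one distinct non-NULL value, but NULLs exist
```

The first comparison is true because both extrema are 5; the second is false because COUNT(*) sees three rows while COUNT(x) sees only the two non-NULL ones.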
The SQL:2006 Standard added some more descriptive aggregate functions to the language. I’ve actually never seen anybody use them, but they’re officially there. Various SQL products have added other functions on top of these, but let’s go ahead and make a short list of some of what’s officially there.
VAR_POP
This is also written as simply VAR()
or VARP()
. It’s a statistic called the population variance, which is defined as the sum of the squares of the differences between each value and the mean of the set, divided by the number of rows.
It is a statistical measurement of the spread between numbers in a data set. More specifically, variance measures how far each number in the set is from the mean, and thus from every other number in the set. We write it as σ² in mathematics. Variance tells you the degree of spread in your data set. The more spread the data, the larger the variance is in relation to the mean. You can see why this might be useful for an optimizer. We also added VAR_SAMP()
to compute the sample variance.
STDEV_POP
This is the population standard deviation. The standard deviation (or σ) is a measure of how dispersed the data is in relation to the mean. You can think of it as another way of getting the same information as the variance; it is simply the square root of the variance, so it is expressed in the same units as the data.
A standard deviation close to zero indicates that data points are close to the mean, whereas a large standard deviation indicates that data points are spread out over a wider range of values.
We also have another function defined for a sample, STDEV_SAMP. There are no surprises here, and you can see the “_SAMP” postfix is consistent.
There are other functions that were added to the standard to deal with regression and correlation. They are even more obscure and are almost never actually used in the real world.
Interestingly enough, the standards did not build in the mode (most frequently occurring value in a set) or the median (middle value of a set). The mode has a problem: there can be several of them. If a population has an equal number of several different values, you get a multi-modal distribution, and SQL functions do not like to return non-scalar values. The median has a similar sort of problem. If the set of values has an odd number of elements in it, then the median is pretty well defined as the value which sits dead center when you order the values. However, if you have an even number of values, then it gets a little harder. You have to take the average of the values in the middle of the ordering. Let’s say you have a column with values {1, 2, 2, 3, 3, 3}
. The values in the middle of this list are {2, 3}
, which average out to 2.5
. But if I take a weighted average of those middle values, using their frequencies as weights, I get (2×2 + 3×3)/5 = 13/5 = 2.6
instead. This second weighted average is actually more accurate because it shows the slight skew toward 3.
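You can check both computations in a few lines of Python:

```python
from statistics import median

values = [1, 2, 2, 3, 3, 3]

# Textbook median: average of the two middle values after ordering.
textbook = median(values)
print(textbook)  # 2.5

# Weighted version: weight each middle value by its frequency in the set.
weights = {2: values.count(2), 3: values.count(3)}
weighted = sum(v * w for v, w in weights.items()) / sum(weights.values())
print(weighted)  # 2.6 -- shows the slight skew toward 3
```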
OLAP, or “Online Analytical Processing”, became a fad in the mid-2010’s. Several products devoted to this sort of statistical works hit the market and SQL decided to add extensions that would allow it to be used in place of a dedicated package. This added more descriptive statistics to SQL.
Note: I would be very careful about using SQL as a statistical reporting language. We never intended it for this purpose, so it’s hard to guarantee if the corrections for floating-point rounding errors and other things that need to go into a good statistical package are going to be found in your SQL product.
GROUPING SETS
The grouping sets construct is the basis for CUBE
and ROLLUP
. You can think of it as shorthand for a series of UNION
queries that are common in reports. But since a SELECT
statement must return a table, we have to do some padding with generated NULL
values to keep the same number of columns in each row. For example:
SELECT dept_name, job_title, COUNT(*)
  FROM Personnel
 GROUP BY GROUPING SETS (dept_name, job_title);
This will give us a count for each department as a whole and a count for each job title across all departments. You can think of it as a shorthand for
SELECT dept_name, CAST(NULL AS VARCHAR(20)) AS job_title, COUNT(*)
  FROM Personnel
 GROUP BY dept_name
UNION ALL
SELECT CAST(NULL AS VARCHAR(20)) AS dept_name, job_title, COUNT(*)
  FROM Personnel
 GROUP BY job_title;  -- VARCHAR(20) here is just an assumed column type
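SQLite does not implement GROUPING SETS, which makes it a handy way to demonstrate the UNION ALL expansion directly (the departments, titles, and counts here are invented):

```python
import sqlite3

# SQLite has no GROUPING SETS, so this sketch runs the UNION ALL expansion
# directly. The Personnel rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Personnel (dept_name VARCHAR(20), job_title VARCHAR(20))")
conn.executemany("INSERT INTO Personnel VALUES (?, ?)",
                 [("Sales", "Clerk"), ("Sales", "Manager"), ("IT", "Clerk")])

rows = conn.execute(
    """SELECT dept_name, NULL AS job_title, COUNT(*) AS cnt
         FROM Personnel GROUP BY dept_name
       UNION ALL
       SELECT NULL, job_title, COUNT(*)
         FROM Personnel GROUP BY job_title"""
).fetchall()
for row in sorted(rows, key=repr):
    print(row)
# ('IT', None, 1)
# ('Sales', None, 2)
# (None, 'Clerk', 2)
# (None, 'Manager', 1)
```

Each arm of the UNION ALL pads the “missing” grouping column with a generated NULL so both result sets have the same shape.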
If you think about it for a minute, you’ll see there’s a little problem here. I don’t know if the NULL
values that I’ve created with my CAST()
function calls were in the original data or not. That’s why we have a GROUPING (<grouping column name>)
function to test for it. It returns zero if the NULL
was in the original data and one if it was generated, and therefore belongs to a subgroup.
For example,
SELECT CASE GROUPING (dept_name)
         WHEN 1 THEN 'Department Total'
         ELSE dept_name END AS dept_name,
       CASE GROUPING (job_title)
         WHEN 1 THEN 'Job Total'
         ELSE job_title END AS job_title,
       COUNT(*)
  FROM Personnel
 GROUP BY GROUPING SETS (dept_name, job_title);
I’m a little ashamed of this example because it shows me using SQL display formatting on a result. This violates the principle of a tiered architecture.
ROLLUP
The ROLLUP
subclause can be defined with the GROUPING SETS
construct, which is why I introduced GROUPING SETS
first. In reality, the ROLLUP
is our good old hierarchical report in a new suit. We used to call these breakpoint, or control and break, reports in the old days of sequential file processing. The report program set up a bunch of registers to keep running aggregates (usually totals). Every time you passed a control point in the input sequence of records, you dumped the registers, reset them after their calculations, and began again. Because of sequential processing, the lowest level in the hierarchy would print out first, then the next level of aggregation would appear, and so on until you got the grand totals.
Consider GROUP BY ROLLUP (state_code, county_name, emp_id)
as shorthand for
GROUP BY GROUPING SETS ((state_code, county_name, emp_id),
                        (state_code, county_name),
                        (state_code),
                        ())  -- entire table
Please notice that the order of those columns in the ROLLUP
clause is important. This will give you an aggregate for every employee within each county of each state, an aggregate for every county within each state, that same aggregate for each state and finally a grand aggregate for the entire data set (that is what the empty parentheses mean).
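Here is the same expansion written out and run against SQLite from Python, since SQLite lacks ROLLUP itself (the payroll table and its rows are invented):

```python
import sqlite3

# ROLLUP written out as its UNION ALL equivalent, since SQLite lacks
# ROLLUP itself. All names and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Payroll
                (state_code CHAR(2), county_name VARCHAR(20),
                 emp_id INTEGER, salary_amount INTEGER)""")
conn.executemany("INSERT INTO Payroll VALUES (?, ?, ?, ?)",
                 [("CA", "Alameda", 1, 100), ("CA", "Alameda", 2, 200),
                  ("CA", "Marin", 3, 300), ("NY", "Kings", 4, 400)])

rollup = conn.execute(
    """SELECT state_code, county_name, emp_id, SUM(salary_amount)
         FROM Payroll GROUP BY state_code, county_name, emp_id
       UNION ALL
       SELECT state_code, county_name, NULL, SUM(salary_amount)
         FROM Payroll GROUP BY state_code, county_name
       UNION ALL
       SELECT state_code, NULL, NULL, SUM(salary_amount)
         FROM Payroll GROUP BY state_code
       UNION ALL
       SELECT NULL, NULL, NULL, SUM(salary_amount) FROM Payroll"""
).fetchall()

# 4 employee rows + 3 county rows + 2 state rows + 1 grand total = 10 rows.
grand_total = [r for r in rollup if r[0] is None][0]
print(len(rollup), grand_total)  # 10 (None, None, None, 1000)
```

The all-NULL row at the bottom is the empty grouping set, the grand total over the whole table.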
CUBE
The cube supergroup is another SQL:1999 extension which is really an old friend with a new name. We used to call it “cross tabs”, which is short for cross tabulation. In short, it creates unique groups for all possible combinations of the columns you specify. For example, if you use GROUP BY CUBE (column1, column2)
on your table, SQL returns groups for all unique values of (column1, column2), (NULL, column2), (column1, NULL) and (NULL, NULL)
.
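A small Python sketch makes the “all possible combinations” claim concrete: CUBE on n columns produces 2^n grouping sets, one per subset of the column list:

```python
from itertools import combinations

# CUBE(column1, column2) generates one grouping set per subset of the
# column list -- 2^n of them for n columns. This just enumerates them.
columns = ["column1", "column2"]
grouping_sets = [combo
                 for k in range(len(columns), -1, -1)
                 for combo in combinations(columns, k)]
print(grouping_sets)
# [('column1', 'column2'), ('column1',), ('column2',), ()]
```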
This is just a skim over the options available in the GROUP BY
clause. Anytime you have a query that works on the data with an equivalence relation, there is a pretty good chance you will be able to do it using a GROUP BY
.
As a quick programming exercise, I recently saw a post on one of the SQL forums by a less-experienced programmer. He wanted to UPDATE
a flag column (Yes, flags are a bad idea in SQL) from 0 to 1, if any row in the group had a one. The code got fairly elaborate because he had to destroy data as he overwrote existing rows.
Can you write a simple piece of SQL that will give us this information, using a GROUP BY? It definitely is possible.
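One possible answer, as a sketch against SQLite from Python (the table and column names are invented; the point is that MAX() over the group returns 1 exactly when any row in the group has a one):

```python
import sqlite3

# One possible answer, sketched with invented names: MAX(flag) over each
# group is 1 exactly when some row in the group has a one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Widgets (grp_id INTEGER NOT NULL, flag INTEGER NOT NULL)")
conn.executemany("INSERT INTO Widgets VALUES (?, ?)",
                 [(1, 0), (1, 1), (2, 0), (2, 0)])

flags = conn.execute(
    "SELECT grp_id, MAX(flag) AS flag FROM Widgets GROUP BY grp_id ORDER BY grp_id"
).fetchall()
print(flags)  # [(1, 1), (2, 0)] -- group 1 had a one somewhere, group 2 did not
```

No rows need to be destroyed or overwritten; the GROUP BY computes the answer as a query.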
The post The GROUP BY Clause appeared first on Simple Talk.
]]>The post Getting Out of Character appeared first on Simple Talk.
]]>Younger programmers have grown up with ASCII and Unicode as the only ways of representing character data in a computer. But in the dark ages, before we invented dirt and entered into the stone age, there were other contenders.
The Control Data Corporation used a 6-bit binary scheme called Field Data. This was due to their hardware. Thanks to IBM’s dominance in the market at the time, punch cards were encoded with Hollerith and later their mainframes and mid-range computers used Extended Binary Coded Decimal Interchange Code (EBCDIC), an 8-bit scheme based off Hollerith. Unfortunately, there were three different versions of EBCDIC from IBM as well as some national code variations.
Back in the days of telegraphs, teletypes used Baudot (invented in the 1870’s) for their five channel paper tapes. (The inventor’s name is also where we get the term “baud” for transmission rates.) The coding scheme was improved a bit to become Murray code by 1901. One of the improvements was that the codes now included formatting codes, such as Carriage Return, Line Feed, Delete and so forth, which survived into the encoding schemes which came afterwards.
The debut of 7-bit ASCII in 1963 is where many programmers began learning programming. Mini-computers first used eight channel teletypes as their input device. Teletypes were relatively cheap and highly available back then.
Today, since computers have gotten smaller and less dominated by IBM, we use American Standard Code for Information Interchange (ASCII), which is a 7-bit character code. Each character is usually stored in one byte and the extra bit becomes a check digit or a way to shift to more ISO/IEC 8859 characters.
ISO/IEC 8859 was a joint ISO and IEC series of standards for 8-bit character encodings. The ISO working group for this standard has been disbanded; its job is now handled by Unicode. It was used for the accented letters in European languages that were based on the Latin alphabet. But coverage is not complete. For example, for compatibility, Dutch has been using “ij” as two letters in modern usage instead of one symbol. Likewise, in 2010, the Spanish Academy gave up “ch” and “ll” as separate letters for alphabetical order. (The term for these characters that tie two base characters together as one is a ligature.)
SQL added the datatypes NATIONAL CHARACTER
and NATIONAL CHARACTER VARYING
. Nobody ever uses their full names, so we have NCHAR(n)
and NVARCHAR(n)
in our declarations. When we originally made this part of the language, Unicode had not been created yet, so they depended on whatever your implementation defined them as. It was assumed that this would depend on a national standards organization already having representations for the special characters they need for their language. We did not imagine it becoming as generalized as Unicode.
The idea for Unicode began in 1988 and was formalized in 1991 in California. It was going to be a new character encoding under the name “Unicode”, sponsored by engineers at Xerox and Apple. The Unicode Consortium has more details on their website and today (2023 February) they are up to release 15.0.0 (and there is a 15.1.0 in draft mode).
Initially, academic or political groups that had an interest in obscure alphabets or writing systems made contributions to the character sets. Unicode 15.0.0 added 4,489 characters, for a total of 149,186 characters. These additions include 5 new scripts, for a total of 159 scripts, as well as 37 new emoji characters.
Today the big fights are about emoji symbols, and I have no idea why some of these characters are vital to data processing, but they are there!
Characters can be broadly grouped into major categories: Letter, Mark, Number, Punctuation, Symbol, Separator and Other. The names pretty well explain themselves but do have more detailed definitions within each category. For example, letters are ordered by uppercase, lowercase, ligatures (such as æ and œ in English and French) containing uppercase, ligatures containing lowercase, and finally the lowest sort order, an ideograph (like a symbol in Chinese, for example) or a letter in a unicase alphabet. (Unicase alphabets only have one case instead of an upper and lowercase.)
This settles the question about how to handle upper and lowercase letters. It used to be that some collations would put the lowercase version of a letter immediately after the uppercase version when setting up alphabetical order.
As a trivia question, can you name the alphabets which do have a case system? We have Latin, Greek, Cyrillic and Arabic; the last one always surprises people, but Arabic letters have an initial form, a middle form (remember, it is a connected script), a terminal form and finally a standalone form.
Punctuation includes the underscore and dashes. The fun part comes with all the ways to make brackets. Brackets are considered part of punctuation, not math symbols.
Numbers include vulgar fractions (one number placed above another and a fraction bar, also known as a slash, like ½) and Roman numerals but do not include algebraic symbols. We also have problems with quote marks; do you do a separate open and close quote mark? Which pair do you use? Some Slavic languages prefer >> << and others use << >>.
This is just the beginning and it gets more complicated. For actual text, we still have a lot of legacy encoding systems in the typography business. Early adopters tended to use UCS-2 (the fixed-width two-byte precursor to UTF-16) and later moved to UTF-16 (the variable-width current standard), as this was the least disruptive way. The most widely used such system is Windows NT (and its descendants, 2000, XP, Vista, 7, 8, 10, and 11). Windows uses UTF-16 as the sole internal character encoding. The Java and .NET bytecode environments, macOS, and KDE also use it for internal representation. Partial support for Unicode can be installed on Windows 9x through the Microsoft Layer for Unicode.
For database people, the most important characteristic of Unicode is that all languages will support the simple Latin alphabet, digits, and a limited set of punctuation. The reason this is important is that ISO abbreviations for most units of measure and other standards use only these characters. This lets you insert them into the middle of the weirdest, most convoluted looking alphabets and symbol systems currently on earth, as well as mixing them in with emojis.
This also is a key concept if you ever must design an encoding system, don’t get fancy. Keep your encoding schemes in simple Latin letters and digits and use the “fancy stuff” for text.
If you are old, you might remember a thing called a typewriter. It uses physical type to make images on paper by physically striking an inked ribbon. When we wanted a special character, we had to physically type the characters by either adding extra type elements to the machine (look at the IBM Selectric typewriter “golf balls” with special characters or before that, the Hammond multiplex) or use the backspace on these machines to create a single symbol.
Unicode came up with something like this. Despite the incredibly large range of symbols available to you, you can combine various diacritical marks and letters on top of each other. Without going into much detail, the Unicode standard gives four kinds of normalization. This means that I can put together a string of Unicode characters, run them through an algorithm, and if they are equivalent, then I can reduce them to one single display character or, at worst, reduce them to a unique string of characters.
As an example, Å can be written in three ways in Unicode. The first is to actually use the Å symbol as a letter of the Swedish alphabet (is your data Swedish?); the second is to use the symbol as a unit of measure (the angstrom is a unit of length that equals 0.1 nanometer); the third is to build it from pieces (an uppercase A with a small circle accent above it, using zero spacing between the two Unicode characters).
All three look the same when printed, but the first two have completely different meanings and the third is ambiguous, since it might be an attempt to go either of the first two ways.
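You can watch this happen with Python’s unicodedata module; the three spellings compare as unequal strings until they are normalized:

```python
import unicodedata

# The three spellings of Å described above.
letter   = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"   # ANGSTROM SIGN
built    = "A\u030A"  # A followed by COMBINING RING ABOVE

# They print identically but compare as unequal strings...
print(letter == angstrom, letter == built)  # False False

# ...until they are normalized; NFC composes each into the single letter.
nfc = [unicodedata.normalize("NFC", s) for s in (letter, angstrom, built)]
print(nfc[0] == nfc[1] == nfc[2])  # True
```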
The Unicode Standard defines two formal types of equivalence between characters when doing comparisons: canonical equivalence and compatibility equivalence.
For canonical equivalence, let’s consider Vietnamese on a menu at a restaurant (no, I do not read or speak Vietnamese). Their script was constructed for them by French missionaries, which is why it has a Latin alphabet as its basis. An individual letter can have several diacritical marks on it (marks above other characters that change the sound, such as the accent mark in fiancé), and the order that those marks are placed on the base letter don’t really matter; at the end of the construction, we get a single character regardless of the order in which it was constructed. You might have noticed, a pen has a much larger character set than your computer.
Korean, or Hangul as it is more properly called, is actually arranged from phonetic pieces in two dimensions to build syllables that are seen as single characters. The placement and the shape of the phonetic parts follow strict rules of organization. For example, in Hangul 가 is actually built from ᄀ +ᅡ. The leftmost unit changes shape. These two versions of 가 are considered canonically equivalent.
You can also pull a character apart. At one point, the use of ligatures was common in typesetting. For example, the “fi” ligature changes into its components “f” and “i”, and likewise for “ffl” , “ffi”, “st” and “ct” ligatures that were used mostly in the 1800s.
Compatibility equivalence is less strict and deals with characters that are close. For example, the “ﬁ” ligature and the two-letter sequence “fi” are not formally equal, but they are compatible with each other.
But wait! It gets even worse with mathematical notation. Just consider fractions; Is “1/4” the same thing as “¼”? The standard considers them compatible, but how many different ways are there to write logical operators? Comparison operators? Various mathematical societies have different typesetting standards, and it can get as bad as any written language.
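Python’s unicodedata module shows the difference between the two kinds of equivalence: canonical (NFC) normalization leaves the ligature and the vulgar fraction alone, while compatibility (NFKC) normalization folds them into their plainer equivalents:

```python
import unicodedata

ligature = "\uFB01"  # LATIN SMALL LIGATURE FI
fraction = "\u00BC"  # VULGAR FRACTION ONE QUARTER

# Canonical normalization (NFC) leaves both characters alone...
nfc_lig = unicodedata.normalize("NFC", ligature)
print(nfc_lig == ligature)  # True

# ...while compatibility normalization (NFKC) folds them into their
# plainer equivalents.
nfkc_lig = unicodedata.normalize("NFKC", ligature)
nfkc_frac = unicodedata.normalize("NFKC", fraction)
print(nfkc_lig)   # fi -- two ordinary letters
print(nfkc_frac)  # 1⁄4 -- digit, FRACTION SLASH (U+2044), digit
```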
For more details on this topic, check out this page on the Unicode.org website.
Unicode Normalization Forms are formally defined normalizations of Unicode strings which make it possible to determine whether any two Unicode strings are semantically equivalent to each other. The goal of the Unicode Normalization Algorithm, for database people, is to guarantee that comparison operators and join operators, as well as SQL clauses like ORDER BY and GROUP BY, will work without surprising users.
Officially, once you get into SQL, you can use statements to modify the collations and character sets involved. I seriously doubt that many of you will ever use this in your career, as you are probably better off using whatever default you get with your SQL product. However, the more you understand what goes into determining how collations are created, the better. So, for the record, here are the statements, starting with the CREATE COLLATION
statement that some RDBMS’s provide (others, like SQL Server, simply provide you thousands of choices).
Where do accented letters fit in the collation? Some languages put their accented letters at the end of the alphabet, and some put each letter after its unaccented form. When German was doing its spelling reforms, there was a big debate about whether the three umlauted letters were separate letters at the end of the alphabet or just different forms of the base letters.
Esperanto puts its accented letters immediately after the unaccented form: “a, b, c, ĉ, d, …”. However, since the circumflex was not always available, Esperanto also has the convention of using the combination of a base letter followed by x: “a, b, c, cx, d, …”, since the letter x is not used in the Esperanto alphabet.
Officially, in some SQL implementations you can change all of this at the database level and override the Unicode conventions. If you want to further mess up local language settings, you can also use a
CREATE CHARACTER SET <character set name>
  AS <character set source> [<collation clause>]

<character set source> ::= GET <character set specification>
Likewise, there is a CREATE COLLATION
statement in the standard.
CREATE COLLATION <collation name>
  FOR <character set specification>
  FROM <existing collation name>
  [<pad characteristic>]

<pad characteristic> ::= NO PAD | PAD SPACE
The pad characteristic has to do with how strings are compared to each other. This is based on SQL versus the xBase language conventions. NO PAD
follows the xBase convention of truncating the longer string before doing the comparison. The PAD SPACE
option pads the shorter string with spaces and then begins comparing the strings character by character from left to right. This is the default in SQL.
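The two rules can be sketched in a few lines of Python, following the descriptions above (these are illustrations, not any product’s actual collation code):

```python
# Sketches of the two pad rules as described above; these are
# illustrations, not any product's actual collation code.
def compare_no_pad(a: str, b: str) -> bool:
    # xBase convention: truncate the longer string, then compare.
    n = min(len(a), len(b))
    return a[:n] == b[:n]

def compare_pad_space(a: str, b: str) -> bool:
    # SQL default: pad the shorter string with spaces, then compare
    # character by character (which is just string equality here).
    n = max(len(a), len(b))
    return a.ljust(n) == b.ljust(n)

print(compare_no_pad("Smith", "Smithsonian"))     # True -- truncation hides the tail
print(compare_pad_space("Smith", "Smithsonian"))  # False
print(compare_pad_space("Smith", "Smith  "))      # True -- trailing spaces pad out
```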
Normalization
Unicode Normalization Forms are formally defined normalizations of Unicode strings which make it possible to determine whether any two Unicode strings are equivalent to each other. Depending on the particular Unicode Normalization Form, that equivalence is either canonical or compatibility equivalence. The goal of the Unicode Normalization Algorithm, for database people, is to guarantee that SQL clauses like ORDER BY
and GROUP BY
will actually work without surprising us.
The two most common emoticon combinations were probably “:-)” and “:-(”. These conventions existed in the West and assumed that the reader would turn the line 90° to read it. Japanese users at the time laid things out horizontally. Emojis evolved from these text-based character expressions, but are actually cartoon figures.
There currently are over 800 different emojis and these are sent over 6 billion times a day through Facebook. According to Google, over 90 percent of the world’s population uses emojis and the most popular emoji employed on both Facebook and Twitter is the ‘laugh cry’ face. This makes me fear the literacy level among computer people.
There is a safety rule in life that says: “do not get tattoos in a language in which you are not fluent.” This rule also applies to databases and text documents. Many years ago, a Canadian Sunday newspaper ran an article on Chinese tattoos that non-Chinese speakers got. What they actually said were things like “this cheap bastard does not tip the tattoo artist” and worse. Perhaps you should find someone who does speak the language fluently and ask them to check what’s going into your database. Even if it isn’t obscene or absurd, it still may not be what you meant to put into the database.
Try to keep things as simple as possible and use a minimal character set. You will want to move your data from one platform to another. The days of a one vendor shop have been over for decades.
Remember that SQL is based on a tiered architecture. That means you don’t know when a new kind of presentation layer is going to be added to your database. In a serious application, the worst thing you can do is write SQL that is totally dependent on one release of one product from one vendor. Clean, simple, and portable are good.
]]>The post The VALUES clause or building tables out of nothing appeared first on Simple Talk.
]]>The VALUES
clause is probably one of the most misused features in SQL. If you look at SQL forums online, you’ll see people use it as the second clause in an insertion statement, but they only use it to construct a single row at a time, thus:
BEGIN
INSERT INTO Zodiac (astro_sign, astro_start_date, astro_end_date)
VALUES ('Aries', '2022-03-21', '2022-04-19');

INSERT INTO Zodiac (astro_sign, astro_start_date, astro_end_date)
VALUES ('Taurus', '2022-04-20', '2022-05-20');

…

INSERT INTO Zodiac (astro_sign, astro_start_date, astro_end_date)
VALUES ('Pisces', '2023-02-19', '2023-03-20');
END;
Each insertion statement ends with a semi-colon, so they will be executed separately and in the order presented. An optimizer doesn’t dare combine them because there might be a forward reference to previous insertions.
I think people write this kind of code because this is how punch cards were read. Each card went into a card reader, got buffered, and was written in the order presented to the magnetic tape or disk file. Welcome to 1960! Stop mimicking old programming languages like FORTRAN or BASIC that had WRITE statements and put one record at a time into a file. Start thinking of working with entire sets.
The VALUES
clause is more appropriately called a table constructor. Each row constructor within the table is a comma-separated list enclosed in parentheses. Officially, there is an optional keyword ROW
that can be placed at the start of each list. Nobody does this, and it is a bit redundant, but it is required in MySQL.
One of the worst ways of constructing a table is to use the CREATE
or DECLARE
construct to build a temporary table, load it with insertion statements, and finally insert the table into the desired destination. This leads to multiple statements with no way to really optimize the insertion and shows that you are really not thinking in sets yet.
The entire zodiac can be inserted with a single statement like this:
INSERT INTO Zodiac (astro_sign, astro_start_date, astro_end_date)
VALUES ('Aries', '2022-03-21', '2022-04-19'),
       ('Taurus', '2022-04-20', '2022-05-20'),
       ('Gemini', '2022-05-21', '2022-06-21'),
       ('Cancer', '2022-06-22', '2022-07-22'),
       ('Leo', '2022-07-23', '2022-08-22'),
       ('Virgo', '2022-08-23', '2022-09-22'),
       ('Libra', '2022-09-23', '2022-10-23'),
       ('Scorpius', '2022-10-24', '2022-11-21'),
       ('Sagittarius', '2022-11-22', '2022-12-21'),
       ('Capricorn', '2022-12-22', '2023-01-19'),
       ('Aquarius', '2023-01-20', '2023-02-18'),
       ('Pisces', '2023-02-19', '2023-03-20');
Given a whole set of rows, the optimizer can deal with a single atomic statement. Not only does it save execution time as compared to the row-at-a-time model of insertion, but it presents the optimizer with an opportunity to improve things. The insertion statement can rearrange the list of new rows and pick an optimal ordering. It also means that if one of my rows had an error in it, I wouldn’t have to back out all of the other rows. If I wanted a proper ACID transaction model, I would’ve had to back out each individual insert up until I came to the insertion that gave me the error.
Here’s the basic syntax. Please note that besides including an expression of the proper data type, you can use the keywords DEFAULT
or NULL
in a row constructor. Obviously, those values must make sense in relation to the declaration of the table into which you are inserting.
VALUES (<row value expression list>) [ , ...n ] <row value expression list> ::= {<row value expression> } [ , ...n ] <row value expression> ::= { DEFAULT | NULL | <expression> }
Please remember that an expression is not always a simple constant. In fact, it’s very handy to use the CAST (<exp> AS <data type>)
function as a way to assure that a column in the constructed virtual table has a known data type:
VALUES (CAST('foobar' AS NVARCHAR(10)), CAST(42 AS INTEGER), CAST(3.14159 AS REAL))
The AS
keyword can also be used to give each constructed table a name. Here is a skeleton:
SELECT X.a, X.b, X.c, Y.a, Y.b, Y.c FROM (VALUES (1,2,3), (4,5,6)) AS X(a, b, c), (VALUES (1,2,3), (4,5,6)) AS Y(a, b, c) WHERE X.a = Y.a AND X.b = Y.b AND X.c <= 0.0;
The MERGE
statement was added to Standard SQL several years ago. It was based on a proposal by ANSI representatives from Oracle and IBM, but forms of it had already existed in other products, though under a different name. The most common one was UPSERT
from Postgres. Let’s jump right into it.
One table expression is the target, the table you are trying to modify. The other table expression is the source, the table that provides the modifications. Presumably, you want the target to persist, but you don’t need the source to persist after the updates and insertions are done.
The MERGE
clause defines the target; the USING
clause defines the source. And the ON
clause matches the two tables. The WHEN [NOT] MATCHED ...THEN
clauses determine the action to be taken:
MERGE INTO Sales.Sales_Reasons AS Target USING (VALUES ('Recommendation', 'Other'), ('Review', 'Marketing'), ('Internet', 'Promotion')) AS Source (new_sales_name, new_reason_type) ON Target.sales_name = Source.new_sales_name WHEN MATCHED THEN UPDATE SET reason_type = Source.new_reason_type WHEN NOT MATCHED THEN INSERT (sales_name, reason_type) VALUES (new_sales_name, new_reason_type);
It’s easier to think of the MERGE
statement as a program written as a single statement, instead of having IF–THEN–ELSE
logic or CASE
expressions in multiple statements.
Obviously, updating makes sense only when there is a match, and inserting makes sense only when there is not a match. The standards allow for either of these clauses to include an optional … AND <search condition>
, so you can add quite a bit of logic to this one statement. Technically, the WHEN clauses list can finish with the ELSE
IGNORE
; it acts as a placeholder just as the ELSE
clause did in the CASE
expression. Microsoft has more extensions to the syntax, and there have been some performance issues. If you are using it in SQL Server, I strongly suggest checking both the syntax and current performance in whichever version of SQL Server you’re running.
The VALUES
clause is ANSI standard and implemented by many relational database vendors. The VALUES
clause can save typing, and it’s also the rare case when easier can mean better performance because the rows will be treated as a set.
The post The VALUES clause or building tables out of nothing appeared first on Simple Talk.
]]>The post Combinations, permutations, and derangements appeared first on Simple Talk.
]]>The factorial function is usually written as (n)!
, and it is defined as the product of the first (n)
natural numbers. Thus, 5!
evaluates to (5 · 4 · 3 · 2 · 1) = 120
. As usual, zero becomes a special case, 0! = 1
which can be proven with a slightly different derivation of a factorial. Instead of defining it as a product, define it recursively, like this: n! = CASE WHEN n = 0 THEN 1 ELSE n · (n-1)! END.
Showing the process one step at a time, the recursion unrolls like this:
5! = 5 · 4!
4! = 4 · 3!
3! = 3 · 2!
2! = 2 · 1!
1! = 1 · 0!
Look at the last step of the recursion. Now divide both sides by one to get (1! / 1) = 0!, or 1 = 0!.
Notice that everything done so far is procedural, not set-oriented. RDBMS folks prefer to get away from procedural code. For the rest of this article, you can think of n!
as the number of ways to arrange (n)
elements of a set into a sequence. Clearly, if you have one element, then you have only one arrangement. But likewise, if you have zero elements, you are also done. Just as you have only one empty set, so you also have only one empty sequence.
Yes, it is possible to define factorials for negative numbers, imaginary numbers, the gamma function for real numbers, and other things. Unless you’re a math major, you will have absolutely no use for any of these fancy tricks. Since SQL is a database language and not a computational one, you might want to use table lookup and not this recursive definition. You can find a table of the factorials of the numbers one to one hundred to populate the table. The factorial function gets big fast!
0! = 1
1! = 1
2! = 2
3! = 6
4! = 24
5! = 120
6! = 720
7! = 5,040
8! = 40,320
9! = 362,880
10! = 3,628,800
11! = 39,916,800
12! = 479,001,600
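As a cross-check on the table above, here is a minimal Python sketch of the recursive definition, plus the lookup-table approach the text recommends (the names `factorial` and `factorial_table` are mine, purely for illustration):

```python
def factorial(n: int) -> int:
    # Recursive definition: 0! = 1, otherwise n! = n * (n-1)!
    return 1 if n == 0 else n * factorial(n - 1)

# In SQL you would precompute these once into a lookup table and
# join against it; here is the same idea as a Python dictionary.
factorial_table = {n: factorial(n) for n in range(13)}
```

Printing `factorial_table` reproduces the list of values above, through 12! = 479,001,600.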
Combinations assume there is a set of (n)
distinct elements, and you want to get the number of subsets you can pull from this set. There is no concern with exactly what elements go in a particular subset, just the count. The order is not important.
One common notation for this, though not the only notation, is nCr
read as “a set of (n) things, choose (r) of them.” Notice that (0 ≤ r ≤ n)
, and that you can choose an empty set. Given a set, the number of subsets in it is 2^n. For example, {a, b, c} has subsets {}, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c} for a total count of 8.
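The count of r-element subsets is given by the formula nCr = n!/(r!(n-r)!), and summing it over every r from 0 to n gives the total number of subsets. A small Python sketch (the variable names are mine) confirms the {a, b, c} example:

```python
from math import comb

n = 3
# comb(n, r) = n! / (r! * (n - r)!), the number of r-element subsets.
counts = [comb(n, r) for r in range(n + 1)]

# Summing over every subset size gives 2**n subsets in total,
# matching the {a, b, c} example: 1 + 3 + 3 + 1 = 8.
total_subsets = sum(counts)
```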
I’ve never understood the term ‘combination locks’ when the order of the numbers that you put into it is very important. Perhaps traditional mechanical security features were not designed by mathematicians.
Permutations also assume there is a set of (n)
distinct elements, and the goal is to count the sequences you can form from this set. In this case, the order does matter; (a, b, c) is not the same permutation as (c, b, a).
To be honest, I just think that the term ‘derangement’ is really cool. Given two sequences of the same length and with the same data elements, one is a derangement of the other when the data elements do not appear in the same sequence position in the two sequences. For example, (a, b, c) is not a derangement of (c, b, a) because the data element ‘b’ is in the second position in both sequences. Instead, there are two derangements, (c, a, b) and (b, c, a).
A derangement can also be called a permutation with no fixed points. The two notations for derangement of (n)
elements are either !(n)
or D(n)
. I find the first one with the leading exclamation point to be a bad choice. It looks too much like a factorial.
The number of derangements of a set with (n)
distinct elements is given by the recursive formula : D(n) = (n-1)[D(n-1) + D(n-2)]
. If you know D(1) = 0
and D(2) = 1
, you can generate subsequent values for D(n)
. Or perhaps you would prefer D(n) = n·D(n-1) + (-1)^n?
Obviously, the derangements will be fewer in number than the original permutation.
Derangements of n
Elements
D(n)
D(0) = 1
D(1) = 0
D(2) = 1
D(3) = 2
D(4) = 9
D(5) = 44
D(6) = 265
D(7) = 1,854
D(8) = 14,833
D(9) = 133,496
D(10) = 1,334,961
D(11) = 14,684,570
D(12) = 176,214,841
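A short Python sketch of the first recurrence (the function name is mine) makes it easy to verify values such as D(5) = 44 or D(12) = 176,214,841:

```python
def derangements(n: int) -> int:
    # D(n) = (n - 1) * (D(n-1) + D(n-2)), with D(0) = 1 and D(1) = 0.
    d = [1, 0]
    for k in range(2, n + 1):
        d.append((k - 1) * (d[k - 1] + d[k - 2]))
    return d[n]
```

The alternate formula D(n) = n·D(n-1) + (-1)^n gives the same values, which is a handy consistency check.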
Derangements provide a way to make sure that everyone gets a fair shake when making job assignments. Nobody is stuck in the same job from one assignment cycle to the next. The idea is that eventually, all your personnel would be assigned to every job, but you don’t necessarily know in which assignment cycle any employee will get a particular job.
My favorite use for derangement is when an office does a “Secret Santa” Christmas gift program. If you have never had one of these, the idea is that everyone brings a wrapped gift to work. The gift-giver then picks the name of another employee randomly, and the second employee becomes the recipient. The first rule is that you don’t give a gift to yourself. A second rule is that you’re not supposed to be able to figure out who is the gift giver.
Gift givers are the set, and the recipients are supposed to be an unpredictable derangement. If you try doing this by drawing one name at a time, you very quickly wind up in a situation where only a few possible derangements are left. So much for maintaining secrecy. To get an idea of how this works, look at the prior example, with only three data elements in the original permutation. You really need to pick your derangement all at once.
A cute trick for doing this in the real world is to put everybody’s name on a card twice, once on the top and once on the bottom. Shuffle the card deck. Cut the cards in half, leaving the top and bottom halves paired. Take the top half-card from the top halves deck and put it on the bottom of the same top halves deck. Now draw the pairs of cards from the decks, assembling the tops and bottoms to make a new card. The result will be a new card deck with an unpredictable derangement.
While all these are interesting, a database person is probably going to be more interested in actually generating the permutations, combinations, and derangements. The problem is that SQL is not really built for this kind of computation. Sets in databases are in the form of tables, which all have a fixed number of columns. Databases don’t have arrays, linked lists, or other data structures, so these structures have to be faked with tables.
If all the columns are the same data element, then this is an example of de-normalizing a table with a repeated group. Here is a simple example for (n = 3):
CREATE TABLE #Permutations (c1 INTEGER NOT NULL, c2 INTEGER NOT NULL, c3 INTEGER NOT NULL, PRIMARY KEY (c1, c2, c3), CONSTRAINT Unique_columns CHECK ( c1 NOT IN (c2, c3) AND c2 NOT IN (c1, c3) AND c3 NOT IN (c1, c2) ));
Remember that all tables must have a key. In these cases, the rows that represent a particular set all have to be unique, and so do all of the columns in that row. This means each row is a key.
The CHECK()
constraint ensures that all columns are unique within a row. A lot of the ranges of the dimensions, constraints about membership, etc., in other data structures must be explicitly defined in SQL check constraints. This is one of the many reasons that I tell newbies that 80 to 90% of the work in SQL is done in the DDL, not the DML. Think about trying to write constraints into procedural code for 500 application programs; instead of putting it in one CHECK()
constraint, you must duplicate the same code and magically hope that you get it right in every procedural chunk in your system. And when the rules change, you must go back through those 500 pieces of procedural code and update them. Lots of luck with that.
Modeling combinations is easy. Since order doesn’t matter, any simple list of the values will do. That’s a perfect description of a table with one column.
There are several algorithms for generating permutations. Two of the best known are the Heap algorithm (1963) and the Fike algorithm. Both are recursive and are based on the facts that
1) There are n! permutations of the set {1, 2, 3, …, n}. This lets you know how many times you have to cycle through a loop or how deep your recursion has to go.
2) The next permutation can be generated from the current permutation without fear of duplication.
Robert Sedgewick (whom you might know from his widely used textbooks) wrote a paper on the various methods for generating permutations. He classified them as:
1 METHODS BASED ON EXCHANGES
Recursive methods
Adjacent exchanges
Factorial counting
“Loopless” algorithms
2 OTHER TYPES OF ALGORITHMS
Nested cycling
Lexicographic algorithms
Random permutation
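For the curious, here is a minimal Python sketch of Heap's algorithm as it is usually presented (not Sedgewick's exact code); each successive permutation differs from the previous one by a single exchange:

```python
def heap_permutations(elements):
    """Heap's algorithm: visit all n! orderings of a list,
    producing each successive permutation by one swap."""
    a = list(elements)
    results = []

    def generate(k):
        if k <= 1:
            results.append(tuple(a))
            return
        for i in range(k - 1):
            generate(k - 1)
            # Even k: swap a rotating position with the last slot;
            # odd k: always swap the first element with the last.
            if k % 2 == 0:
                a[i], a[k - 1] = a[k - 1], a[i]
            else:
                a[0], a[k - 1] = a[k - 1], a[0]
        generate(k - 1)

    generate(len(a))
    return results
```

Filtering this output against the original ordering would give you the derangements, though there are more direct methods.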
https://en.wikipedia.org/wiki/Heap%27s_algorithm
https://academic.oup.com/comjnl/article/19/2/156/408726. The Computer Journal, vol 19 Issue 2. pages 156-159
https://homepage.math.uiowa.edu/~goodman/22m150.dir/2007/Permutation%20Generation%20Methods.pdf
The post Combinations, permutations, and derangements appeared first on Simple Talk.
]]>The post The problem with averages appeared first on Simple Talk.
]]>AVG(<expression>)
function. The problem is, while it is simple to compute, it has all kinds of problems.
The average of a set of data values is supposed to be the “central value” of the set. Mathematically, it is defined as the ratio of the sum of all the data to the number of units in the set. In terms of statistics, the average of a given set of numerical data is also called the arithmetic mean. For example, the average of 2, 3 and 4 is (2+3+4)/3 = 9/3 = 3. So here 3 is the central value of this set.
But there are problems. The first question you have to ask is, does it make any sense to add up the values in your data set? What if my data set is the colors of bikinis sold during the 2021 swimsuit season? The question makes no sense because color is a discrete, nominal variable and doing math on colors is absurd. One of my favorite T-shirts that illustrates this problem beautifully reads “on a scale from 1 to 10, what color is your favorite letter of the alphabet?” It illustrates how some scales just don’t work for some data sets. However, I might be able to ask, in a meaningful way, what the average size of the swimwear sold in that swimsuit season was. This assumes I have a scale for the sizes on which pretty much continuous numeric values make sense. I am looking for an absolute scale, an interval scale, or a ratio scale for my data.
Oops! This doesn’t quite work out either. The complete bikini consists of a top and a bottom, which might be bought as separates today. This is my “unit of work” which I have to assemble before I can do any sort of statistics on it. I am going to assume there is probably a correlation among the separates, but it is not perfect, which is why they are sold as separates. The classic example of this aggregation fallacy is when you see that the most common first name on earth is “Mohamed” and the most common last name is “Wang”, and thus conclude that the most common name on earth must be “Mohamed Wang” based on your data.
The next obvious question is, “is there actually any tendency toward only one central value?” Your data could be random garbage, and thus the average it produces will also be random garbage. Socialist countries where everyone is either very, very rich (the ruling elite) or very, very poor (the rest of the country) will have no actual average people in the population!
Another problem is that averages are strongly influenced by outliers. For example, Elon Musk recently moved to Austin, Texas. The net worth of the “average citizen of Austin” just jumped up substantially. However, I noticed no increase in my personal net worth, and neither did anyone else, with the exception of one or two real estate agents.
It could get even worse. A severe distribution of data could be a multi-hump camel, where the data is clustered into distinct, separate groups. Is the average college athlete a large varsity football player, a golfer, or a member of the girls’ swim team? Where did you take your sample?
This leads us to a thing called Simpson’s Paradox, which has nothing to do with Homer Simpson. This is when a data set aggregates to a whole with one trend, but the components that went into it show the opposite trend. Consider the US median wage decline. From 2000 to 2013, the median US wage rose about 1%, adjusted for inflation. However, over the same period, the median wage for high school dropouts, high school graduates with no college education, people with some college education, and people with Bachelor’s or higher degrees all decreased.
In other words, in every educational subgroup the median wage was lower in 2013 than it was in 2000. How can both things be true? The workforce changed over those 13 years: there are now many more college graduates (who get higher-paying jobs) than there were in 2000, but wages for college graduates collectively have fallen at a much slower rate (down 1.2%) than for those of lower educational attainment whose wages have fallen precipitously, down 7.9% for high school dropouts. The growth in the proportion of college graduates swamps the wage decline for specific groups.
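The arithmetic behind the paradox is easy to reproduce. The figures below are invented for illustration (not the actual wage data), and a weighted mean stands in for the median, but the effect is the same: every subgroup falls while the blended average rises, because the mix shifts toward the better-paid group:

```python
# Invented numbers for illustration only: (wage, headcount) per group.
year_2000 = {"no_degree": (30, 70), "degree": (60, 30)}
year_2013 = {"no_degree": (28, 40), "degree": (59, 60)}

def weighted_avg(groups):
    # Overall average wage = total payroll / total workers.
    total_pay = sum(wage * n for wage, n in groups.values())
    workers = sum(n for _, n in groups.values())
    return total_pay / workers

# Both subgroups' wages fell (30 -> 28 and 60 -> 59), yet the
# overall average rose from 39.0 to 46.6, because the workforce
# shifted toward the better-paid group.
```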
The geometric mean is technically defined as “the n-th root product of n numbers” and it is used when working with percentages, which are derived from values, while the standard arithmetic mean works with the values themselves. This calculation considers the effects of compounding.
The geometric mean is the average value which signifies the central tendency of the set of numbers by taking the root of the product of their values. Basically, we multiply the (n) values together and take the n-th root of the result, where (n) is the total number of values. For example: for a given set of two numbers such as 8 and 1, the geometric mean is equal to √(8×1) = √8 = 2√2
.
In general, given the set of observations { x₁, x₂, ..., xₙ}
, the formula to calculate the geometric mean is:
n√ (x₁ · x₂ · ... · xₙ)
This calculation isn’t as formidable as it looks if you remember logarithms.
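A minimal Python sketch (the function name is mine) of the logarithm trick: take the arithmetic mean of the logs, then exponentiate, which also avoids overflowing the intermediate product:

```python
import math

def geometric_mean(values):
    # exp of the mean of the logs equals the n-th root of the product,
    # but without overflowing the intermediate product.
    return math.exp(sum(math.log(v) for v in values) / len(values))

# The article's example: the geometric mean of 8 and 1 is sqrt(8) = 2*sqrt(2).
```

Python 3.8 and later ship the same calculation as `statistics.geometric_mean`.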
The geometric mean has advantages over the arithmetic mean for particular applications. It is used in stock indexes because many of the value line indexes which are used by financial departments make use of the geometric mean. For example, it’s used to calculate the annual return on the investment portfolio, in finance to find the average growth rates which are also known as the compounded annual growth rate (CAGR) and in biological studies like cell division and bacterial growth rate. In general, look for something with a growth rate.
1) The geometric mean for a given data set is always less than or equal to the arithmetic mean for the data set.
2) If each value in the data set is substituted by the geometric mean, then the product of the values remains unchanged.
3) The ratio of the corresponding observations of the geometric mean in two series is equal to the ratio of their geometric means.
4) The products of the corresponding items of the geometric mean in the two series are equal to the product of their geometric mean.
5) The geometric mean is less influenced by outliers than the arithmetic mean.
6) You can use an online Geometric Mean Calculator or use the GEOMEAN function in Excel.
This is another one of what are called the Pythagorean means. They get this name because originally, they were defined geometrically by constructions. The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the observations. It has some nice properties. The most common examples of ratios are those of speed and time, work and time, etc.
What is the Definition of the Term Harmonic Mean?
The Harmonic Mean gives less weight to the larger values and more weight to the smaller values to balance the values properly. The harmonic mean is often used to calculate the average of the ratios or rates of the given values because it equalizes the weights of each data point and avoids problems with outliers.
Since the harmonic mean is the reciprocal of the average of reciprocals, the formula to define the harmonic mean is simple, but with some caveats:
Given the data set {x1, x2, x3,…, xn}
Harmonic Mean(H) = n / ((1/x_{1})+(1/x_{2})+(1/x_{3})+…+(1/x_{n}))
Given data: {8, 9, 6, 11, 10, 5}
Harmonic mean = 6/((⅛)+(1/9)+(⅙)+(1/11)+(1/10)+(⅕))
= 6/0.7937
= 7.560 (to 3 places)
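The worked example above can be checked with a short Python sketch (the function name is mine), including the caveat that a zero value makes the calculation impossible:

```python
def harmonic_mean(values):
    # Reciprocal of the arithmetic mean of the reciprocals.
    if any(v == 0 for v in values):
        raise ValueError("harmonic mean is undefined when any value is zero")
    return len(values) / sum(1 / v for v in values)

data = [8, 9, 6, 11, 10, 5]
# harmonic_mean(data) is about 7.560, matching the worked example.
```

Python also provides this as `statistics.harmonic_mean`.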
The Harmonic mean gets its name from Pythagorean theory of music, length of strings on a stringed instrument and the chords they produce. It is also used in computing Fibonacci Sequences. By the way, there is also a function in Excel called HARMEAN
that you can use.
1) If all the observations are a constant, say c, then the harmonic mean of the observations will also be c.
2) The harmonic mean can also be evaluated for a series having negative values.
3) If any of the values of a given series is zero then its harmonic mean cannot be determined as the reciprocal of zero doesn’t exist. If there are no zeros in the data set, then the relationships among these three means will be (Arithmetic mean > Geometric mean > Harmonic mean).
The harmonic mean is least affected by fluctuation in sampling. But we have to have a complete sampling of data elements, the terms should all be positive, and none of them can be zero.
To avoid problems like Elon Musk moving to your city and throwing off the average income, statisticians use other methods like the Harmonic Mean and Geometric Mean to more accurately summarize a set of data.
The post The problem with averages appeared first on Simple Talk.
]]>The post How to replace an identity column with a sequence number appeared first on Simple Talk.
]]>Microsoft introduced the sequence number objects starting with SQL Server 2012. A sequence object generates sequence numbers based on starting and increment values, similar to an identity column, but it has additional features. Over time, you might find that the additional benefits of a sequence number have you wanting to replace an identity column with a sequence number. This article demonstrates two options for replacing an identity column with a sequence number.
An Identity column might not be flexible enough to support all business requirements an application might need around a series of identity values. When this occurs, you might find the sequence number object more flexible. Here are a few of the reasons why you might want to change an identity column to a sequence number:
If you decide to change an identity column to a sequence number, there is more than one way to accomplish the conversion. This article demonstrates two different options. One option is to modify your table to add a sequence number column and then delete the identity column. Another option is to create another table that uses a sequence object and then use the ALTER
TABLE SWITCH
operation. This article explores both of these options and provides examples of how these options can be used to replace an identity column with a sequence number.
Replacing an identity column in an existing table using a column populated with a sequence number requires a few steps. To show you how this example works, first create a couple of sample tables.
The first sample table is named Sample and can be created using the script in Listing 1.
Listing 1: Code to create Sample table.
USE tempdb; GO CREATE TABLE Sample (ID int identity(1,1) NOT NULL, SampleName varchar(30) NOT NULL, CONSTRAINT PK_Sample_ID PRIMARY KEY CLUSTERED(ID ASC)); INSERT INTO Sample(SampleName) VALUES ('First'), ('Second'), ('Third'); SELECT * FROM Sample;
Running the code from Listing 1 produces the rows shown in Report 1.
Report 1: Rows generated in the Sample table
The Sample table has the identity column that will be changed to use a sequence number.
When swapping out an identity column for a column populated with a sequence number, you need to be careful not to mess up the tables with foreign key references to the identity column of the table being modified. Therefore, this example has a second table named Sample2 created in Listing 2. Sample2 simulates a table with a foreign key reference to the primary key associated with the ID column in the first table.
Listing 2: Creating Sample2 table
USE tempdb; GO CREATE TABLE Sample2 (ID int, CONSTRAINT FK_Sample2_ID FOREIGN KEY (ID) REFERENCES Sample(ID));
The first step to replacing an identity column with a sequence number is to create a new column in the Sample table to store the sequence number column. For this example, the sequence number column has the same data type as the original identity column, an INT data type. The code in Listing 3 is used to add the new sequence number column to the Sample table.
Listing 3: Added sequence number column
USE tempdb; GO ALTER TABLE Sample ADD SequenceNumber int NULL; GO
Once the new sequence number column has been added to the Sample table, the next step is to update the new column with the appropriate value for all the existing records. The new SequenceNumber column will be populated with the same values as the ID column. Therefore to update the SequenceNumber column, run the code in Listing 4.
Listing 4: Updating the SequenceNumber column
USE tempdb; GO UPDATE Sample Set SequenceNumber = ID; SELECT * FROM Sample;
The output in Report 2 is produced after running Listing 4.
Report 2: Sample table
The identity column of the Sample table has a primary key named PK_Sample_ID associated with it. This primary key needs to be removed from the ID
column and then moved to the new SequenceNumber
column. Before this can be done, any foreign keys that reference this primary key need to be dropped. To find all the foreign keys that refer to the PK_Sample_ID primary key, run the code in Listing 5.
Listing 5: Finding all foreign keys
USE tempdb; GO SELECT FK.TABLE_NAME as ForeignKeyTable, C.CONSTRAINT_NAME as Constraint_Name FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS C INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS FK ON C.CONSTRAINT_NAME =Fk.CONSTRAINT_NAME INNER JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS PK ON C.UNIQUE_CONSTRAINT_NAME=PK.CONSTRAINT_NAME INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE CU ON C.CONSTRAINT_NAME = CU.CONSTRAINT_NAME INNER JOIN ( SELECT i1.TABLE_NAME, i2.COLUMN_NAME FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS i1 INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE i2 ON i1.CONSTRAINT_NAME =i2.CONSTRAINT_NAME WHERE i1.CONSTRAINT_TYPE = 'PRIMARY KEY' ) PT ON PT.TABLE_NAME = PK.TABLE_NAME WHERE PK.TABLE_NAME = 'Sample' and PT.COLUMN_NAME = 'ID';
When the code in Listing 5 is run, the output in Report 3 is generated.
Report 3: Output created when Listing 5 is run.
To remove the only foreign key identified, the script in Listing 6 can be run.
Listing 6: Removing foreign key
USE tempdb; GO ALTER TABLE Sample2 DROP CONSTRAINT FK_Sample2_ID;
Keep in mind any foreign keys dropped should be recreated, so consider retaining the foreign key information.
With all the foreign keys removed, the primary key on the Sample table can be dropped using the script in Listing 7.
Listing 7: Removing primary key
USE tempdb; GO ALTER TABLE Sample DROP CONSTRAINT PK_Sample_ID; GO
With the primary key removed, the identity column can be dropped using the code in Listing 8.
Listing 8: Removing identity column.
USE tempdb; GO ALTER TABLE Sample DROP COLUMN ID ; GO
For the sequence number column to have the same name as the deleted identity column, it must be renamed. The script in Listing 9 performs this rename operation.
Listing 9: Renaming SequenceNumber column
USE tempdb; GO EXEC sp_rename 'Sample.SequenceNumber', 'ID', 'COLUMN'; GO
Identity columns are defined to have a NOT
NULL
requirement. Therefore to make the new sequence number column mirror the original identity column properties, the NOT
NULL
requirement must be added to the sequence number column. The script in Listing 10 alters the ID
column to not allow nulls and adds a new primary key constraint to replace the primary key deleted in step 4.
Listing 10: Adding NOT NULL requirement and primary key
USE tempdb; GO ALTER TABLE Sample ALTER COLUMN [ID] int NOT NULL; GO ALTER TABLE Sample ADD CONSTRAINT PK_Sample_ID PRIMARY KEY CLUSTERED (ID ASC) ; GO
Because the Sample table already has some rows, the highest value for the new ID
column needs to be determined. This value will be used to set the starting value for the new sequence number object. The starting value for the sequence number object will be the maximum value for the ID
column in the Sample table plus the increment value for the sequence object (which in this example will be 1, the same as the original identity increment value). The dynamic TSQL code in Listing 11 can be used to identify the highest ID
column value and create the new sequence number object with the correct START
value.
Listing 11: Creating Sequence number object
USE tempdb; GO DECLARE @NewStartValue int; DECLARE @IncrementValue int = 1; DECLARE @CMD nvarchar(1000); SELECT @NewStartValue = MAX(ID) + @IncrementValue FROM Sample; SET @CMD = 'CREATE SEQUENCE Sample_SequenceNumber AS INT START WITH ' + RTRIM(CAST(@NewStartValue as CHAR)) + ' INCREMENT BY ' + RTRIM(CAST(@IncrementValue AS CHAR)); EXEC sp_executesql @CMD GO
Care should be used when executing dynamic SQL to ensure you don’t cause a SQL injection issue. Therefore before executing any dynamic SQL, make sure you are not potentially opening the door for SQL injection issues.
When new rows are added to an identity column field, the value for the identity column is automatically populated with the next identity value by default. To get the new ID
column to automatically populate with a sequence number value, a default constraint needs to be added to the new ID
column. This constraint can be added by using the code in Listing 12.
Listing 12: Setting the ID column default value
USE tempdb; GO ALTER TABLE Sample ADD CONSTRAINT ID_Default DEFAULT (NEXT VALUE FOR Sample_SequenceNumber) FOR ID; GO
In Step 3, the one foreign key constraint that referenced the primary key on the Sample table was deleted. This step adds back that deleted foreign key reference, using the code in Listing 13.
USE tempdb; GO ALTER TABLE Sample2 ADD CONSTRAINT FK_Sample2_ID FOREIGN KEY (ID) REFERENCES Sample(ID);
With the identity column swapped out with a column populated by the sequence number object, all that is left to do is test out the new schema definition. This testing verifies that the new ID
column is populated with the next sequence number every time a new row is added. The testing can be done by running the code in Listing 14.
Listing 14: Inserting three new rows into the Sample table
USE tempdb; GO INSERT INTO Sample (SampleName) VALUES ('Fourth'), ('Fifth'), ('Sixth'); SELECT * FROM Sample;
When the code in Listing 14 is run, the results in Report 4 are created.
Report 4: Output when Listing 14 is executed
By reviewing the output in Report 4, you can see that the last three rows inserted got the next three sequential sequence numbers (4, 5, and 6). You might also notice that the position of the ID
column is no longer in the first ordinal position in the table as the original ID
column was.
Another option to replace an identity column is to use the ALTER
TABLE
SWITCH
operation. This technique is often used with table partitioning, which is out of scope for this article, but it is useful in this scenario. The SWITCH
operation doesn’t move the data. Instead, it switches the partition between the source and target tables. This process simplifies the migration, but some requirements need to be followed. Below are those requirements as found in the Microsoft Documentation:
Before learning how to use the SWITCH
operation to replace an identity column with a column populated with a sequence number, clean the database artifacts created in Option 1 and recreate the sample tables and sequence object. To perform this cleanup and recreation of objects, execute the code in Listing 15.
Listing 15: Cleanup and recreation of sample tables
USE tempdb; GO DROP TABLE Sample2; DROP TABLE Sample; DROP SEQUENCE Sample_SequenceNumber; GO CREATE TABLE Sample (ID int identity(1,1) NOT NULL, SampleName varchar(30) NOT NULL, CONSTRAINT PK_Sample_ID PRIMARY KEY CLUSTERED(ID ASC)); INSERT INTO Sample(SampleName) VALUES ('First'), ('Second'), ('Third'); CREATE TABLE Sample2 (ID int, CONSTRAINT FK_Sample2_ID FOREIGN KEY (ID) REFERENCES Sample(ID)); GO USE tempdb; GO DECLARE @NewStartValue int; DECLARE @IncrementValue int = 1; DECLARE @CMD nvarchar(1000); SELECT @NewStartValue = MAX(ID) + @IncrementValue FROM Sample; SET @CMD = 'CREATE SEQUENCE Sample_SequenceNumber AS INT START WITH ' + RTRIM(CAST(@NewStartValue as CHAR)) + ' INCREMENT BY ' + RTRIM(CAST(@IncrementValue AS CHAR)); EXEC sp_executesql @CMD GO
To use the SWITCH operation, the target table needs to be created first. Use the code in Listing 16 to create a new table that uses a sequence number, instead of an identity specification, to populate the ID column.
Listing 16: Create a new table
USE tempdb;
GO
CREATE TABLE Sample_New (
   ID int NOT NULL DEFAULT NEXT VALUE FOR Sample_SequenceNumber,
   SampleName varchar(30) NOT NULL,
   CONSTRAINT PK_Sample_New_ID PRIMARY KEY CLUSTERED(ID ASC)
);
GO
The SWITCH operation will fail if the primary key of the table being switched is referenced by any foreign keys. Therefore, all foreign keys referencing the Sample table need to be removed first. Sample2 has a foreign key reference to the primary key, which can be removed by running the code in Listing 6.
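If you don't want to page back to Listing 6, the drop amounts to removing the FK_Sample2_ID constraint created in Listing 15. A minimal sketch (the article's Listing 6 may differ slightly):

```sql
USE tempdb;
GO
-- FK_Sample2_ID is the foreign key created on Sample2 in Listing 15.
ALTER TABLE Sample2 DROP CONSTRAINT FK_Sample2_ID;
GO
```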
Once all the foreign key constraints have been removed, the SWITCH operation can be performed by running the code in Listing 17.
Listing 17: Switch tables
USE tempdb;
GO
ALTER TABLE Sample SWITCH TO Sample_New;
SELECT * FROM Sample_New;
When the switch is performed, the partition is switched between the source (Sample) and target (Sample_New) tables. Only the metadata is changed; the data is not moved. The output of the SELECT statement in Listing 17 can be found in Report 5. This output verifies that the rows from the Sample table are now associated with the Sample_New table.
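To see for yourself that the rows changed tables, you can count the rows on each side after the switch. A quick sketch, assuming the objects created in Listings 15 and 16:

```sql
USE tempdb;
GO
-- After the switch, Sample should be empty and Sample_New should
-- hold the three original rows.
SELECT COUNT(*) AS SampleRows    FROM Sample;
SELECT COUNT(*) AS SampleNewRows FROM Sample_New;
```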
Report 5: Rows in Sample_New table
Once the SWITCH operation has been performed, the old Sample table no longer contains any rows. Therefore, it can be dropped and the new table renamed by executing the code in Listing 18.
Listing 18: Drop old and rename new
USE tempdb;
GO
DROP TABLE Sample;
EXEC sp_rename N'Sample_New', N'Sample';
The final step of the migration is to recreate the foreign key deleted earlier. This key can be recreated by running the code in Listing 13.
To verify that this migration was successful, you can run the code in Listing 14. When this code is executed, it should produce the same results as shown in Report 4, with one exception: the ID column is now in ordinal position 1. I’ll leave it up to you to run this code and verify that the SWITCH operation successfully migrated from an identity column to a column populated by a sequence number object.
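One way to check the ordinal-position claim without eyeballing the result set is to query the standard INFORMATION_SCHEMA views (a sketch against the renamed Sample table):

```sql
USE tempdb;
GO
SELECT COLUMN_NAME, ORDINAL_POSITION
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'Sample'
ORDER BY ORDINAL_POSITION;
-- ID should now appear with ORDINAL_POSITION = 1.
```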
Identity column values cannot be updated, whereas columns populated with sequence numbers can be. If you want to make sure sequence number columns cannot be updated, an AFTER UPDATE trigger will need to be created. To verify that the existing Sample table allows updates to the ID column, run the code in Listing 19.
Listing 19: Updating ID column
USE tempdb;
GO
BEGIN TRAN;
UPDATE Sample SET ID = ID + 100;
SELECT * FROM Sample;
ROLLBACK TRAN;
GO
The code in Listing 19 runs successfully and produces the output in Report 6.
Report 6: Output when Listing 19 is run
By reviewing Report 6, you can see that the ID values were all updated from their original values. The code in Listing 19 wraps the update in BEGIN TRAN and ROLLBACK TRAN statements so that these changes are rolled back before the next test.
To make sure an UPDATE statement against the ID column in the Sample table cannot succeed, the AFTER UPDATE trigger in Listing 20 needs to be created.
Listing 20: UPDATE trigger
USE tempdb;
GO
CREATE TRIGGER trg_UpdateSample ON Sample
AFTER UPDATE
AS
BEGIN
   SET NOCOUNT ON;
   -- Compare the full sets of before/after ID values so the check
   -- also works for multi-row updates; a scalar variable would only
   -- capture one arbitrary row from the inserted/deleted tables.
   IF EXISTS (SELECT ID FROM inserted
              EXCEPT
              SELECT ID FROM deleted)
   BEGIN
      RAISERROR('Failed: Update performed on ID column', 16, 1);
      ROLLBACK TRANSACTION;
   END
END
GO
To test whether this trigger works, run the code in Listing 21.
Listing 21: Testing whether the trigger keeps the ID column from being updated
USE tempdb;
GO
UPDATE Sample SET ID = ID + 100 WHERE ID = 1;
GO
When the code in Listing 21 is run, the error message in Report 7 is produced.
Report 7: Error when Listing 21 is executed
Because of the AFTER UPDATE trigger trg_UpdateSample, the code in Listing 21 could not update the ID column.
Over time, you may find that an identity column needs to be swapped out and replaced with a sequence number. This article provided a couple of examples of how to perform that swap. If you plan to replace an identity column with a column populated by a sequence number value, keep in mind that sequence number columns can be updated; to restrict them from being updated, an AFTER UPDATE trigger needs to be defined.
The post How to replace an identity column with a sequence number appeared first on Simple Talk.