It is quite common for perfectly-functioning SQL code to be knocked off its feet by a change to the underlying database schema, or to other objects that are used in the code. If we are “lucky,” the code will suddenly start producing error messages; if not, it may just silently start producing different results. In either case, once the problem is discovered, the process of adjusting all of the application code can be long and painstaking.
Fortunately, there are a few simple defensive techniques that take little time to implement, yet may significantly reduce the possibility of such errors. The relatively small extra effort up front will save a lot of maintenance toil later.
This article will examine several examples of how changes to database objects can cause unexpected behavior in the code that accesses them, and discuss how to develop code that will not break, or behave unpredictably, as a result of such changes. Specifically, we will cover:
- changes to primary or unique keys, and how to test and validate assumptions regarding the “uniqueness” of column data
- changes to stored procedure signatures, and the importance of using explicitly named parameters
- changes to columns, such as adding columns as well as modifying an existing column’s nullability, size or data type.
One of the main lessons to be learned is that if your implementation relies on a particular property of the underlying schema, such as the uniqueness of a given column, then you must document that assumption, preferably in your unit tests, and make sure it always holds true.
Refactoring a Primary or Unique Key
Changes to the keys in your tables should, hopefully, be rare, but they can cause trouble to the unwary when they happen. In a broader context, the defensive programmer should always fully document and test any assumptions about the underlying uniqueness of the column data, as I described in the Ambiguous Updates sections of my previous Basic Defensive Database Programming Techniques article. The following examples demonstrate what can happen to perfectly correct code when changes are made to the underlying unique or primary keys, thus invalidating assumptions in the code regarding the uniqueness of the column data. We’ll then discuss how a query against the system views, or use of @@ROWCOUNT, can detect if such assumptions are still true.
In Listing 1, we create a table, Customers, using a UNIQUE constraint to guarantee the uniqueness of phone numbers, and then populate it with some test data.
CREATE TABLE dbo.Customers
    (
      CustomerId INT NOT NULL ,
      FirstName VARCHAR(50) NOT NULL ,
      LastName VARCHAR(50) NOT NULL ,
      Status VARCHAR(50) NOT NULL ,
      PhoneNumber VARCHAR(50) NOT NULL ,
      CONSTRAINT PK_Customers PRIMARY KEY ( CustomerId ) ,
      CONSTRAINT UNQ_Customers UNIQUE ( PhoneNumber )
    ) ;
GO

INSERT  INTO dbo.Customers
        ( CustomerId ,
          FirstName ,
          LastName ,
          Status ,
          PhoneNumber
        )
        SELECT  1 ,
                'Darrel' ,
                'Ling' ,
                'Regular' ,
                '(123)456-7890'
        UNION ALL
        SELECT  2 ,
                'Peter' ,
                'Hansen' ,
                'Regular' ,
                '(234)123-4567' ;
We need to implement a simple stored procedure, shown in Listing 2, which will allow users to find a customer based on their phone number, and set their customer status (regular, preferred, or VIP). If no customer exists for a given phone number, we don’t need to raise an exception; we simply do nothing.
CREATE PROCEDURE dbo.SetCustomerStatus
    @PhoneNumber VARCHAR(50) ,
    @Status VARCHAR(50)
AS
    BEGIN ;
        UPDATE  dbo.Customers
        SET     Status = @Status
        WHERE   PhoneNumber = @PhoneNumber ;
    END ;
This implementation assumes that at most one customer has any given phone number. Clearly, right now, this assumption is true, as it is guaranteed by the UNQ_Customers constraint.
Suppose, however, that at some later time we need to store data about customers from different countries. At this point, the phone number alone no longer uniquely identifies a customer, but the combination of country code and phone number does. In order to accommodate this requirement, our Customers table is altered to add the new column, CountryCode, and our UNQ_Customers constraint is modified so that it enforces uniqueness based on a combination of the CountryCode and PhoneNumber columns. These alterations are shown in Listing 3.
ALTER TABLE dbo.Customers
ADD CountryCode CHAR(2) NOT NULL
    CONSTRAINT DF_Customers_CountryCode DEFAULT('US') ;
GO

ALTER TABLE dbo.Customers
DROP CONSTRAINT UNQ_Customers ;
GO

ALTER TABLE dbo.Customers
ADD CONSTRAINT UNQ_Customers UNIQUE(PhoneNumber, CountryCode) ;
Note that, in reality, we should have added a lookup table, dbo.CountryCodes, referred to by a FOREIGN KEY constraint. However, I’ve avoided a lookup table in this case, in favor of keeping the example simple.
At this point, our constraint is no longer enforcing the uniqueness of values in the PhoneNumber column, so we can insert a customer with an identical phone number to an existing customer, but with a different country code, as shown in Listing 4.
UPDATE  dbo.Customers
SET     Status = 'Regular' ;

INSERT  INTO dbo.Customers
        ( CustomerId ,
          FirstName ,
          LastName ,
          Status ,
          PhoneNumber ,
          CountryCode
        )
        SELECT  3 ,
                'Wayne' ,
                'Miller' ,
                'Regular' ,
                '(123)456-7890' ,
                'UK' ;
Our stored procedure, however, is still working on the assumption that a customer can be uniquely identified by their phone number alone. Since this assumption is no longer valid, the stored procedure, in its current form, could erroneously update more than one row of data, as demonstrated in Listing 5.
-- at this moment all customers have Regular status
EXEC dbo.SetCustomerStatus
    @PhoneNumber = '(123)456-7890',
    @Status = 'Preferred' ;

-- the procedure has modified statuses of two customers
SELECT  CustomerId ,
        Status
FROM    dbo.Customers ;

CustomerId  Status
----------- -------------
1           Preferred
2           Regular
3           Preferred
Perhaps the most prevalent and damaging mistake made during the development of SQL code is a failure to define or recognize the assumptions on which the implementation relies. The result is code that is brittle, and liable to behave unpredictably when these assumptions are invalidated by changes to the underlying database objects.
Whenever we change our unique and/or primary keys, we need to review all the procedures that depend on the modified tables. However, the manual process of reviewing the potentially affected code is, like all manual processes, slow and prone to error. It may be more efficient to automate the process of identifying the modules that rely on particular assumptions about the underlying schema. Unit tests allow us to accomplish exactly that; we can easily, for example, write a unit test that succeeds if there is a UNIQUE constraint on the PhoneNumber column alone, and fails when this is no longer the case.
Using unit tests to document and test assumptions
Let’s translate the assumption that the PhoneNumber column uniquely identifies a customer into a query against the system views. The query is rather complex, so we’ll develop it in several steps. First of all, we need to know if there are any constraints on the PhoneNumber column, as shown in Listing 6.
SELECT  COUNT(*)
FROM    INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS u
WHERE   u.TABLE_NAME = 'Customers'
        AND u.TABLE_SCHEMA = 'dbo'
        AND u.COLUMN_NAME = 'PhoneNumber' ;
This query returns 1, confirming that there is a constraint on that column. Next, we need to verify that the constraint is either a primary key or a unique constraint:
SELECT  COUNT(*)
FROM    INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS u
        JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS c
          ON c.TABLE_NAME = u.TABLE_NAME
             AND c.TABLE_SCHEMA = u.TABLE_SCHEMA
             AND c.CONSTRAINT_NAME = u.CONSTRAINT_NAME
WHERE   u.TABLE_NAME = 'Customers'
        AND u.TABLE_SCHEMA = 'dbo'
        AND u.COLUMN_NAME = 'PhoneNumber'
        AND c.CONSTRAINT_TYPE IN ( 'PRIMARY KEY', 'UNIQUE' ) ;
Finally, we need to make sure that no other columns are included in that UNIQUE or PRIMARY KEY constraint, as follows:
SELECT  COUNT(*)
FROM    INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS u
        JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS c
          ON c.TABLE_NAME = u.TABLE_NAME
             AND c.TABLE_SCHEMA = u.TABLE_SCHEMA
             AND c.CONSTRAINT_NAME = u.CONSTRAINT_NAME
WHERE   u.TABLE_NAME = 'Customers'
        AND u.TABLE_SCHEMA = 'dbo'
        AND u.COLUMN_NAME = 'PhoneNumber'
        AND c.CONSTRAINT_TYPE IN ( 'PRIMARY KEY', 'UNIQUE' )
        -- this constraint involves only one column
        AND ( SELECT    COUNT(*)
              FROM      INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS u1
              WHERE     u1.TABLE_NAME = u.TABLE_NAME
                        AND u1.TABLE_SCHEMA = u.TABLE_SCHEMA
                        AND u1.CONSTRAINT_NAME = u.CONSTRAINT_NAME
            ) = 1 ;
When we run this query against the original database schema, with a UNIQUE constraint on the PhoneNumber column, it returns a value of 1, indicating that there is indeed a constraint built only on the PhoneNumber column. However, when we run it after the column CountryCode has been added to the definition of the unique constraint, the second subquery returns the value 2, which means that the UNIQUE constraint UNQ_Customers is built on two columns, and so the outer query returns a value of 0.
In short, this query provides us with a means to verify the validity of the assumption that the PhoneNumber column uniquely identifies a customer. By incorporating this query into our unit test harness, we can accomplish two goals:
- our assumption is documented – the code in Listing 8 clearly documents the fact that the dbo.SetCustomerStatus stored procedure needs a unique or primary key constraint on a single column, PhoneNumber
- our assumption is tested – if the required constraint is dropped, or includes more than one column, we shall get a clear warning, because the unit test will fail.
Of course, we should wrap this query in a stored procedure and reuse it, because there will surely be other cases when we rely on the uniqueness of a column used in our search condition.
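As a sketch of what such a reusable check might look like (the procedure name dbo.IsColumnUnique and its signature are my own invention for illustration, not part of the original schema), we can wrap the Listing 8 query in a stored procedure with an output parameter:

```sql
-- Hypothetical helper: sets @IsUnique to 1 only if the named column,
-- on its own, is covered by a single-column PRIMARY KEY or UNIQUE
-- constraint; otherwise sets it to 0.
CREATE PROCEDURE dbo.IsColumnUnique
    @TableSchema sysname ,
    @TableName sysname ,
    @ColumnName sysname ,
    @IsUnique BIT OUTPUT
AS
    BEGIN ;
        SELECT  @IsUnique = CASE WHEN COUNT(*) > 0 THEN 1
                                 ELSE 0
                            END
        FROM    INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS u
                JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS c
                  ON c.TABLE_NAME = u.TABLE_NAME
                     AND c.TABLE_SCHEMA = u.TABLE_SCHEMA
                     AND c.CONSTRAINT_NAME = u.CONSTRAINT_NAME
        WHERE   u.TABLE_NAME = @TableName
                AND u.TABLE_SCHEMA = @TableSchema
                AND u.COLUMN_NAME = @ColumnName
                AND c.CONSTRAINT_TYPE IN ( 'PRIMARY KEY', 'UNIQUE' )
                -- the constraint must involve only this one column
                AND ( SELECT    COUNT(*)
                      FROM      INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS u1
                      WHERE     u1.TABLE_NAME = u.TABLE_NAME
                                AND u1.TABLE_SCHEMA = u.TABLE_SCHEMA
                                AND u1.CONSTRAINT_NAME = u.CONSTRAINT_NAME
                    ) = 1 ;
    END ;
```

A unit test could then simply execute dbo.IsColumnUnique with 'dbo', 'Customers', 'PhoneNumber' and fail whenever the output parameter comes back as 0.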
We can use a similar technique to verify whether or not a combination of columns, considered together, are guaranteed to be unique. Implementing this query is left as an exercise to the reader.
Using @@ROWCOUNT to verify assumptions
Alternatively, instead of documenting our assumption as a unit test, we can have our stored procedure detect how many rows it modified, and roll back if it updated more than one row, as shown in Listing 9.
ALTER PROCEDURE dbo.SetCustomerStatus
    @PhoneNumber VARCHAR(50) ,
    @Status VARCHAR(50)
AS
    BEGIN ;
        BEGIN TRANSACTION ;

        UPDATE  dbo.Customers
        SET     Status = @Status
        WHERE   PhoneNumber = @PhoneNumber ;

        IF @@ROWCOUNT > 1
            BEGIN ;
                ROLLBACK ;
                RAISERROR('More than one row updated', 16, 1) ;
            END ;
        ELSE
            BEGIN ;
                COMMIT ;
            END ;
    END ;
To see it in action, run Listing 10; the stored procedure raises an error and does not modify the data.
UPDATE  dbo.Customers
SET     Status = 'Regular' ;

EXEC dbo.SetCustomerStatus
    @PhoneNumber = '(123)456-7890',
    @Status = 'Preferred' ;

Msg 50000, Level 16, State 1, Procedure SetCustomerStatus, Line 15
More than one row updated

-- verify if the procedure has modified any data
SELECT  CustomerId ,
        Status
FROM    dbo.Customers ;
In general, this approach could be useful although, in this particular case, it is less preferable than a unit test. The reason is very simple: a unit test will alert us about a problem before deployment, allowing us to fix the problem early, and to deploy without this particular bug. The altered stored procedure might not indicate a problem until the code has been deployed to production, which means troubleshooting a production system and redeploying a fix; a situation we usually want to avoid.
Using SET instead of SELECT when assigning variables
It is important to understand the different behavior of SET and SELECT when assigning values to variables. For example, some developers do not realize that SELECT leaves a variable unchanged if the selected result set is empty. If we’re using that variable to terminate a loop, we can end up in the dreaded ‘infinite loop’. This sort of knowledge will help you write application code that is resistant to changes to the underlying schema objects.
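A minimal sketch of that behavior, using the Customers table from Listing 1 (the phone number below is simply one I've chosen that matches no rows):

```sql
DECLARE @CustomerId INT ;
SET @CustomerId = -1 ;

-- no row matches, so SELECT performs no assignment at all
SELECT  @CustomerId = CustomerId
FROM    dbo.Customers
WHERE   PhoneNumber = '(000)000-0000' ;

-- @CustomerId is still -1; a loop whose exit condition depends on
-- this variable changing would never terminate
SELECT  @CustomerId AS CustomerId ;
```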
Let’s consider an example whereby a search condition contains an implied assumption regarding the uniqueness of the underlying data column. The search condition, and subsequent variable assignment, shown in Listing 11 assumes, again, that the PhoneNumber column can uniquely identify a customer.
DECLARE @CustomerId INT ;

SELECT  @CustomerId = CustomerId
FROM    dbo.Customers
WHERE   PhoneNumber = '(123)456-7890' ;

SELECT  @CustomerId AS CustomerId ;
-- Do something with CustomerId
In our original database schema, before we added the CountryCode column to the Customers table, the result of this assignment was predictable. However, in our new schema, the UNQ_Customers constraint only guarantees the uniqueness of the values in the PhoneNumber and CountryCode columns, considered together. As a result, we have two customers with this phone number, and so the variable assignment is unpredictable; we do not, and cannot, know which of the two CustomerId values, 1 or 3, will populate the variable.
In most cases, such ambiguity is not acceptable. The simplest fix is to use SET instead of SELECT to populate the variable, as shown in Listing 12.
DECLARE @CustomerId INT ;

-- this assignment will succeed,
-- because in this case there is no ambiguity
SET @CustomerId = ( SELECT  CustomerId
                    FROM    dbo.Customers
                    WHERE   PhoneNumber = '(234)123-4567'
                  ) ;

SELECT  @CustomerId AS CustomerId ;

CustomerId
-----------
2

-- this assignment will fail,
-- because there is ambiguity,
-- two customers have the same phone number
SET @CustomerId = ( SELECT  CustomerId
                    FROM    dbo.Customers
                    WHERE   PhoneNumber = '(123)456-7890'
                  ) ;

Msg 512, Level 16, State 1, Line 16
Subquery returned more than 1 value. This is not permitted when
the subquery follows =, !=, <, <= , >, >= or when the subquery
is used as an expression.

-- the above error must be intercepted and handled. See:
-- http://bit.ly/1EG7wkL

-- the variable is left unchanged
SELECT  @CustomerId AS CustomerId ;

CustomerId
-----------
2
Refactoring the Signature of a Stored Procedure
Consider the stored procedure shown in Listing 13, SelectCustomersByName, which takes two optional search conditions, and selects data from the Customers table, as defined in Listing 1.
CREATE PROCEDURE dbo.SelectCustomersByName
    @LastName VARCHAR(50) = NULL ,
    @PhoneNumber VARCHAR(50) = NULL
AS
    BEGIN ;
        SELECT  CustomerId ,
                FirstName ,
                LastName ,
                PhoneNumber ,
                Status
        FROM    dbo.Customers
        WHERE   LastName = COALESCE(@LastName, LastName)
                AND PhoneNumber = COALESCE(@PhoneNumber, PhoneNumber) ;
    END ;
When we invoke this stored procedure, we can explicitly name its parameters, and make the code more readable, but we are not forced to do so, as shown in Listing 14.
EXEC dbo.SelectCustomersByName
    'Hansen',           -- @LastName
    '(234)123-4567' ;   -- @PhoneNumber

EXEC dbo.SelectCustomersByName
    @LastName = 'Hansen',
    @PhoneNumber = '(234)123-4567' ;
At the moment, either way of invoking the stored procedure produces the same result. Suppose, however, that the signature of this stored procedure is subsequently modified to accept an optional @FirstName parameter, as described in Listing 15.
ALTER PROCEDURE dbo.SelectCustomersByName
    @FirstName VARCHAR(50) = NULL ,
    @LastName VARCHAR(50) = NULL ,
    @PhoneNumber VARCHAR(50) = NULL
AS
    BEGIN ;
        SELECT  CustomerId ,
                FirstName ,
                LastName ,
                PhoneNumber ,
                Status
        FROM    dbo.Customers
        WHERE   FirstName = COALESCE(@FirstName, FirstName)
                AND LastName = COALESCE(@LastName, LastName)
                AND PhoneNumber = COALESCE(@PhoneNumber, PhoneNumber) ;
    END ;
GO
As a result of this modification, the two ways of invoking the stored procedure are no longer equivalent. Of course, we will not receive any error message; we will just silently start getting different results, as shown in Listing 16.
-- in the new context this call is interpreted
-- differently. It will return no rows
EXEC dbo.SelectCustomersByName
    'Hansen',           -- @FirstName
    '(234)123-4567' ;   -- @LastName

-- this stored procedure call is equivalent
-- to the previous one
EXEC dbo.SelectCustomersByName
    @FirstName = 'Hansen',
    @LastName = '(234)123-4567' ;

-- this call returns the required row
EXEC dbo.SelectCustomersByName
    @LastName = 'Hansen',
    @PhoneNumber = '(234)123-4567' ;
The lesson here is clear: stored procedure calls with explicitly named parameters are more robust; they continue to work correctly even when the signature of the stored procedure changes, or they give explicit errors instead of silently returning incorrect results.
Refactoring Columns
One of the most common causes of brittle code is a failure to program defensively against subsequent changes to the columns of the underlying data tables. These changes can take the form of adding columns, or changing the definition of existing columns, for example, their data type or size.
Of course, some changes are so serious that our code cannot survive them. For example, if a column that is required in a query is removed, then that is a breaking change that we can do nothing to protect against. However, in many other cases, we can develop code that is resilient to changes to the underlying columns. In this section, we’ll examine a few examples, explaining how to make our code more robust in each case.
Qualifying column names
It takes a few extra keystrokes to qualify column names when writing our queries, but these keystrokes pay healthy dividends in terms of the resilience of the resulting code. Consider the example tables created in Listing 17, Shipments and ShipmentItems, populated with sample data.
CREATE TABLE dbo.Shipments
    (
      Barcode VARCHAR(30) NOT NULL PRIMARY KEY,
      SomeOtherData VARCHAR(100) NULL
    ) ;
GO

INSERT  INTO dbo.Shipments
        ( Barcode ,
          SomeOtherData
        )
        SELECT  '123456' ,
                '123456 data'
        UNION ALL
        SELECT  '123654' ,
                '123654 data' ;
GO

CREATE TABLE dbo.ShipmentItems
    (
      ShipmentBarcode VARCHAR(30) NOT NULL,
      Description VARCHAR(100) NULL
    ) ;
GO

INSERT  INTO dbo.ShipmentItems
        ( ShipmentBarcode ,
          Description
        )
        SELECT  '123456' ,
                'Some cool widget'
        UNION ALL
        SELECT  '123456' ,
                'Some cool stuff for some gadget' ;
GO
Against this schema, we develop the query shown in Listing 18, which, for every shipment, selects its Barcode and calculates the number of shipment items with a ShipmentBarcode matching the Barcode for that shipment.
SELECT  Barcode ,
        ( SELECT    COUNT(*)
          FROM      dbo.ShipmentItems
          WHERE     ShipmentBarcode = Barcode
        ) AS NumItems
FROM    dbo.Shipments ;

Barcode      NumItems
---------------------------------------
123456       2
123654       0
The inner query is an example of a correlated subquery; it uses in its WHERE clause the Barcode column from the Shipments table in the outer query. Notice that the query works even though we failed to qualify the column names.
Yet the situation can change. Rather than just having a barcode to identify shipments, we start using barcodes to identify individual items in the shipment, so we need to add a Barcode column to the ShipmentItems table, as shown in Listing 19.
ALTER TABLE dbo.ShipmentItems
ADD Barcode VARCHAR(30) NULL ;
GO

SELECT  Barcode ,
        ( SELECT    COUNT(*)
          FROM      dbo.ShipmentItems
          WHERE     ShipmentBarcode = Barcode
        ) AS NumItems
FROM    dbo.Shipments ;

Barcode      NumItems
---------------------------------------
123456       0
123654       0
We do not get any error messages; our query continues to work, but silently changes its behavior. With the addition of the Barcode column to the ShipmentItems table, our query is interpreted quite differently. Now, for every shipment, it selects its barcode, followed by the number of ShipmentItems whose Barcode value matches their ShipmentBarcode value. In other words, the correlated subquery becomes uncorrelated; the WHERE clause of the inner query no longer uses a value from the outer query.
It takes just a few moments to properly qualify all the column names in our query, and the improved query will continue to work correctly even after the addition of the Barcode column to our ShipmentItems table, as shown in Listing 20.
SELECT  s.Barcode ,
        ( SELECT    COUNT(*)
          FROM      dbo.ShipmentItems AS i
          WHERE     i.ShipmentBarcode = s.Barcode
        ) AS NumItems
FROM    dbo.Shipments AS s ;

Barcode      NumItems
---------------------------------------
123456       2
123654       0
Qualifying column names improves the robustness of our queries. The same technique also ensures that you get an error, instead of incorrect results, when a column is removed or when a column name is misspelled. For example, consider the case of an uncorrelated subquery that becomes correlated because a column from a table in the subquery is removed (or misspelled in the query), but happens to match a column in the outer query. Many developers forget that the parser will look in the outer query if it fails to find a match in the inner query.
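For instance, here is a hypothetical sketch against the tables from Listing 17. Suppose the inner query is meant to filter on a ShipmentItems column but mistakenly references SomeOtherData, which exists only in dbo.Shipments. With unqualified names, the query still parses and runs, silently correlated against the outer table:

```sql
-- SomeOtherData does not exist in dbo.ShipmentItems, so the parser
-- resolves it against dbo.Shipments in the outer query; no error is
-- raised, and the filter quietly tests the outer row instead
SELECT  Barcode
FROM    dbo.Shipments
WHERE   Barcode IN ( SELECT ShipmentBarcode
                     FROM   dbo.ShipmentItems
                     WHERE  SomeOtherData IS NOT NULL ) ;
```

Had every column been qualified with a table alias, the misspelling would have produced a compile-time error instead of a wrong result.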
Handling changes in nullability: NOT IN versus NOT EXISTS
Queries with NOT IN have a well-known vulnerability. They do not work as an inexperienced database programmer might expect if the subquery contained in the NOT IN clause returns at least one NULL. This is easy to demonstrate. In Listing 21, we recreate our ShipmentItems table with a Barcode column that does not accept NULLs, and then insert some fresh data. We then execute a query that uses the NOT IN clause.
DROP TABLE dbo.ShipmentItems ;
GO

CREATE TABLE dbo.ShipmentItems
    (
      ShipmentBarcode VARCHAR(30) NOT NULL ,
      Description VARCHAR(100) NULL ,
      Barcode VARCHAR(30) NOT NULL
    ) ;
GO

INSERT  INTO dbo.ShipmentItems
        ( ShipmentBarcode ,
          Barcode ,
          Description
        )
        SELECT  '123456' ,
                '1010203' ,
                'Some cool widget'
        UNION ALL
        SELECT  '123654' ,
                '1010203' ,
                'Some cool widget'
        UNION ALL
        SELECT  '123654' ,
                '1010204' ,
                'Some cool stuff for some gadget' ;
GO

-- retrieve all the items from shipment 123654
-- that are not shipped in shipment 123456
SELECT  Barcode
FROM    dbo.ShipmentItems
WHERE   ShipmentBarcode = '123654'
        AND Barcode NOT IN ( SELECT Barcode
                             FROM   dbo.ShipmentItems
                             WHERE  ShipmentBarcode = '123456' ) ;

Barcode
------------------------------
1010204
The query works as expected, and will continue to do so as long as the Barcode column disallows NULLs. However, let’s see what happens when we change the nullability of that column, as shown in Listing 22.
ALTER TABLE dbo.ShipmentItems
ALTER COLUMN Barcode VARCHAR(30) NULL ;

INSERT  INTO dbo.ShipmentItems
        ( ShipmentBarcode ,
          Barcode ,
          Description
        )
        SELECT  '123456' ,
                NULL ,
                'Users manual for some gadget' ;
GO

SELECT  Barcode
FROM    dbo.ShipmentItems
WHERE   ShipmentBarcode = '123654'
        AND Barcode NOT IN ( SELECT Barcode
                             FROM   dbo.ShipmentItems
                             WHERE  ShipmentBarcode = '123456' ) ;

Barcode
------------------------------

(0 row(s) affected)
This can often seem like a very subtle bug; sometimes the query works as expected, but sometimes it does not. In fact, the behavior is completely consistent. Every time the subquery inside the NOT IN clause returns at least one NULL, the query returns nothing. Listing 23 shows a much simpler script that demonstrates this behavior very clearly.
SELECT  CASE WHEN 1 NOT IN ( 2, 3 ) THEN 'True'
             ELSE 'Unknown or False'
        END ,
        CASE WHEN 1 NOT IN ( 2, 3, NULL ) THEN 'True'
             ELSE 'Unknown or False'
        END ;

----------------------
True    Unknown or False
This behavior may seem counterintuitive, but it actually makes perfect sense. Let me explain why, in just two simple steps. Listing 24 shows two queries. The first one uses an IN clause; the second is logically equivalent to the first, but the IN clause has been expressed using OR predicates.
-- A query using the IN clause:
SELECT  CASE WHEN 1 IN ( 1, 2, NULL ) THEN 'True'
             ELSE 'Unknown or False'
        END ;

-- its logical equivalent using OR
SELECT  CASE WHEN ( 1 = 1 )
                  OR ( 1 = 2 )
                  OR ( 1 = NULL ) THEN 'True'
             ELSE 'Unknown or False'
        END ;
In the second step, we must consider the NOT IN version of our query, convert it to use OR predicates, and then apply De Morgan’s law, which states that for the logical expressions P and Q:

NOT (P OR Q) = (NOT P) AND (NOT Q)
The result is shown in Listing 25.
-- A query using the NOT IN clause:
SELECT  CASE WHEN 1 NOT IN ( 1, 2, NULL ) THEN 'True'
             ELSE 'Unknown or False'
        END ;

-- its logical equivalent using OR
SELECT  CASE WHEN NOT ( ( 1 = 1 )
                        OR ( 1 = 2 )
                        OR ( 1 = NULL )
                      ) THEN 'True'
             ELSE 'Unknown or False'
        END ;

-- applying De Morgan's law, replacing every OR with AND,
-- and every = with <>
SELECT  CASE WHEN ( ( 1 <> 1 )
                    AND ( 1 <> 2 )
                    AND ( 1 <> NULL )
                  ) THEN 'True'
             ELSE 'Unknown or False'
        END ;
Take note of the (1 <> NULL) condition in the final query; by definition, the result of this condition is always unknown and, when joined with other conditions using AND, the result of the whole expression will always be false. This is why no row can ever qualify a NOT IN condition if the subquery inside that NOT IN returns at least one NULL.
Whenever we write a query using the NOT IN clause, we implicitly assume that the subquery can never return a NULL. Usually, we do not need to rely on such assumptions, because it is very easy to develop correct queries without making them. For example, Listing 26 shows how to remove this assumption from our original SELECT query.
SELECT  Barcode
FROM    dbo.ShipmentItems
WHERE   ShipmentBarcode = '123654'
        AND Barcode NOT IN ( SELECT Barcode
                             FROM   dbo.ShipmentItems
                             WHERE  ShipmentBarcode = '123456'
                                    AND Barcode IS NOT NULL ) ;
By adding just one short and simple line to our query, we have improved its robustness. Alternatively, we can develop a query that does not assume anything about the nullability of any columns, as shown in Listing 27.
-- retrieve all the items from shipment 123654
-- that are not shipped in shipment 123456
SELECT  i.Barcode
FROM    dbo.ShipmentItems AS i
WHERE   i.ShipmentBarcode = '123654'
        AND NOT EXISTS ( SELECT *
                         FROM   dbo.ShipmentItems AS i1
                         WHERE  i1.ShipmentBarcode = '123456'
                                AND i1.Barcode = i.Barcode ) ;
This query will work in the same way, regardless of whether or not the Barcode column is nullable.
Handling changes to data types and sizes
We frequently develop stored procedures with parameters, the types and lengths of which must match the types and lengths of some underlying columns. By way of an example, Listing 28 creates a Codes table and populates it with test data. It then creates a SelectCode stored procedure with one parameter, @Code, the type and length of which must match the type and length of the Code column in the Codes table.
DROP TABLE dbo.Codes ;
GO

CREATE TABLE dbo.Codes
    (
      Code VARCHAR(5) NOT NULL ,
      Description VARCHAR(40) NOT NULL ,
      CONSTRAINT PK_Codes PRIMARY KEY ( Code )
    ) ;
GO

INSERT  INTO dbo.Codes
        ( Code, Description )
VALUES  ( '12345', 'Description for 12345' ) ;

INSERT  INTO dbo.Codes
        ( Code, Description )
VALUES  ( '34567', 'Description for 34567' ) ;
GO

CREATE PROCEDURE dbo.SelectCode
-- clearly the type and length of this parameter
-- must match the type and length of Code column
-- in dbo.Codes table
    @Code VARCHAR(5)
AS
    SELECT  Code ,
            Description
    FROM    dbo.Codes
    WHERE   Code = @Code ;
GO
When we execute the stored procedure, supplying an appropriate value for the @Code parameter, the result is as expected, and is shown in Listing 29.
EXEC dbo.SelectCode @Code = '12345' ;

Code       Description
---------- ----------------------------------------
12345      Description for 12345
Suppose, however, that we have to change the length of the Code column to accept longer values, as shown in Listing 30.
ALTER TABLE dbo.Codes
DROP CONSTRAINT PK_Codes ;
GO

ALTER TABLE dbo.Codes
ALTER COLUMN Code VARCHAR(10) NOT NULL ;
GO

ALTER TABLE dbo.Codes
ADD CONSTRAINT PK_Codes PRIMARY KEY(Code) ;
GO

INSERT  INTO dbo.Codes
        ( Code, Description )
VALUES  ( '1234567890', 'Description for 1234567890' ) ;
However, the unmodified stored procedure still expects a VARCHAR(5) parameter, and it silently truncates any longer value, as shown in Listing 31.
EXEC dbo.SelectCode @Code = '1234567890' ;

Code       Description
---------- ---------------------
12345      Description for 12345
Such bugs are quite subtle, and may take considerable time to troubleshoot. How can we prevent such errors, other than by manually referring to a data dictionary every time we change column types? The best way, in my opinion, is to document the requirement that the types and lengths of our stored procedure parameters must match the types and lengths of our columns, and then incorporate this requirement into a boundary case unit test, to protect us from such errors.
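As a sketch of one possible unit-test query (using the names from this article's example; this is my own suggested approach, not the only way to do it), we can compare the parameter's declared type and length against the column's via the INFORMATION_SCHEMA views:

```sql
-- Returns 1 while @Code of dbo.SelectCode exactly matches the type
-- and length of dbo.Codes.Code, and 0 otherwise (for example, after
-- Listing 30 widens the column); a unit test built on this query
-- would therefore fail as soon as the two definitions diverge.
SELECT  COUNT(*)
FROM    INFORMATION_SCHEMA.PARAMETERS AS p
        JOIN INFORMATION_SCHEMA.COLUMNS AS c
          ON c.DATA_TYPE = p.DATA_TYPE
             AND c.CHARACTER_MAXIMUM_LENGTH = p.CHARACTER_MAXIMUM_LENGTH
WHERE   p.SPECIFIC_SCHEMA = 'dbo'
        AND p.SPECIFIC_NAME = 'SelectCode'
        AND p.PARAMETER_NAME = '@Code'
        AND c.TABLE_SCHEMA = 'dbo'
        AND c.TABLE_NAME = 'Codes'
        AND c.COLUMN_NAME = 'Code' ;
```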
Summary
We have seen how changes in database objects may break our code, and discussed several defensive techniques that improve the robustness of our code, so that it better survives changes in other database objects. Specifically, I hope you’ve learned the following lessons in defensive programming:
- how to test the continuing validity of your assumptions about the characteristics of the underlying primary and unique keys
- why using SET is generally safer than using SELECT when assigning values to variables
- the importance of explicitly named parameters in stored procedure calls
- the need to qualify column names in queries
- why NOT EXISTS leads to more resilient queries than NOT IN
- the importance of testing and validating even the “obvious” assumptions, such as the need for the types and lengths of parameters to match those of the underlying columns.