Painless Refactoring of SQL Server Database Objects

It is quite common for perfectly-functioning SQL code to be knocked off its feet by a change to the underlying database schema, or to other objects that are used in the code. If we are “lucky,” the code will suddenly start producing error messages; if not, it may just silently start producing different results. In either case, once the problem is discovered, the process of adjusting all of the application code can be long and painstaking.

Fortunately, there are a few simple defensive techniques that take little time to implement, but yet may significantly reduce the possibility of such errors. The relatively small extra effort up front will save a lot of maintenance toil later.

This article will examine several examples of how changes to database objects can cause unexpected behavior in the code that accesses them, and discuss how to develop code that will not break, or behave unpredictably, as a result of such changes. Specifically, we will cover:

  • changes to primary or unique keys, and how to test and validate assumptions regarding the “uniqueness” of column data
  • changes to stored procedure signatures, and the importance of using explicitly named parameters
  • changes to columns, such as adding columns as well as modifying an existing column’s nullability, size or data type.

One of the main lessons to be learned is that if your implementation relies on a particular property of the underlying schema, such as the uniqueness of a given column, then you must document that assumption, preferably in your unit tests, and make sure it always holds true.

Refactoring a Primary or Unique Key

Changes to the keys in your tables should, hopefully, be rare, but they can cause trouble to the unwary when they happen. In a broader context, the defensive programmer should always fully document and test any assumptions about the underlying uniqueness of the column data, as I described in the Ambiguous Updates sections of my previous Basic Defensive Database Programming Techniques article. The following examples demonstrate what can happen to perfectly correct code when changes are made to the underlying unique or primary keys, thus invalidating assumptions in the code regarding the uniqueness of the column data. We’ll then discuss how a query against the system views, or use of @@ROWCOUNT, can detect if such assumptions are still true.

In Listing 1, we create a table, Customers, using a UNIQUE constraint to guarantee the uniqueness of phone numbers, and then populate it with some test data.

Listing 1: Creating the Customers table, with a UNIQUE constraint on the PhoneNumber column.
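The table might be sketched along the following lines; the column names are inferred from the rest of the article, and the exact definitions and sample data in the original listing may differ.

CREATE TABLE dbo.Customers
    (
      CustomerId INT NOT NULL
        CONSTRAINT PK_Customers PRIMARY KEY ,
      FirstName VARCHAR(50) NOT NULL ,
      LastName VARCHAR(50) NOT NULL ,
      PhoneNumber VARCHAR(50) NOT NULL ,
      Status VARCHAR(20) NOT NULL ,
      CONSTRAINT UNQ_Customers UNIQUE ( PhoneNumber )
    ) ;
GO

-- sample data; Darrell Ling is referred to later in the article,
-- the second customer is purely illustrative
INSERT  INTO dbo.Customers
        ( CustomerId, FirstName, LastName, PhoneNumber, Status )
VALUES  ( 1, 'Darrell', 'Ling', '(123)456-7890', 'Regular' ) ,
        ( 2, 'Peter', 'Hansen', '(234)123-4567', 'Regular' ) ;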

We need to implement a simple stored procedure, shown in Listing 2, which will allow users to find a customer based on their phone number, and set their customer status (regular, preferred, or VIP). If no customer exists for a given phone number, we don’t need to raise an exception; we simply do nothing.

Listing 2: The SetCustomerStatus stored procedure, which finds a customer by phone number and sets their status.
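A sketch of the procedure, against the table as defined above, might look like this:

CREATE PROCEDURE dbo.SetCustomerStatus
    @PhoneNumber VARCHAR(50) ,
    @Status VARCHAR(20)
AS
BEGIN
    -- relies on the assumption that at most one customer
    -- has any given phone number
    UPDATE  dbo.Customers
    SET     Status = @Status
    WHERE   PhoneNumber = @PhoneNumber ;
END ;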

This implementation assumes that at most one customer has any given phone number. Clearly, right now, this assumption is true as it is guaranteed by the UNQ_Customers constraint.

Suppose, however, that at some later time we need to store data about customers from different countries. At this point, the phone number alone no longer uniquely identifies a customer, but the combination of country code and phone number does. In order to accommodate this requirement, our Customers table is altered to add the new column, CountryCode, and our UNQ_Customers constraint is modified so that it enforces uniqueness based on a combination of the CountryCode and PhoneNumber columns. These alterations are shown in Listing 3.

Listing 3: Adding a CountryCode column to the table and to the unique constraint.
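The alterations could be sketched as follows; the default country code used to populate the existing rows is illustrative.

-- add the new column; a default is needed because the table already has rows
ALTER TABLE dbo.Customers
    ADD CountryCode VARCHAR(3) NOT NULL
        CONSTRAINT DF_Customers_CountryCode DEFAULT '1' ;
GO

-- rebuild the unique constraint on the combination of the two columns
ALTER TABLE dbo.Customers
    DROP CONSTRAINT UNQ_Customers ;

ALTER TABLE dbo.Customers
    ADD CONSTRAINT UNQ_Customers UNIQUE ( CountryCode, PhoneNumber ) ;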

Note that, in reality, we should have added a lookup table, dbo.CountryCodes, referred to by a FOREIGN KEY constraint. However, I’ve avoided a lookup table in this case, in favor of keeping the example simple.

At this point, our constraint is no longer enforcing the uniqueness of values in the PhoneNumber column, so we can insert a customer with an identical phone number to an existing customer, but with a different country code, as shown in Listing 4.

Listing 4: Wayne Miller has the same phone number as Darrell Ling, but with a different country code.
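An illustrative version of that insert (the country code value is assumed):

INSERT  INTO dbo.Customers
        ( CustomerId, FirstName, LastName, PhoneNumber, CountryCode, Status )
VALUES  ( 3, 'Wayne', 'Miller', '(123)456-7890', '44', 'Regular' ) ;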

Our stored procedure, however, is still working on the assumption that a customer can be uniquely identified by their phone number alone. Since this assumption is no longer valid, the stored procedure, in its current form, could erroneously update more than one row of data, as demonstrated in Listing 5.

Listing 5: The unchanged stored procedure modifies two rows instead of one.
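With the sample data sketched above, for example:

EXEC dbo.SetCustomerStatus
    @PhoneNumber = '(123)456-7890' ,
    @Status = 'Preferred' ;

-- both customers that share this phone number have been updated
SELECT  CustomerId ,
        Status
FROM    dbo.Customers
WHERE   PhoneNumber = '(123)456-7890' ;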

Perhaps the most prevalent and damaging mistake made during the development of SQL code is a failure to define or recognize the assumptions on which the implementation relies. The result is code that is brittle, and liable to behave unpredictably when these assumptions are invalidated by changes to the underlying database objects.

Whenever we change our unique and/or primary keys, we need to review all the procedures that depend on the modified tables. However, the manual process of reviewing the potentially affected code is, like all manual processes, slow and prone to error. It may be more efficient to automate the process of identifying the modules that rely on particular assumptions about the underlying schema. Unit tests allow us to accomplish exactly that; we can easily, for example, write a unit test that succeeds if there is a UNIQUE constraint on the PhoneNumber column alone, and fails when this is no longer the case.

Using unit tests to document and test assumptions

Let’s translate the assumption that the PhoneNumber column uniquely identifies a customer into a query against the system views. The query is rather complex, so we’ll develop it in several steps. First of all, we need to know if there are any constraints on the PhoneNumber column, as shown in Listing 6.

Listing 6: Step 1, a query to check for constraints on PhoneNumber.
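One way to express this check is against the INFORMATION_SCHEMA views; the original listing may query the system metadata differently, but a sketch might be:

SELECT  COUNT(*)
FROM    INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE
WHERE   TABLE_SCHEMA = 'dbo'
        AND TABLE_NAME = 'Customers'
        AND COLUMN_NAME = 'PhoneNumber' ;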

This query returns 1, confirming that there is a constraint on that column. Next, we need to verify that the constraint is either a primary key or a unique constraint:

Listing 7: Step 2 determines if the constraint on column PhoneNumber is a primary key or unique.
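Continuing the sketch, we can join to INFORMATION_SCHEMA.TABLE_CONSTRAINTS and filter on the constraint type:

SELECT  COUNT(*)
FROM    INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS c
        INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS u
            ON  u.CONSTRAINT_NAME = c.CONSTRAINT_NAME
                AND u.TABLE_SCHEMA = c.TABLE_SCHEMA
                AND u.TABLE_NAME = c.TABLE_NAME
WHERE   c.TABLE_SCHEMA = 'dbo'
        AND c.TABLE_NAME = 'Customers'
        AND c.CONSTRAINT_TYPE IN ( 'PRIMARY KEY', 'UNIQUE' )
        AND u.COLUMN_NAME = 'PhoneNumber' ;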

Finally, we need to make sure that no other columns are included in that UNIQUE or PRIMARY KEY constraint, as follows:

Listing 8: Step 3, the final query determines whether there is a unique or primary key constraint that is built on only the PhoneNumber column.
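A sketch of the complete check follows. The first subquery verifies that the constraint includes the PhoneNumber column; the second counts the columns in the constraint, so that constraints built on more than one column are excluded.

SELECT  COUNT(*)
FROM    INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS c
WHERE   c.TABLE_SCHEMA = 'dbo'
        AND c.TABLE_NAME = 'Customers'
        AND c.CONSTRAINT_TYPE IN ( 'PRIMARY KEY', 'UNIQUE' )
        -- the constraint must include the PhoneNumber column...
        AND 'PhoneNumber' IN
            ( SELECT    u.COLUMN_NAME
              FROM      INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS u
              WHERE     u.CONSTRAINT_NAME = c.CONSTRAINT_NAME
                        AND u.TABLE_SCHEMA = c.TABLE_SCHEMA
                        AND u.TABLE_NAME = c.TABLE_NAME )
        -- ...and it must not include any other columns
        AND ( SELECT    COUNT(*)
              FROM      INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS u
              WHERE     u.CONSTRAINT_NAME = c.CONSTRAINT_NAME
                        AND u.TABLE_SCHEMA = c.TABLE_SCHEMA
                        AND u.TABLE_NAME = c.TABLE_NAME ) = 1 ;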

When we run this query against the original database schema, with a UNIQUE constraint on the PhoneNumber column, it returns a value of 1 indicating that there is indeed a constraint built only on the PhoneNumber column. However, when we run it after the column CountryCode has been added to the definition of the unique constraint, the second subquery returns the value 2, which means that the UNIQUE constraint UNQ_Customers is built on two columns, and so the outer query returns a value of 0.

In short, this query provides us with a means to verify the validity of the assumption that the PhoneNumber column uniquely identifies a customer. By incorporating this query into our unit test harness, we can accomplish two goals:

  • our assumption is documented – the code in Listing 8 clearly documents the fact that the dbo.SetCustomerStatus stored procedure needs a unique or primary constraint on a single column, PhoneNumber
  • our assumption is tested – if the required constraint is dropped, or includes more than one column, we shall get a clear warning, because the unit test will fail.

Of course, we should wrap this query in a stored procedure and reuse it, because there will surely be other cases when we rely on the uniqueness of a column used in our search condition.

We can use a similar technique to verify whether or not a combination of columns, considered together, is guaranteed to be unique. Implementing this query is left as an exercise to the reader.

Using @@ROWCOUNT to verify assumptions

Alternatively, instead of documenting our assumption as a unit test, we can have our stored procedure detect how many rows it modified, and roll back if it updated more than one row, as shown in Listing 9.

Listing 9: A stored procedure that will not modify more than one row.
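A sketch of such a procedure; the check of @@ROWCOUNT has to come immediately after the UPDATE, before any other statement resets it.

ALTER PROCEDURE dbo.SetCustomerStatus
    @PhoneNumber VARCHAR(50) ,
    @Status VARCHAR(20)
AS
BEGIN
    BEGIN TRANSACTION ;

    UPDATE  dbo.Customers
    SET     Status = @Status
    WHERE   PhoneNumber = @PhoneNumber ;

    -- @@ROWCOUNT is checked immediately after the UPDATE
    IF @@ROWCOUNT > 1
        BEGIN
            ROLLBACK TRANSACTION ;
            RAISERROR ( 'More than one row would have been modified; rolling back.', 16, 1 ) ;
        END
    ELSE
        BEGIN
            COMMIT TRANSACTION ;
        END ;
END ;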

To see it in action, run Listing 10; the stored procedure raises an error and does not modify the data.

Listing 10: Testing the altered stored procedure.
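For example, with the duplicate phone number still in place:

-- raises the error from the procedure and leaves both rows unchanged
EXEC dbo.SetCustomerStatus
    @PhoneNumber = '(123)456-7890' ,
    @Status = 'VIP' ;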

In general, this approach can be useful although, in this particular case, it is less desirable than a unit test. The reason is very simple: a unit test will alert us to a problem before deployment, allowing us to fix the problem early and to deploy without this particular bug. The altered stored procedure might not indicate a problem until the code has been deployed to production, which means troubleshooting a production system and redeploying a fix, a situation we usually want to avoid.

Using SET instead of SELECT when assigning variables

It is important to understand the different behavior of SET and SELECT when assigning values to variables. For example, some developers do not realize that a SELECT assignment leaves the variable unchanged if the selected result set is empty. If we are using that variable to terminate a loop, we can end up in the dreaded ‘infinite loop’. This sort of knowledge will help you write application code that is resistant to changes to the underlying schema objects.

Let’s consider an example whereby a search condition contains an implied assumption regarding the uniqueness of the underlying data column. The search condition, and subsequent variable assignment, shown in Listing 11 assumes, again, that the PhoneNumber column can uniquely identify a customer.

Listing 11: Unpredictable variable assignment, using SELECT.
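A sketch of such an assignment, using the phone number shared by the two customers in our example data:

DECLARE @CustomerId INT ;

SELECT  @CustomerId = CustomerId
FROM    dbo.Customers
WHERE   PhoneNumber = '(123)456-7890' ;

-- two customers share this phone number, so either
-- CustomerId 1 or CustomerId 3 may end up in the variable
SELECT  @CustomerId AS CustomerId ;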

In our original database schema, before we added the CountryCode column to the Customers table, the result of this assignment was predictable. However, in our new schema, the UNQ_Customers constraint only guarantees the uniqueness of the values in the PhoneNumber and CountryCode columns, considered together. As a result, we now have two customers with this phone number, and so the variable assignment is unpredictable; we do not, and cannot, know which of the two CustomerId values, 1 or 3, will populate the variable.

In most cases, such ambiguity is not acceptable. The simplest fix is to use SET instead of SELECT to populate the variable, as shown in Listing 12.

Listing 12: Whereas SELECT ignores the ambiguity, SET detects it and raises an error.
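A sketch of the SET-based version; if the subquery returns more than one row, SQL Server raises a “subquery returned more than 1 value” error rather than silently picking one of the values.

DECLARE @CustomerId INT ;

SET @CustomerId = ( SELECT  CustomerId
                    FROM    dbo.Customers
                    WHERE   PhoneNumber = '(123)456-7890'
                  ) ;

SELECT  @CustomerId AS CustomerId ;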

Refactoring the Signature of a Stored Procedure

Consider the stored procedure shown in Listing 13, SelectCustomersByName, which takes two optional search conditions, and selects data from the Customers table, as defined in Listing 1.

Listing 13: The SelectCustomersByName stored procedure.
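The exact parameter list of the original procedure is not shown here; the sketch below assumes that the two optional search conditions are a last name and a phone number.

CREATE PROCEDURE dbo.SelectCustomersByName
    @LastName VARCHAR(50) = NULL ,
    @PhoneNumber VARCHAR(50) = NULL
AS
BEGIN
    SELECT  CustomerId ,
            FirstName ,
            LastName ,
            PhoneNumber ,
            Status
    FROM    dbo.Customers
    WHERE   ( LastName = @LastName
              OR @LastName IS NULL )
            AND ( PhoneNumber = @PhoneNumber
                  OR @PhoneNumber IS NULL ) ;
END ;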

When we invoke this stored procedure, we can explicitly name its parameters, and make the code more readable, but we are not forced to do so, as shown in Listing 14.

Listing 14: Two ways to invoke the SelectCustomersByName stored procedure.
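For example, assuming the signature sketched above:

-- invocation with positional parameters
EXEC dbo.SelectCustomersByName
    'Ling' ,
    '(123)456-7890' ;

-- invocation with explicitly named parameters
EXEC dbo.SelectCustomersByName
    @LastName = 'Ling' ,
    @PhoneNumber = '(123)456-7890' ;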

At the moment, either way of invoking the stored procedure produces the same result. Suppose, however, that the signature of this stored procedure is subsequently modified to accept an optional @FirstName parameter, as described in Listing 15.

Listing 15: The modified SelectCustomersByName stored procedure includes an additional FirstName parameter.
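A sketch of the modified signature; for the example to behave as described, the new parameter has to be added in the middle of the parameter list, so that positional calls now bind their second argument to @FirstName rather than @PhoneNumber.

ALTER PROCEDURE dbo.SelectCustomersByName
    @LastName VARCHAR(50) = NULL ,
    @FirstName VARCHAR(50) = NULL ,
    @PhoneNumber VARCHAR(50) = NULL
AS
BEGIN
    SELECT  CustomerId ,
            FirstName ,
            LastName ,
            PhoneNumber ,
            Status
    FROM    dbo.Customers
    WHERE   ( LastName = @LastName
              OR @LastName IS NULL )
            AND ( FirstName = @FirstName
                  OR @FirstName IS NULL )
            AND ( PhoneNumber = @PhoneNumber
                  OR @PhoneNumber IS NULL ) ;
END ;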

As a result of this modification, the two ways of invoking the stored procedure are no longer equivalent. Of course, we will not receive any error message; we will just silently start getting different results, as shown in Listing 16.

Listing 16: The same stored procedure call is interpreted differently after the signature of that stored procedure has changed.
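Using the calls from Listing 14 against the sketched signatures:

-- the positional call now binds '(123)456-7890' to @FirstName rather than
-- @PhoneNumber, so it no longer finds the customer
EXEC dbo.SelectCustomersByName
    'Ling' ,
    '(123)456-7890' ;

-- the call with named parameters still returns the same row as before
EXEC dbo.SelectCustomersByName
    @LastName = 'Ling' ,
    @PhoneNumber = '(123)456-7890' ;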

The lesson here is clear: stored procedure calls with explicitly named parameters are more robust; they continue to work correctly even when the signature of the stored procedure changes, or they give explicit errors instead of silently returning incorrect results.

Refactoring Columns

One of the most common causes of brittle code is a failure to program defensively against subsequent changes to the columns of the underlying data tables. These changes can take the form of adding columns, or changing the definition of existing columns, for example, their data type or size.

Of course, some changes are so serious that our code cannot survive them. For example, if a column that is required in a query is removed, then that is a breaking change that we can do nothing to protect against. However, in many other cases, we can develop code that is resilient to changes to the underlying columns. In this section, we’ll examine a few examples, explaining how to make our code more robust in each case.

Qualifying column names

It takes a few extra keystrokes to qualify column names when writing our queries, but these keystrokes pay healthy dividends in terms of the resilience of the resulting code. Consider the example tables created in Listing 17, Shipments and ShipmentItems, populated with sample data.

Listing 17: The Shipments and ShipmentItems tables.
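A sketch of the two tables; the column definitions and sample data are illustrative.

CREATE TABLE dbo.Shipments
    (
      Barcode VARCHAR(30) NOT NULL
        CONSTRAINT PK_Shipments PRIMARY KEY ,
      SomeOtherData VARCHAR(100) NULL
    ) ;
GO

INSERT  INTO dbo.Shipments
        ( Barcode, SomeOtherData )
VALUES  ( '123456', 'Shipment 1' ) ,
        ( '123654', 'Shipment 2' ) ;
GO

CREATE TABLE dbo.ShipmentItems
    (
      ShipmentBarcode VARCHAR(30) NOT NULL ,
      Description VARCHAR(100) NULL
    ) ;
GO

INSERT  INTO dbo.ShipmentItems
        ( ShipmentBarcode, Description )
VALUES  ( '123456', 'Some cool widget' ) ,
        ( '123456', 'Some cool stuff' ) ,
        ( '123654', 'Some other widget' ) ;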

Against this schema, we develop the query shown in Listing 18, which, for every shipment, selects its Barcode and calculates the number of shipment items with a ShipmentBarcode matching the Barcode for that shipment.

Listing 18: A correlated subquery that works correctly even though column names are not qualified.
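Against the schema sketched above, the query might look like this; note that the column names in the subquery are not qualified.

SELECT  Barcode ,
        ( SELECT    COUNT(*)
          FROM      dbo.ShipmentItems
          WHERE     ShipmentBarcode = Barcode
        ) AS NumItems
FROM    dbo.Shipments ;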

The inner query is an example of a correlated subquery: its WHERE clause uses the Barcode column from the Shipments table in the outer query. Notice that the query works even though we failed to qualify the column names.

Yet the situation can change. Rather than just having a barcode to identify shipments, we start using barcodes to identify individual items in the shipment, so we need to add a Barcode column to the ShipmentItems table, as shown in Listing 19.

Listing 19: The query works differently when a Barcode column is added to the ShipmentItems table.
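Assuming the new column is nullable, the change and the re-run query might look like this:

ALTER TABLE dbo.ShipmentItems
    ADD Barcode VARCHAR(30) NULL ;
GO

-- the same query as before, but the subquery is no longer correlated:
-- both columns in its WHERE clause now resolve to ShipmentItems
SELECT  Barcode ,
        ( SELECT    COUNT(*)
          FROM      dbo.ShipmentItems
          WHERE     ShipmentBarcode = Barcode
        ) AS NumItems
FROM    dbo.Shipments ;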

We do not get any error messages; our query continues to work but silently changes its behavior. With the addition of the Barcode column to the ShipmentItems table, our query is interpreted quite differently. Now, for every shipment, it selects its barcode followed by the number of ShipmentItems whose Barcode value matches their ShipmentBarcode value. In other words, the correlated subquery becomes uncorrelated; the WHERE clause of the inner query no longer uses a value from the outer query.

It takes just a few moments to properly qualify all the column names in our query, and the improved query will continue to work correctly even after the addition of the Barcode column to our ShipmentItems table, as shown in Listing 20.

Listing 20: Qualified column names lead to more robust code.
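The same query, with every column qualified:

SELECT  s.Barcode ,
        ( SELECT    COUNT(*)
          FROM      dbo.ShipmentItems AS i
          WHERE     i.ShipmentBarcode = s.Barcode
        ) AS NumItems
FROM    dbo.Shipments AS s ;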

Qualifying column names improves the robustness of our queries. The same technique also ensures that you get an error, instead of incorrect results, when a column is removed or when a column name is misspelled. For example, consider the case of an uncorrelated subquery that becomes correlated because a column from a table in the subquery is removed (or misspelled in the query), but happens to match a column in the outer query. Many developers forget that the parser will look in the outer query if it fails to find a match in the inner query.

Handling changes in nullability: NOT IN versus NOT EXISTS

Queries with NOT IN have a well-known vulnerability. They do not work as an inexperienced database programmer might expect if the subquery contained in the NOT IN clause returns at least one NULL. This is easy to demonstrate. In Listing 21, we recreate our ShipmentItems table with a Barcode column that does not accept NULLs, and then insert some fresh data. We then execute a query that uses the NOT IN clause.

Listing 21: Creating the new ShipmentItems table, populating it with some test data, and proving that the query using the NOT IN clause succeeds.
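A sketch of such a script; the specific NOT IN query, which retrieves the items of one shipment that are not also present in another, is illustrative rather than taken from the original listing.

-- recreate the table, this time with a Barcode column that disallows NULLs
DROP TABLE dbo.ShipmentItems ;
GO

CREATE TABLE dbo.ShipmentItems
    (
      ShipmentBarcode VARCHAR(30) NOT NULL ,
      Barcode VARCHAR(30) NOT NULL ,
      Description VARCHAR(100) NULL
    ) ;
GO

INSERT  INTO dbo.ShipmentItems
        ( ShipmentBarcode, Barcode, Description )
VALUES  ( '123456', 'Box1', 'Some cool widget' ) ,
        ( '123456', 'Box2', 'Some cool stuff' ) ,
        ( '123654', 'Box1', 'Some cool widget' ) ;
GO

-- items in shipment 123456 that are not also included in shipment 123654
SELECT  i.Barcode
FROM    dbo.ShipmentItems AS i
WHERE   i.ShipmentBarcode = '123456'
        AND i.Barcode NOT IN ( SELECT   i2.Barcode
                               FROM     dbo.ShipmentItems AS i2
                               WHERE    i2.ShipmentBarcode = '123654' ) ;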

The query works as expected, and will continue to do so as long as the Barcode column disallows NULLs. However, let’s see what happens when we change the nullability of that column, as shown in Listing 22.

Listing 22: Now that the Barcode column accepts NULL, our NOT IN query no longer works as expected.
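Continuing with the sketch from Listing 21:

ALTER TABLE dbo.ShipmentItems
    ALTER COLUMN Barcode VARCHAR(30) NULL ;
GO

-- an item whose own barcode is unknown
INSERT  INTO dbo.ShipmentItems
        ( ShipmentBarcode, Barcode, Description )
VALUES  ( '123654', NULL, 'Item with an unknown barcode' ) ;
GO

-- the subquery now returns a NULL, so this query returns no rows at all
SELECT  i.Barcode
FROM    dbo.ShipmentItems AS i
WHERE   i.ShipmentBarcode = '123456'
        AND i.Barcode NOT IN ( SELECT   i2.Barcode
                               FROM     dbo.ShipmentItems AS i2
                               WHERE    i2.ShipmentBarcode = '123654' ) ;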

This can seem like a very subtle bug; sometimes the query works as expected, but sometimes it does not. In fact, the behavior is completely consistent: every time the subquery inside the NOT IN clause returns at least one NULL, the query returns nothing. Listing 23 shows a much simpler script that demonstrates this behavior very clearly.

Listing 23: NOT IN queries will work differently when there are NULLs in the subquery.
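For example (the literals are illustrative):

-- the subquery returns no NULLs: one row is returned
SELECT  'a row is returned' AS Result
WHERE   1 NOT IN ( SELECT 2 ) ;

-- the subquery returns a NULL: no row can ever be returned
SELECT  'no row is returned' AS Result
WHERE   1 NOT IN ( SELECT 2
                   UNION ALL
                   SELECT NULL ) ;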

This behavior may seem counterintuitive, but it actually makes perfect sense. Let me explain why, in just two simple steps. Listing 24 shows two queries. The first one uses an IN clause; the second is logically equivalent to the first, but the IN clause has been expressed using OR predicates.

Listing 24: A query with an IN clause, and a logically equivalent query using OR.
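A sketch of such a pair of queries, again with illustrative literals:

-- a query using an IN clause
SELECT  'IN version' AS Result
WHERE   1 IN ( 1, NULL ) ;

-- the logically equivalent query, with the IN clause expanded into OR predicates
SELECT  'OR version' AS Result
WHERE   1 = 1
        OR 1 = NULL ;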

In the second step, we must consider the NOT IN version of our query, convert it to use OR predicates, and then apply De Morgan’s law, which states that for the logical expressions P and Q:

NOT(P OR Q) = (NOT P) AND (NOT Q)

The result is shown in Listing 25.

Listing 25: Three equivalent queries, the first using NOT IN, the second using two OR predicates and the third one with two AND predicates.
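A sketch of the three forms; none of these queries returns a row, because of the NULL in the list.

-- NOT IN
SELECT  1 AS Result
WHERE   1 NOT IN ( 2, NULL ) ;

-- the same condition expanded into OR predicates, negated
SELECT  2 AS Result
WHERE   NOT ( 1 = 2
              OR 1 = NULL ) ;

-- the same condition after applying De Morgan's law
SELECT  3 AS Result
WHERE   1 <> 2
        AND 1 <> NULL ;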

Take note of the (1<>NULL) condition in the final query; by definition, the result of this condition is always unknown and, when combined with the other conditions using AND, it prevents the whole expression from ever evaluating to true. This is why no row can ever satisfy a NOT IN condition if the subquery inside that NOT IN returns at least one NULL.

Whenever we write a query using the NOT IN clause, we implicitly assume that the subquery can never return a NULL. Usually, we do not need to rely on such assumptions, because it is very easy to develop correct queries without making them. For example, Listing 26 shows how to remove this assumption from our original SELECT query.

Listing 26: A query with a subquery that never returns any NULLs.
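Applied to the illustrative query from Listing 21, the one extra line is a filter that keeps NULLs out of the subquery:

SELECT  i.Barcode
FROM    dbo.ShipmentItems AS i
WHERE   i.ShipmentBarcode = '123456'
        AND i.Barcode NOT IN ( SELECT   i2.Barcode
                               FROM     dbo.ShipmentItems AS i2
                               WHERE    i2.ShipmentBarcode = '123654'
                                        -- the added line: the subquery can no longer return NULLs
                                        AND i2.Barcode IS NOT NULL ) ;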

By adding just one short and simple line to our query, we have improved its robustness. Alternatively, we can develop a query that does not assume anything about the nullability of any columns, as shown in Listing 27.

Listing 27: An equivalent query with NOT EXISTS.
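The NOT EXISTS form of the same illustrative query:

SELECT  i.Barcode
FROM    dbo.ShipmentItems AS i
WHERE   i.ShipmentBarcode = '123456'
        AND NOT EXISTS ( SELECT *
                         FROM   dbo.ShipmentItems AS i2
                         WHERE  i2.ShipmentBarcode = '123654'
                                AND i2.Barcode = i.Barcode ) ;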

This query will work in the same way, regardless of whether or not the Barcode column is nullable.

Handling changes to data types and sizes

We frequently develop stored procedures with parameters, the types and lengths of which must match the types and lengths of the underlying columns. By way of an example, Listing 28 creates a Codes table and populates it with test data. It then creates a SelectCode stored procedure with one parameter, @Code, whose type and length must match those of the Code column in the Codes table.

Listing 28: The Codes table and SelectCode stored procedure.
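A sketch of the table and procedure; the sample data are illustrative, but note that the table contains a code of exactly five characters, which is what makes the truncation bug shown later possible.

CREATE TABLE dbo.Codes
    (
      Code VARCHAR(5) NOT NULL ,
      Description VARCHAR(100) NOT NULL
    ) ;
GO

INSERT  INTO dbo.Codes
        ( Code, Description )
VALUES  ( '12345', 'Description for code 12345' ) ;
GO

CREATE PROCEDURE dbo.SelectCode
    -- the parameter type and length must match those of the Code column
    @Code VARCHAR(5)
AS
BEGIN
    SELECT  Code ,
            Description
    FROM    dbo.Codes
    WHERE   Code = @Code ;
END ;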

When we execute the stored procedure, supplying an appropriate value for the @Code parameter, the result is as expected, and is shown in Listing 29.

Listing 29: The SelectCode stored procedure works as expected.
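For example:

-- returns the row for code '12345'
EXEC dbo.SelectCode @Code = '12345' ;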

Suppose, however, that we have to change the length of the Code column to accept longer values, as shown in Listing 30.

Listing 30: Increasing the length of the Code column and adding a row with maximum Code length.
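A sketch of the change; the new length and the sample value are illustrative.

ALTER TABLE dbo.Codes
    ALTER COLUMN Code VARCHAR(10) NOT NULL ;
GO

INSERT  INTO dbo.Codes
        ( Code, Description )
VALUES  ( '1234567890', 'Description for code 1234567890' ) ;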

However, the unmodified stored procedure still expects a VARCHAR(5) parameter, and it silently truncates any longer value, as shown in Listing 31.

Listing 31: The unchanged stored procedure retrieves the wrong row.
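With the sketched data:

-- the VARCHAR(5) parameter silently truncates '1234567890' to '12345',
-- so the procedure returns the row for code '12345' instead
EXEC dbo.SelectCode @Code = '1234567890' ;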

Such bugs are quite subtle, and may take considerable time to troubleshoot. How can we prevent such errors, short of manually referring to a data dictionary every time we change column types? The best way, in my opinion, is to document the requirement that the types and lengths of our stored procedure parameters must match the types and lengths of our columns, and then to incorporate this requirement into a boundary-case unit test that protects us from such errors.

Summary

We have seen how changes in database objects may break our code, and discussed several defensive techniques that improve the robustness of our code, so that it better survives changes in other database objects. Specifically, I hope you’ve learned the following lessons in defensive programming:

  • how to test the continuing validity of your assumptions about the characteristics of the underlying primary and unique keys
  • why using SET is generally safer than using SELECT when assigning values to variables
  • the importance of explicitly named parameters in stored procedure calls
  • the need to qualify column names in queries
  • why NOT EXISTS leads to more resilient queries than NOT IN
  • the importance of testing and validating even the “obvious” assumptions, such as the need for the types and lengths of parameters to match those of the underlying columns.

About the author

Alex Kuznetsov

Alex Kuznetsov has been working with object oriented languages and databases for more than a decade. He has worked with Sybase, SQL Server, Oracle and DB2. Alex has published multiple articles on simple-talk.com and sqlblog.com and wrote a book entitled Defensive Database Programming with SQL Server. Currently he works in an agile team in Chicago. In his leisure time, Alex prepares for and runs ultramarathons.
