The Basics of Good T-SQL Coding Style – Part 4: Performance

There are several obvious problems with poor SQL coding habits. It can make code difficult to maintain, or confuse your team colleagues. It can make refactoring a chore or testing difficult. The most serious problem is poor performance. You can write SQL that looks beautiful but performs sluggishly or interferes with other processes. A busy database developer adopts good habits so as to avoid hours of staring at execution plans. Rob Sheldon gives some examples.

The series so far:

  1. The Basics of Good T-SQL Coding Style
  2. The Basics of Good T-SQL Coding Style – Part 2: Defining Database Objects
  3. The Basics of Good T-SQL Coding Style – Part 3: Querying and Manipulating Data
  4. The Basics of Good T-SQL Coding Style – Part 4: Performance

When working with SQL Server, performance should be at the forefront, whether you are designing databases, managing indexes, or writing T-SQL queries. Each of these three factors plays a role in delivering applications that will perform over the long haul. Although one factor is as important as the next, we focus only on the queries component in this article, as part of our continued discussion about the basics of good coding.

Writing queries can be tricky because you need to continuously evaluate your code for performance-related issues, taking into account a wide range of considerations. For example, you should consider how you’re using temporary tables, outer joins, comparison expressions, correlated subqueries, and a variety of other elements.

Throughout the article, I provide a number of examples of T-SQL queries to help our discussion along. The examples are based on the AdventureWorks2014 database, installed on a local instance of SQL Server 2016. I’ve tried to hit as many of the more salient issues as possible, but keep in mind that a discussion focused on performance is much broader than what we can cover here, even though we’re limiting it only to T-SQL queries.

Also be aware that performance-related issues are seldom black-and-white. More often than not, your final approach to a query will be based on your specific circumstances and the data you’re working with. You might have to try different approaches before coming up with the best one for your particular circumstances, both as they exist today and as you expect them to evolve. Perhaps the way to think of this article is as a set of suggested guidelines, without any assumptions of absolute truth.

The ubiquitous table source

At the heart of most queries are one or more table sources (tables, views, derived tables, or joined tables) that make it possible to get at the data you need in order to produce the necessary results. These table sources come in many forms. A query might use temporary tables, common table expressions (CTEs), table variables, table-valued functions, table-valued parameters (TVPs), or joins to provide one or more workable data sets that can be massaged and transformed before returning the final results. In fact, any query that contains a FROM clause starts out with a table source.

Given that T-SQL is a set-based language, thinking in terms of table sources makes sense. But working with these structures is not always a straightforward process when it comes to performance. Let’s start with an example of a temporary table. The following T-SQL code creates the #sales temporary table and joins it with the Person table in the AdventureWorks2014 database:
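
Something along these lines would do the job (a sketch rather than a definitive listing; I’m assuming SUM(TotalDue) as the sales measure, inferred from the results shown below):

    -- Aggregate sales per rep into a temporary table
    -- (SUM(TotalDue) as the measure is an assumption).
    SELECT SalesPersonID, SUM(TotalDue) AS TotalSales
    INTO #sales
    FROM Sales.SalesOrderHeader
    WHERE SalesPersonID IS NOT NULL
    GROUP BY SalesPersonID;

    -- Map each total to the rep's name via the Person table.
    SELECT p.FirstName, p.LastName, s.TotalSales
    FROM #sales AS s
      INNER JOIN Person.Person AS p
        ON s.SalesPersonID = p.BusinessEntityID
    ORDER BY p.LastName, p.FirstName;

    DROP TABLE #sales;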

The example is fairly straightforward. We’re simply aggregating the sales totals and mapping that information to the applicable sales rep. The SELECT statement returns the results shown in the following table:

FirstName LastName TotalSales
Syed Abbas 172524.4512
Amy Alberts 732759.1841
Pamela Ansman-Wolfe 3325102.5941
Michael Blythe 9293903.0046
David Campbell 3729945.349
Jillian Carson 10065803.5404
Shu Ito 6427005.554
Stephen Jiang 1092123.8561
Tete Mensa-Annan 2312545.69
Linda Mitchell 10367007.4265
Jae Pak 8503338.6457
Tsvi Reiter 7171012.7501
José Saraiva 5926418.3555
Lynn Tsoflias 1421810.9242
Rachel Valdez 1827066.7118
Garrett Vargas 3609447.2148
Ranjit Varkey Chudukatil 4509888.9311

The reason I’ve included this example is not to demonstrate how to create or use temporary tables, but to point out the larger issue of choosing carefully which table structure to use in a given situation, taking into account the ramifications for performance, along with other considerations.

There are, in fact, a number of important differences between temporary tables and table variables. For example, temporary tables support column statistics and can be indexed after they’re created, whereas table variables carry no statistics, which can lead to poor cardinality estimates as row counts grow. The point of all this is that you need to fully understand how each one works before implementing them in your code. Only then can you choose the one best suited for your particular circumstances.

You should also keep these table sources in mind for less common use cases. For example, if you need to join to a large table and you plan to filter its data, you might consider creating a temporary table to hold the subset of data and then joining to the temporary table as a way to boost performance.

Thinking in terms of data sets can often help you recast a T-SQL statement to perform better. The database engine works best with sets of data, unlike the engines used for procedural languages. Take, for example, the following SELECT statement, which includes a correlated subquery in the WHERE clause:
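
The statement might look something like this (a sketch; the pay-rate check against HumanResources.EmployeePayHistory is an assumption consistent with the discussion below):

    -- The subquery references soh.SalesPersonID from the outer
    -- query, so it is correlated and runs per qualifying row.
    SELECT SalesPersonID, SUM(TotalDue) AS TotalSales
    FROM Sales.SalesOrderHeader AS soh
    WHERE SalesPersonID IN (
      SELECT eph.BusinessEntityID
      FROM HumanResources.EmployeePayHistory AS eph
      WHERE eph.BusinessEntityID = soh.SalesPersonID
        AND eph.Rate > 25 )
    GROUP BY SalesPersonID;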

The SELECT statement returns the total sales for each sales rep with a pay rate greater than 25, giving us the results shown in the following table:

SalesPersonID TotalSales
285 172524.4512
287 732759.1841
274 1092123.8561

The statement uses a correlated subquery to retrieve the BusinessEntityID value for each rep with a pay rate that exceeds 25. As with any correlated subquery, this one depends on the outer query for at least one of its values, in this case, SalesPersonID. Because of this construction, the subquery must run for each row evaluated by the outer query. For a small table, this is not a big deal, but when we start talking about millions of rows, performance can go down the tubes.

One way to get around this is to join the two tables, rather than using a correlated subquery. This lets us work with the data in sets, rather than having to perform many individual operations. If the source table is large, and you’ll be filtering the data, you can take the approach described above, which is to create a temporary table that contains a subset of data, and then join on that.
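
For example, the join version might look like this (a sketch; grouping the pay history first is my own precaution against double-counting reps with multiple pay-history rows):

    SELECT soh.SalesPersonID, SUM(soh.TotalDue) AS TotalSales
    FROM Sales.SalesOrderHeader AS soh
      INNER JOIN (
        -- One row per rep whose top rate exceeds 25, so the
        -- join cannot multiply the order rows.
        SELECT BusinessEntityID
        FROM HumanResources.EmployeePayHistory
        GROUP BY BusinessEntityID
        HAVING MAX(Rate) > 25 ) AS eph
        ON soh.SalesPersonID = eph.BusinessEntityID
    GROUP BY soh.SalesPersonID;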

Note, however, that we should be evaluating our code not only in terms of how to use certain types of data sources, but also when not to use them. For example, the following SELECT statement uses a left outer join to retrieve products that are not listed as product assembly components:
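
A sketch of such a statement (the join column follows from the discussion of ProductAssemblyID below):

    -- Products with no match in BillOfMaterials come back
    -- NULL-extended and survive the IS NULL filter.
    SELECT p.ProductID, p.Name
    FROM Production.Product AS p
      LEFT OUTER JOIN Production.BillOfMaterials AS b
        ON p.ProductID = b.ProductAssemblyID
    WHERE b.ProductAssemblyID IS NULL;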

In this case, the statement returns 266 rows out of 504 total products, a relatively small number. But if we were talking millions of rows, we would quickly see the impact on performance. In such cases, we might consider an approach other than joining the Product table to the BillOfMaterials table. When we join the tables, the database engine loads all 2,679 rows from the BillOfMaterials table before filtering out the rows whose ProductAssemblyID column is not null, which are the majority of rows.

Instead of defining a join to the BillOfMaterials table, we can add a condition to the WHERE clause that uses the NOT EXISTS operator and a subquery to check values in the BillOfMaterials table. In this way, we’re simplifying our data set in favor of a more robust WHERE expression.
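
A sketch of the NOT EXISTS version:

    -- No join: the engine only has to probe BillOfMaterials
    -- for the absence of a matching assembly row.
    SELECT p.ProductID, p.Name
    FROM Production.Product AS p
    WHERE NOT EXISTS (
      SELECT 1
      FROM Production.BillOfMaterials AS b
      WHERE b.ProductAssemblyID = p.ProductID );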

In some cases, we might want to move in the opposite direction, turning expression logic into a data set. For example, the following T-SQL code creates a function and uses that function in a query’s select list:
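
The code might look something like this (the function name and logic are illustrative, not a definitive listing):

    -- Hypothetical scalar function that computes one order's total.
    CREATE FUNCTION dbo.fnGetOrderTotal (@SalesOrderID INT)
    RETURNS MONEY
    AS
    BEGIN
      RETURN (
        SELECT SUM(LineTotal)
        FROM Sales.SalesOrderDetail
        WHERE SalesOrderID = @SalesOrderID );
    END;
    GO

    -- The function is then called once per row in the select list.
    SELECT SalesOrderID, OrderDate,
      dbo.fnGetOrderTotal(SalesOrderID) AS OrderTotal
    FROM Sales.SalesOrderHeader;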

Whenever we include a scalar function in an expression, we have to keep in mind its impact on performance. As with the correlated subquery, the database engine will execute the query within the function for every row in the outer query.

One solution is to turn the scalar function into an inline table-valued function. We would then use the function within the FROM clause, using the CROSS APPLY operator to join the SalesOrderHeader table to the function. This way, we can eliminate a great deal of unnecessary processing and consequently deliver better performance.
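
Rewritten that way, the example above might look like this (again a sketch, with an illustrative function name):

    CREATE FUNCTION dbo.fnGetOrderTotalInline (@SalesOrderID INT)
    RETURNS TABLE
    AS
    RETURN (
      SELECT SUM(LineTotal) AS OrderTotal
      FROM Sales.SalesOrderDetail
      WHERE SalesOrderID = @SalesOrderID );
    GO

    -- CROSS APPLY lets the optimizer fold the function's query
    -- into the outer plan instead of executing it row by row.
    SELECT soh.SalesOrderID, soh.OrderDate, ot.OrderTotal
    FROM Sales.SalesOrderHeader AS soh
      CROSS APPLY dbo.fnGetOrderTotalInline(soh.SalesOrderID) AS ot;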

When writing T-SQL queries, we should always evaluate the statement’s elements to ensure that we’re using the data sets in a way to maximize performance, whether creating temporary table structures, joining data sets, or mixing things up in some other fashion.

Building comparison expressions

Like other elements of T-SQL coding, the ways in which we set up the conditional expressions in our WHERE clauses can impact performance. For example, suppose we use the NOT IN operator in a conditional expression:
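
Something like the following (a sketch; the IS NOT NULL filter is there because NOT IN returns no rows at all if the subquery yields a NULL):

    SELECT ProductID, Name
    FROM Production.Product
    WHERE ProductID NOT IN (
      SELECT ProductAssemblyID
      FROM Production.BillOfMaterials
      WHERE ProductAssemblyID IS NOT NULL );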

As with most performance-related issues, using the NOT IN operator for a small data set is usually not a problem, but it can be for a larger one. This is because the database engine will often resort to a table scan to find the data, rather than an index seek, even if an index is defined on the filtering column. Instead, consider using NOT EXISTS or perhaps joining the tables.

You can run into similar issues if you wrap the filtering column in a function, as in the following SELECT statement:
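
A sketch, including an illustrative definition for the fnGetProductName function discussed below:

    -- Hypothetical helper: look up a product's name by its ID.
    CREATE FUNCTION dbo.fnGetProductName (@ProductID INT)
    RETURNS NVARCHAR(50)
    AS
    BEGIN
      RETURN (
        SELECT Name
        FROM Production.Product
        WHERE ProductID = @ProductID );
    END;
    GO

    -- Wrapping the filtering column in the function hides it
    -- from any index on ProductID.
    SELECT SalesOrderID, SalesOrderDetailID, OrderQty
    FROM Sales.SalesOrderDetail
    WHERE dbo.fnGetProductName(ProductID) = N'Chainring Bolts';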

In this case, the fnGetProductName function retrieves the product name, based on the ProductID value, and then compares that name to the string expression. Because of this construction, the database engine is again likely to do a table scan, rather than an index seek, even if an index exists.

Another performance hole is to create a conditional expression that uses the LIKE operator in conjunction with specifying a wildcard at the beginning of the specified value:
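
For example (the search term is illustrative):

    -- The leading % prevents an index seek on the Name column.
    SELECT ProductID, Name
    FROM Production.Product
    WHERE Name LIKE N'%Frame%';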

The issue is not so much with the use of a wildcard, but rather with its placement at the beginning of the search value. In this position, the database engine must again do a complete table scan, reading every row of the Product table because it has no other way of ensuring that all potential values are returned.

When using the WHERE clause to filter data, keep in mind its impact on performance and always evaluate whether there is a better way of doing something. Even using a not equal comparison operator (<>) in a conditional expression can result in performance issues if the database engine cannot use the indexes effectively when generating a query plan. Try using a different construction to see whether that improves performance.

Returning result sets

When defining a query, you should ensure that you’re returning results as efficiently as possible. Even if a statement is technically correct and does not cause the database engine to balk, you might still inadvertently introduce performance issues. For example, the following CREATE PROCEDURE statement defines a very simple stored procedure that includes all the basic components, but fails to include a SET NOCOUNT ON clause:
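
A sketch of such a procedure (the name and body are illustrative):

    CREATE PROCEDURE Sales.uspGetSalesTotals
    AS
    BEGIN
      -- No SET NOCOUNT ON here, so a "rows affected" message
      -- accompanies the statement.
      SELECT SalesPersonID, SUM(TotalDue) AS TotalSales
      FROM Sales.SalesOrderHeader
      WHERE SalesPersonID IS NOT NULL
      GROUP BY SalesPersonID;
    END;
    GO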

By including a SET NOCOUNT ON clause at the beginning of your procedure, you’re telling the database engine to forgo sending the “rows affected” messages back to the client for each executed statement, helping to improve application performance. Although this can seem a small step to take, it will in fact help and should be included in most of your stored procedures, unless you have a specific reason for returning those messages.
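
With the clause added, the procedure might look like this:

    ALTER PROCEDURE Sales.uspGetSalesTotals
    AS
    BEGIN
      SET NOCOUNT ON;  -- suppress the per-statement row-count messages
      SELECT SalesPersonID, SUM(TotalDue) AS TotalSales
      FROM Sales.SalesOrderHeader
      WHERE SalesPersonID IS NOT NULL
      GROUP BY SalesPersonID;
    END;
    GO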

Another consideration when it comes to stored procedures is to maximize the benefit you get with each procedure by building them to return multiple result sets, when appropriate. This can reduce the number of connections and calls between the client and server. Just be sure that if you’re working with a data framework, it can handle multiple result sets.
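
For example, a single procedure might return two result sets in one round trip (a sketch with an illustrative name):

    CREATE PROCEDURE Sales.uspGetRepSummary
    AS
    BEGIN
      SET NOCOUNT ON;
      -- First result set: the reps themselves.
      SELECT BusinessEntityID, SalesQuota
      FROM Sales.SalesPerson;
      -- Second result set: their order counts.
      SELECT SalesPersonID, COUNT(*) AS OrderCount
      FROM Sales.SalesOrderHeader
      WHERE SalesPersonID IS NOT NULL
      GROUP BY SalesPersonID;
    END;
    GO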

When retrieving data, whether or not it is within a stored procedure, you should also be aware of how you use operators and other options within your statements. A good example of this is the UNION operator, which lets you combine similarly structured data sets, as in the following example:
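
A sketch of such a statement (the PersonType values are illustrative):

    SELECT BusinessEntityID, FirstName, LastName
    FROM Person.Person
    WHERE PersonType = N'SP'
    UNION
    SELECT BusinessEntityID, FirstName, LastName
    FROM Person.Person
    WHERE PersonType = N'EM';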

When you specify the UNION operator as I have here, the database engine combines the data sets, sorts the data in ascending order, and removes the duplicates. Not surprisingly, sorting and deduplication operations can add significant overhead, especially for large data sets. In some cases, you might not need to sort the data and remove duplicates. For example, the SELECT statements above return only unique rows because the BusinessEntityID column is included, which is the primary key. In such cases, we can add ALL to the UNION operator (as in UNION ALL) to join the data sets without incurring the sorting and deduplication overhead.

An issue related to this is the use of the DISTINCT keyword at the beginning of the select list, as in the following example:
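
For example:

    -- DISTINCT is redundant here: BusinessEntityID is the
    -- primary key, so every row is already unique.
    SELECT DISTINCT BusinessEntityID, FirstName, LastName
    FROM Person.Person;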

The DISTINCT keyword, as with the UNION operator, sorts the data and removes duplicates, which is fine if you’re working with only a small amount of data, but it can significantly impact performance on large data sets, depending on how the data has been indexed. In general, try to avoid DISTINCT for large data sets if you don’t really need it, as is the case in our example (once again, because we’ve included the BusinessEntityID column in the select list). If we still need to sort the data, we can add an ORDER BY clause.

One other issue worth pointing out is the use of the asterisk wildcard in place of a query’s select list, as in the following example:
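
For example:

    -- Returns every column, whether or not the caller needs it.
    SELECT *
    FROM Sales.SalesOrderHeader;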

As a general rule, you should avoid using the wildcard and instead specify the columns so you’re returning only the data you need. Why incur extra overhead to return unnecessary data? Besides, specifying the columns can also help prevent your applications from breaking, should the table structure change.

Proceed with caution

None of the issues we covered in the previous section are absolutes. At times, you might want to omit the SET NOCOUNT ON clause, use the UNION operator without ALL, or add DISTINCT to your select list. You might even want to use a wildcard in your select list, at least when running ad hoc statements. The point is, you should proceed with caution whenever using any of these.

There are also other T-SQL elements that warrant such careful consideration, such as the use of dynamic SQL, as in the following example:
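
A sketch of a dynamically constructed query (deliberately simple; real code should parameterize wherever possible):

    DECLARE @table NVARCHAR(100) = N'Production.Product';
    DECLARE @sql NVARCHAR(200) =
      N'SELECT ProductID, Name FROM ' + @table + N';';
    -- Concatenating object names like this is exactly the
    -- pattern that invites SQL injection.
    EXEC sp_executesql @sql;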

You’ll often see warnings about using dynamic SQL because of security risks (i.e., SQL injection attacks), but dynamic SQL can have performance implications as well, depending on how statements are constructed and executed. This is a huge topic, however, and well beyond the scope of this article, so be sure you do your homework before implementing dynamic SQL, for both security and performance reasons.

Also be cautious about defining triggers that return results to the application. An application that issues a request to modify data normally does not expect results, such as those in the following example:
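
A sketch of such a trigger (the EventLog table definition and trigger name are illustrative, not the article’s original schema):

    CREATE TABLE dbo.EventLog (
      EventID INT IDENTITY PRIMARY KEY,
      EventDescription NVARCHAR(200) NOT NULL,
      EventDate DATETIME NOT NULL DEFAULT GETDATE() );
    GO

    CREATE TRIGGER Production.trCultureInsert
    ON Production.Culture
    AFTER INSERT
    AS
    BEGIN
      -- Log the inserted rows.
      INSERT INTO dbo.EventLog (EventDescription)
      SELECT N'Culture added: ' + Name
      FROM inserted;

      -- This result set is sent back to the caller, which is
      -- usually unintended.
      SELECT CultureID, Name
      FROM inserted;
    END;
    GO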

The trigger’s main purpose is to log information to the EventLog table after data is inserted into the Production.Culture table. However, the trigger also includes a SELECT statement that returns the CultureID and Name values to the calling application. We can demonstrate this by running the following INSERT statement:
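
    INSERT INTO Production.Culture (CultureID, Name)
    VALUES (N'tst', N'test culture');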

After the data has been inserted, the database engine will log the data to the EventLog table and then return the following information:

CultureID Name
tst   test culture

Unless the application includes a mechanism for handling these values, the extra resources needed to return this information are wasted. This does not mean you must always avoid returning values, just be sure you have a reason for doing so and that the application can handle them.

You should also be cautious when including hints in your queries, such as those in the following example:
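
Something along these lines (a sketch that combines all three hints purely for demonstration):

    SELECT p.FirstName, p.LastName, soh.TotalDue
    FROM Person.Person AS p WITH (TABLOCK)           -- table hint
      INNER HASH JOIN Sales.SalesOrderHeader AS soh  -- join hint
        ON p.BusinessEntityID = soh.SalesPersonID
    OPTION (KEEP PLAN);                              -- query hint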

In this case, I’ve included the HASH join hint, TABLOCK table hint, and KEEP PLAN query hint (a lot of hints, I realize, but I wanted to demonstrate them all at once). Regardless of the type, you should use hints only when you’re certain that you can improve on the query optimizer.

The optimizer generally does a good job of maximizing performance, in which case, using hints could make things worse. But in those rare cases when the optimizer is falling short—and you know in your heart with absolute certainty that you can do a better job—then consider the judicious use of hints. Just make sure you fully test whatever you end up implementing, using a large enough data set to provide you with meaningful results.

One final issue worth mentioning is the use of SET options, such as FORCEPLAN, which is shown in the following example:
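
For example (a sketch):

    SET FORCEPLAN ON;

    -- With FORCEPLAN on, the tables are joined in the order
    -- listed in the FROM clause.
    SELECT soh.SalesOrderID, p.LastName
    FROM Sales.SalesOrderHeader AS soh
      INNER JOIN Person.Person AS p
        ON soh.SalesPersonID = p.BusinessEntityID;

    SET FORCEPLAN OFF;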

When you set FORCEPLAN to ON, the database engine joins the tables in the same order in which they’re specified in the FROM clause, overriding the query optimizer’s own join-order choices. As with hints, you want to be careful taking steps that change the optimizer’s behavior, unless you’re certain you know what you’re doing.

Plenty more to think about

In this article, we’ve covered a number of performance-related issues as they apply to T-SQL queries. We could have covered any one of these issues in more detail; many of them deserve articles of their own. There are also plenty of other issues we could have covered, such as avoiding the use of cursors, specifying schema names when referencing database objects, or not interweaving data manipulation language (DML) and data definition language (DDL) statements within a batch. That said, what we have covered should give you a good sense of the types of issues you should take into account in order to maximize performance.

Whenever you’re writing or updating T-SQL queries, you need to be thinking in terms of performance, along with security and data accuracy, of course. With performance, however, it’s not always apparent which statement construction will deliver the best results, and you might need to try different approaches before you get it right, keeping in mind that indexing and database design also play pivotal roles. The key is to evaluate each statement element to ensure that you’re taking the best approach to maximizing performance, even if it means trying several different approaches.