11 May 2018

1 Comment

Choosing Between Table Variables and Temporary Tables

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

People can, and do, argue a great deal about the relative merits of table variables and temporary tables. Sometimes, as when writing functions, you have no choice; but when you do you’ll find that both have their uses, and it’s easy to find examples where either one is quicker. In this article, I’ll explain the main factors involved in choosing one or the other, and demonstrate a few simple ‘rules’ to get the best performance.

Assuming you follow the basic rules of engagement, you should consider table variables as a first choice when working with relatively small data sets. They are easier to work with, and they trigger fewer recompiles in the routines in which they're used, compared to temporary tables. Table variables also require fewer locking resources, as they are 'private' to the process and batch that created them. SQL Prompt implements this recommendation as a code analysis rule, ST011 – Consider using table variable instead of temporary table.

If you are doing more complex processing on temporary data, or need to use more than reasonably small amounts of data, then local temporary tables are likely to be a better choice. SQL Code Guard includes a code analysis rule, based on this recommendation, ST012 – Consider using temporary table instead of table variable, but it's not currently implemented in SQL Prompt.

Pros and cons of table variables and temporary tables

Table variables tend to get ‘bad press’, because queries that use them occasionally result in very inefficient execution plans. However, if you follow a few simple rules, they are a good choice for intermediate ‘working’ tables, and for passing results between routines, where the data sets are small and the processing required is relatively straightforward.

Table variables are very simple to use, mainly because they are "zero maintenance". They are scoped to the batch or routine in which they are created, and are removed automatically once it completes execution, so using them within a long-lived connection doesn't risk 'resource hogging' problems in tempdb. If a table variable is declared in a stored procedure, it is local to that stored procedure and cannot be referenced in a nested procedure. There are also no statistics-based recompiles for table variables, and you can't ALTER one, so routines that use them tend to incur fewer recompiles than those that use temporary tables. They are also not fully logged, so creating and filling them is faster and requires less space in the transaction log. When they are used in stored procedures, there is less contention on system tables under conditions of high concurrency. In short, it is easier to keep things neat and tidy.
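As a quick illustrative sketch (the table and column names here are invented, not taken from the article), a table variable needs no explicit cleanup:

```sql
-- Hypothetical example: the table variable is scoped to this batch
DECLARE @RecentOrders TABLE
  (
  OrderID int NOT NULL PRIMARY KEY,
  OrderDate date NOT NULL
  );

INSERT INTO @RecentOrders (OrderID, OrderDate)
  VALUES (1, '2018-05-01'), (2, '2018-05-02');

SELECT OrderID, OrderDate FROM @RecentOrders;
-- No DROP TABLE needed: @RecentOrders disappears when the batch
-- completes, and it triggers no statistics-based recompiles
GO
```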

When working with relatively small data sets, they are faster than the comparable temporary table. However, as the number of rows increases, beyond approximately 15K rows, but varying according to context, then you can run into difficulties, mainly due to their lack of support for statistics. Even the indexes that enforce PRIMARY KEY and UNIQUE constraints on table variables do not have statistics. Therefore, the optimizer will use a hard-coded estimation of 1 row returned from a table variable, and so will tend to choose operators optimal for working with small data sets (such as Nested Loops operator for joins). The more rows in the table variable, the larger the discrepancies between estimation and reality, and the more inefficient become the optimizer’s plan choices. The resulting plan is sometimes frightful.

The experienced developer or DBA will be on the lookout for this sort of problem, and be ready to add the OPTION (RECOMPILE) query hint to the statement that uses the table variable. When we submit a batch containing a table variable, the optimizer first compiles the batch, at which point the table variable is empty. When the batch starts executing, the hint will cause only that single statement to recompile, at which point the table variable will be populated and the optimizer can use the real row count to compile a new plan for that statement. Sometimes, but rarely, even this won't help. Also, over-reliance on this hint will negate, to some extent, the advantage that table variables have of causing fewer recompiles than temporary tables.
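The hint goes on the individual statement, along these lines (a sketch with hypothetical names; @BigList stands for a previously declared and populated table variable):

```sql
-- Only this statement recompiles, after @BigList is populated, so the
-- optimizer sees the true row count rather than its fixed estimate of one row
SELECT b.ID, b.Payload
  FROM @BigList AS b
    INNER JOIN dbo.SomeTable AS s
      ON s.ID = b.ID
OPTION (RECOMPILE);
```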

Secondly, certain index limitations with table variables become more of a factor when dealing with large data sets. While you can now use the inline index creation syntax to create non-clustered indexes on a table variable, there are some restrictions, and there are still no associated statistics.
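The inline index syntax mentioned above, available from SQL Server 2014, looks like this (the names are invented for illustration):

```sql
DECLARE @Lookup TABLE
  (
  ID int NOT NULL PRIMARY KEY,  -- an index enforces the constraint, but it carries no statistics
  Category int NOT NULL,
  INDEX IX_Category NONCLUSTERED (Category)  -- inline non-clustered index, SQL Server 2014 onwards
  );
```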

Even with relatively modest row counts, you can encounter query performance issues if you try to execute a query that is a join, and you forget to define a PRIMARY KEY or UNIQUE constraint on the column you are using for the join. Without the metadata that they provide, the optimizer has no knowledge of the logical order of the data, or whether the data in the join column contains duplicate values, and will likely choose inefficient join operations, resulting in slow queries. If you're working with a table variable heap, then you can only use it as a simple list, one that is likely to be processed in a single gulp (table scan). If you combine the OPTION (RECOMPILE) hint, for accurate cardinality estimations, with a key on the join column, to give the optimizer useful metadata, then for smaller data sets you can often achieve query speeds similar to, or better than, using a local temporary table.

Once row counts increase beyond a table variable's comfort zone, or you need to do more complex data processing, then you're better off switching to temporary tables. Here, you have the full range of indexing options available, and the optimizer will have the luxury of using statistics for each of these indexes. Of course, the downside is that temporary tables come with a higher maintenance cost. You need to make sure to clear up after yourself, to avoid tempdb congestion. If you alter a temporary table, or modify the data in it, you may incur recompiles of the parent routine.

Temporary tables are better when there is a requirement for a large number of deletions and insertions (rowset sharing). This is especially true if the data must be entirely removed from the table, as only temporary tables support truncation. The compromises in the design of table variables, such as the lack of statistics and recompiles, work against them if the data is volatile.
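A quick sketch of the truncation difference (object names invented): a temporary table can be truncated, whereas a table variable can only be emptied row-by-row with DELETE:

```sql
CREATE TABLE #Staging (ID int NOT NULL PRIMARY KEY);
INSERT INTO #Staging (ID) VALUES (1), (2), (3);
TRUNCATE TABLE #Staging;   -- supported, and minimally logged

DECLARE @Staging TABLE (ID int NOT NULL PRIMARY KEY);
INSERT INTO @Staging (ID) VALUES (1), (2), (3);
DELETE FROM @Staging;      -- TRUNCATE TABLE is not allowed on a table variable

DROP TABLE #Staging;
```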

When it pays to use table variables

We'll start with an example where a table variable is ideal, and results in better performance. We will produce a list of employees for AdventureWorks, the department each works in, and the shifts they work. We're dealing with a small data set (291 rows).

We will put the results in a second temporary table, as if we were passing the result on to the next batch. Listing 1 shows the code.
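Listing 1 isn't reproduced here, but a sketch of the sort of query it times might look like the following, using the standard AdventureWorks schema (the exact code in the original listing may differ):

```sql
DECLARE @Workers TABLE
  (
  EmployeeName nvarchar(101) NOT NULL,
  Department nvarchar(50) NOT NULL,
  [Shift] nvarchar(50) NOT NULL
  );

INSERT INTO @Workers (EmployeeName, Department, [Shift])
SELECT p.FirstName + ' ' + p.LastName, d.Name, s.Name
  FROM Person.Person AS p
    INNER JOIN HumanResources.EmployeeDepartmentHistory AS edh
      ON p.BusinessEntityID = edh.BusinessEntityID
    INNER JOIN HumanResources.Department AS d
      ON edh.DepartmentID = d.DepartmentID
    INNER JOIN HumanResources.Shift AS s
      ON edh.ShiftID = s.ShiftID
  WHERE edh.EndDate IS NULL;  -- current department assignment only
```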

And here is a typical result on my slow test machine:

Using a temporary table is consistently slower, though individual runs can vary quite a lot.

The problems of scale and forgetting to provide a key or a hint

What’s the performance like if we join two table variables? Let’s try it out. For this example, we need two simple tables, one with all the common words in the English language (CommonWords), and the other with a list of all the words in Bram Stoker’s ‘Dracula’ (WordsInDracula). The TestTVsAndTTs download includes the script to create these two tables and populate each one from its associated text file. There are 60,000 common words, but Bram Stoker only used 10,000 of them. The former is well outside the break-even point, where one starts to prefer temporary tables.

We’ll use four simple, outer join queries, testing the result for NULL values, to find out the common words that aren’t in Dracula, common words that are in Dracula, words in Dracula that are uncommon, and finally another query to find common words in Dracula, but joining in the opposite direction. You’ll see the queries shortly, when I show the code for the Test Rig.
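Each of the four queries takes the same general shape; here is a sketch of the first one (the full test-rig code, with timing, appears in Listing 2):

```sql
-- Common words that don't appear in Dracula; the other three queries
-- swap the join direction or the NULL test
SELECT COUNT(*) AS CommonWordsNotInDracula
  FROM @CommonWords AS cw
    LEFT OUTER JOIN @WordsInDracula AS wd
      ON cw.Word = wd.Word
  WHERE wd.Word IS NULL;
```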

Following are the results of the initial test runs. In the first run, both table variables have primary keys, and in the second they are both heaps, just to see if I’m exaggerating the problems of not declaring an index in a table variable. Finally, we run the same queries with temporary tables. All tests were run, deliberately, on a slow development server, for purposes of illustration; you will get very different results with a production server.

The results show that when the table variables are heaps, you run the risk of the query running for ten minutes rather than 100 milliseconds. These results are a great example of the ghastly performance you can experience if you don't know the rules. Even when we use primary keys, though, the number of rows we're dealing with means that using temporary tables is now twice as fast.

I won't delve into the details of the execution plans behind these performance metrics, other than to give a few broad explanations of the main differences. For the temp table queries, the optimizer, armed with a full knowledge of cardinality and the metadata from the primary key constraints, chooses an efficient Merge Join operator to perform the join operation. For the table variables with primary keys, the optimizer knows the order of the rows in the join column, and that they contain no duplicates, but assumes it's only dealing with one row, and so chooses instead a Nested Loops join. Here, it scans one table and then, for each row returned, performs individual seeks of the other table. This becomes less efficient the larger the data set, and is especially bad in the cases where it scans the CommonWords table variable, because it results in over 60K seeks of the Dracula table variable. The Nested Loops join reaches 'peak inefficiency' for the two ten-minute queries using table variable heaps, because it entails thousands of table scans of CommonWords. Interestingly, the two "common words in Dracula" queries perform much better, and this is because, for those two, the optimizer chose instead a Hash Match join.

Overall, the temp tables look to be the best choice, but we’re not finished yet! Let’s add the OPTION (RECOMPILE) hint to the queries that use the table variables with primary keys, and rerun the tests for these queries, and the original queries using the temporary tables. We leave out the poor heaps for the time being.

As you can see, the performance advantage of the temporary table vanishes. Armed with correct row counts and ordered inputs, the optimizer chooses the far more efficient Merge Join.

What, you wonder, would happen if you gave those poor heaps the OPTION (RECOMPILE) hint too? Lo, the story changes for them so that all three timings are much closer.

Interestingly, the two “common words in Dracula” queries that were fast even on heaps are now much slower. Armed with the correct row counts, the optimizer changes its strategy, but because it still has none of the useful metadata available to it when we define constraints and keys, it makes a bad choice. It scans the CommonWords heap then attempts a “partial aggregation”, estimating that it will aggregate down from 60K rows to a few hundred. It doesn’t know that there are no duplicates, so in fact it doesn’t aggregate down at all, and the aggregation and subsequent join spill to tempdb.

The Test Rig

Please note that this is the test rig in its final form showing roughly equal performance for the three different types of table. You will need to remove the OPTION (RECOMPILE) hints to get back to the original.

Listing 2

Conclusions

There is nothing reckless about using table variables. They give a better performance when used for the purposes for which they were intended, and they do their own mopping-up. At a certain point, the compromises that give them a better performance (not triggering recompiles, not providing statistics, no rollback, no parallelism) become their downfall.

Often, the SQL Server pundit will give sage advice about the size of result that will cause problems for a table variable. The results I've shown you in this article suggest that this oversimplifies the issues. There are two important factors. First, if you have a result of over, let us say, 1,000 rows (a figure that depends on context), then you need a PRIMARY KEY or UNIQUE key for any queries that join to the table variable. Second, at a certain point, you will also need to trigger a recompile to get a decent execution plan, which has its own overhead.

Even then, performance can suffer badly, especially if you’re performing more complex processing, because the optimizer still has no access to statistics, and so no knowledge of the selectivity of any query predicate. In such cases, you’ll need to switch to using temporary tables.

