ShowPlan Operator of the Week – Split, Sort, Collapse

'Split, Sort & Collapse' is, happily, not a description of the intrepid Fabiano after his epic series of articles about the ShowPlan operators. With renewed stamina, Fabiano continues his mission to describe all the major operators by explaining the Trivial Plan and the power of unique indices.

Once again, here I am to talk to you about ShowPlan operators. Over the past few weeks and months, we’ve featured the ShowPlan operators used by SQL Server to build query plans. If you’re just getting started with my Showplan series, you can find a list of all my articles here. Before we start to talk about the operators of this article, I’d like to ask you to participate in the creation of this series; if you want to know more about a Showplan operator that I haven’t covered yet, please leave a comment in this article with your suggestion.

I also want to thank all of you for the kind words/comments I’m receiving about the articles; I’m flattered to know you are enjoying this series and that it has been useful for you.

Introduction

Today we’ll feature two new operators and describe how the sort operator can be used to help with the validation of a query. The sort operator has already been covered in this article, so, if you want to know more about it just click here.

In short, these operators, Split and Collapse, are used to identify a phantom unique key violation. To help to understand this better, I’ll start by explaining the unique index and how it is used by the query optimizer: Then I’ll go on to present a update command that uses these operators.

Unique Index

The Unique Index is an index that guarantees that no duplicate values are allowed in a specified key column. In other words, if you want to be sure that some value can’t be duplicated in your table, then a unique index can be made responsible for enforcing this. A classic use for a unique index is as the primary key of your table, where no duplicated rows are allowed.

Every time that you create a new index, decide whether this can be a unique index. Why? Because the Query Optimizer can then use this index to simplify a query plan: If it knows that the rows are always going to be unique, it knows that the selectivity is always one. The more selective an index has, the greater the likelihood that it will be used by a query.

Creating sample data

To illustrate the operators that are the subject of this article, I’ll start by creating one table called “TabTest”. The following script will create the table and populate it with some meaningless data:

You’ll have noticed that the table has four indexes, one for each column. The difference between the columns Name and Name2 is that the indexes on the columns with the prefix “2” are non-unique indexes, and the index on the columns Name and Val are unique.

The value of the columns Name2 and Val2 are the same as the columns Name and Val.

Here is what the data looks like:

1276-Fig1.jpg

Querying an Unique Index

Now that we have created and populated the table, let’s try two queries: First, a query that selects all the rows using a filter on the column with the unique index:

For the query above we have the following execution plan:

1276-Fig2.jpg

As we can see from the execution plan, SQL Server chose to perform an Index Seek on the ix_Name_Unique index and a Key Lookup to read the data on the clustered index.

Now let’s try to run the same query using the column with the non-unique index:

For the query above we have the following execution plan:

1276-Fig3.jpg

As we can see, now SQL Server chose to run a scan on the clustered index.

Because the first query is using a Unique Index, SQL Server decides to simplify matters by merely creating a Trivial Plan. Oops! I have a feeling that I haven’t yet described what a trivial plan is, so let me do it now.

Trivial Plan

An execution plan is created during the optimization process.  Generally,  SQL Server has to  use a variety of techniques to try to optimize your query in order to  create the best-possible plan. Sometimes, however, the Query Optimizer can decide that it does not need to do the full  optimization process  to find the best plan, Because it has determined that there is a plan that is  good enough.

Within the execution plan, we can see if the query optimizer has decided that the Plan is trivial by looking at the properties of the plan: For instance, let’s look at the execution plan of the first query:

1276-Fig4.jpg

Apart from that, we also can see if the plan is trivial by looking at the XML plan:

1276-FigMisc.jpg

A trivial plan is created when the Query Optimizer finds an optimal way of read the data that you are querying. For instance a “select * from table” probably will create a trivial plan that merely reads the data from the clustered index. When a trivial plan is selected, the Query Optimizer doesn’t then need to expend resources in trying to figure out the best plan because it knows that this trivial plan is the best option.

If a plan is determined to be trivial, then the query isn’t recompiled when an Update Statistics occurs on the statistics, and an Auto Update Statistics isn’t fired. Let’s see this in the practice.

Let’s try the first query. If we look at the cache plan, we can see that we have the trivial plan and that it was been reused:

1276-Fig5.jpg

In this query, I start by freeing the plan cache; then I run the query twice. Finally, I query the DMVs to find out how many times the plan was reused, and to see the StatementOptmLevel within the XML plan.

As we can see in the picture (column UseCounts), the query was executed twice and the plan was reused.

Now, let’s insert some data in the table in order to force an automatic update statistics. If you want to know how much data do you will need to change so as to trigger an update statistics, then read the item 13 of this article.

After inserting 500 rows into the table, I’ll run the query again and check whether the plan was reused or if a new plan was created as a result of the auto update statistics. Optionally, you can check the event SP:StmtCompleted in the profiler to see if the update statistics was executed.

1276-Fig6.jpg

Now we can see that the plan was reused, even after we had inserted sufficient data to fire an update statistics and thereby cause the creation of a new plan.

Let’s try the same exercise with the non-trivial plan:

1276-Fig7.jpg

Again the plan was reused. Now let’s see what happens after we insert the new data.

1276-Fig8.jpg

As we can see, the auto update statistics was now triggered and a new plan was created.

Full Optimization

An interesting thing is that, in the execution plans, we can only see two level of optimization, TRIVIAL and FULL. But behind the scenes we can see that the FULL optimization is divided into three steps: These steps are called Search 0, Search 1 and Search 2. (Yes, I don’t like these names either, I expected something more intuitive).

If we query the DMV sys.dm_exec_query_optimizer_info, we can see which phase was executed in a FULL optimization.

1276-Fig9.jpg

In the picture above, we can see that the Search 1 counter has increased by one. If you run the query using the unique index, you will see the trivial plan counter increasing.

Some queries require a full optimization. This means that, if you have a heavy query with lot of joins, it may use search 1 or if you have a query that maybe uses a indexed view it will use search 2. You can play with the DMV to see the level of optimization that you got in your query.

Back to – Querying an Unique Index

After explaining both trivial and full optimization, let’s see more samples that demonstrate the benefit of having a unique index.

The Query Optimizer can :avoid an unnecessary distinct:

1276-Fig10.jpg

In this execution plan, the Query Optimizer knows that the DISTINCT clause is totally unnecessary, and it doesn’t waste time trying to remove the duplicated occurrences because it knows that, if the column is unique, then it is sufficient to just read the data from the index.

The Query Optimizer can avoid an unnecessary aggregation:

1276-Fig11.jpg

As we can see, there is no aggregation needed to compute the SUM because SQL Server knows that there is only one row per Name. SQL is reading the data from the clustered index. The mere presence of the unique index is enough to enable a optimization, even though SQL Server is not actually reading the data from that index.

The Query Optimizer can avoid an assert validation:

1276-Fig12.jpg

Here we can see that SQL Server is not using the operator Assert to validate the Sub-Query expression. If you want to know more about the operator assert click here.

Non-Unique Index and Updates

Every update is executed in two main steps. Firstly, SQL Server has to read the data that will be updated, and then it must update these values.

Let’s analyze a plan that updates a non-unique indexed column. Here is a query that updates the column Val2 that doesn’t have a unique index.

1276-Fig13.jpg

Execution Plan in text mode:

As usual, let’s go through this execution plan step by step analyzing what is happening with some operators of the plan.

  • Clustered Index Scan – Here, the SQL Server is reading the data that will be updated on the clustered index.
  • TOP – The TOP operator is necessary in order to perform the SET ROWCOUNT. I know that sounds a little weird the first time that you hear it, because we aren’t executing the SET ROWCOUNT. What happens here is that, if you change the value of the ROWCOUNT, it will not trigger a recompilation of the plan, which means that SQL has to find some way to execute the query using a cached plan, plus updating just those rows specified in the ROWCOUNT.
  • Compute Scalar – In this operation, we can see the new optimization that is created on SQL Server 2005 in order to avoid the update of a value that hadn’t changed. I already covered this optimization at the Item 11 of this article.

If we analyze the text execution plan, we can see that there is a case expression that checks whether the value has changed:

As we can see, if the value is the same, the value 1 (one) will be returned to the [Expr1007] and if not, then 0 (zero) will be returned.

  • Clustered Index Update – Here, the optimizer will execute the Update. An interesting thing here is that we can call this plan a “Narrow Plan” or a “Per-Row Plan”.

You’ll have probably noticed from the text execution plan that there are two indexes being updated by this operator, the clustered index and the index ix_Val2_NonUnique.

Every time that you update a column that has a non-clustered index, SQL Server has to maintain the data in the clustered and the nonclustered indexes by updating it. Usually this update is executed in the clustered index key order, which means that the nonclustered updates are executed in a random order (because it’s on the clustered index order) and not in the nonclustered key order.

Sometimes, the Query Optimizer can create a “Wide Plan” that reads the data from each updated index and executes the update in the index order

To show you a sample of a wide plan, I’ll create some new indexes and I’ll change the number of rows and pages on the statistics of the table TabTest:

 

After messing with the statistics, let’s try the same update:

1276-Fig14.jpg

1276-Fug15.jpg

Execution Plan in Text Mode:

Now we have a bigger plan and we can see that SQL is updating one index per time, using the data sorted by the index key.

Unique Index and Updates

After some explanation about unique indexes, it is now time to show you, in action, the three operators that are the subject of this article. Let’s run the update in the column with the unique index and analyze the plan.

1276-Fig16.jpg

Part of the plan was omitted so as to concentrate on the topic.

In the beginning of the article I made the statement that these operators are used to avoid a phantom unique key violation. Let’s try to understand that statement a bit better.

The data on the column Val is the following:

1276-Fig17.jpg

We already know that an update is executed by a delete operation followed by an insert operation. You can see an explanation about it on the Item 11 of this article. We can also detect that the update has been executed by the clustered key order.

Following this logic, let’s mimic what will happen.

  1. Read the first row from the clustered index to be updated. Result, ID = 1
  2. Find the row ID = 1 in the unique index. Result, Val = 1
  3. Delete the row with the value Val = 1
  4. Insert the new value Val = 1 + 1 (2)

Here we have a phantom unique key violation because the value 2 already exists in the Unique Index. Because this value will eventually also be updated, this is a false violation.

To avoid it, SQL Server uses the operators Split, Sort and Collapse in order to reorganize the Deletes and the Inserts.

In our table we have the following rows to be updated:

1276-Fig18.jpg

The column Val is the key column that we need to update, and the column New_Val is the value of the column after the update.

After the Split occurs, SQL Server get the deletes and the inserts of the values, and a new list is created:

1276-Fig19.jpg

The Split creates a list with all the deletes and the inserts, but this list is not properly ordered because the insertion of the value 2 is done before the delete, so we still have the phantom violation.

The operator Sort is used to reorder the list by Action plus the Key Value, and the Deletes then appear before the Inserts. So, after the sort operation, we have the following list:

1276-Fig20.jpg

Now the delete appears before the insert. If SQL Server runs the deletes and inserts in that order then all goes well and no phantom read occurs. But we still have the collapse operator.

The collapse operator combines those adjacent ‘delete’ and ‘insert’ pairs that share the same key value into a single ‘update’. Collapse finds ‘delete’s and ‘insert’s for the same key and changes them into just one ‘update’.

For instance, why should I have to delete and insert a new row with the value 2? If I can just ignore this step then it’s a good improvement. Right?

After the collapse we have the following actions to be executed:

1276-Fig21.jpg

Because I know you are clever, you are probably thinking. – Ummm. Why update the value 2 to 2? Isn’t better if we got a list just like the following?

1276-Fig22.jpg

If we perform these two actions we have the expected final result.

I have to confess that I didn’t know why SQL Server executed these updates until I watched the Conor Cunningham session at the PASS 2010 talking about updates.

According to him, SQL Server needs to update the same row to guarantee that the stored engine puts the locks on the rows that are being updated. If SQL Server doesn’t lock the row, another transaction can change the row and the update would consequently fail.

Finally

I want to say ‘thank you’ to the people from the Query Optimization community (yes there are some guys out there just talking about QO) that indirectly help me to understand it better. Paul White (BTW congrats for the MVP award), Conor Cunningham, Benjamin Nevarez (BTW congrats for your Book) and Craig Freedman.

If you don’t follow these guys, you are missing a lot of good stuff about query optimization.

That’s all folks, I hope you’ve enjoyed learning about these operators, and I’ll see you soon with more “Showplan Operators”.