Oracle to SQL Server: Putting the Data in the Right Place

Jonathan Lewis is an Oracle expert who is recording his exploration of SQL Server in a series of articles. In this fourth part, he turns his attention to clustered indexes, and the unusual requirement SQL Server seems to have for the regular rebuilding of indexes.

In the previous article in this series, I investigated SQL Server’s strategy for free space management in heap tables, and concluded that SQL Server does not allow us to specify the equivalent of a fill factor for a heap table. This means that updates to heap tables will probably result in an unreasonable amount of row movement, and so lead to an unreasonable performance overhead when we query the data.

In this article, it’s time to contrast this behaviour with that observed for clustered indexes, and to find out how (or whether) clustered indexes avoid the update threat suffered by heap tables. There is an important reason behind this investigation – and it revolves round a big (possibly cultural) difference between SQL Server and Oracle. Although there are always special cases (and a couple of bug-driven problems), indexes in Oracle rarely need to be rebuilt – but in the SQL Server world I keep getting told that it’s absolutely standard (and necessary) practice to rebuild lots of indexes very regularly.

Ultimately I want to find out if regular index rebuilding really is necessary, why it’s necessary, and when it can be avoided. I’m fairly certain that I’m not going to be able to find the answers in the space of a single article – but I might be able to raise a few interesting questions.

Space management in Heaps: a correction

Before talking about clustered indexes, though, I need to update my observations on heap tables. After further research I realized that my initial observation, that SQL Server wasn’t leaving any space in heap tables for rows to grow after the initial insert, was incomplete. SQL Server’s algorithm does allow free space, about 5% of the page (after overheads), for row growth. However, SQL Server only checks to see if you’ve hit the limit after a statement has completed. This means that a single-row insert will leave you with free space of about 404 bytes, possibly reduced by as much as the average row length (depending on how much free space there was in the block just before you inserted the row). However, multi-row (array) inserts could easily cause blocks to be filled well past 95% and leave you with much less free space … all the way down to the zero free space that I had observed in my initial, array-based, tests.

By comparison, Oracle checks PCTFREE (the equivalent of 100 – fillfactor; it defaults to 10, i.e. a fillfactor of 90, and is user-definable at the table level) before each row is inserted, to make sure that the insert won’t overshoot the limit. This means that row inserts will leave you with an average free space of PCTFREE percent (plus up to the average row length), which is why heap tables can be more effective in Oracle than they are in SQL Server at avoiding row migration overheads.
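As a reminder of the Oracle side of the comparison, here is a minimal sketch of how pctfree is declared (the table itself is purely illustrative):

CREATE TABLE test_heap (
    id       NUMBER,
    padding  VARCHAR2(100)
)
PCTFREE 10;    -- 10 is the default, i.e. the equivalent of a fillfactor of 90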

Space Management in Clustered Indexes

When creating a clustered index, we have a number of options to consider. Should we base our cluster on a unique or non-unique index? If unique, then should we use a “real key” or (following one of the more popular strategies) introduce a “synthetic key” through the use of an identity column or GUID column?

Let’s start with the “clustering on an identity” option, with a unique index, and see what we can discover.

Using an IDENTITY clustering key

The test table, shown in Listing 1, is defined with an IDENTITY column, and has a clustered index with a fillfactor of 90. We then insert into the table 5000 rows, one row at a time, at an average row length of about 120 bytes.

Listing 1
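The original listing isn’t reproduced here, but a minimal sketch of the kind of script it describes might look like this. The table name, the padding column and the exact expressions are my assumptions; the id, random_data and vc_small columns are the ones referred to later in the article.

SET NOCOUNT ON;

CREATE TABLE dbo.test_table (
    id          int IDENTITY(1,1) NOT NULL,
    random_data int          NOT NULL,
    padding     char(100)    NOT NULL,    -- pads the row out to roughly 120 bytes
    vc_small    varchar(20)  NULL         -- left empty, ready for the later update tests
);

CREATE UNIQUE CLUSTERED INDEX test_table_ci ON dbo.test_table(id)
WITH FILLFACTOR = 90;

DECLARE @i int = 1;
WHILE @i <= 5000                          -- 5,000 single-row inserts
BEGIN
    INSERT INTO dbo.test_table (random_data, padding)
    VALUES (CAST(50 * RAND() AS int), 'x');
    SET @i = @i + 1;
END;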

After generating this data, I queried sys.allocation_units to check the space allocated for, and used by, the table (see Listing 2 in my previous article for a sample of the code) and got the following result.

As you can see, the table and index are combined in a single object that has taken up a total of 81 pages, of which 79 are data pages.
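For reference, a sizing query along the following lines does the job; this is a sketch of the approach referenced above, and the exact columns selected are my own choice.

SELECT  au.type_desc, au.total_pages, au.used_pages, au.data_pages
FROM    sys.allocation_units au
JOIN    sys.partitions p ON p.hobt_id = au.container_id
WHERE   p.object_id = OBJECT_ID('dbo.test_table');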

Index rebuilds and the fillfactor

Before we do anything else, let’s rebuild the index and check to see if this changes the space usage, as shown in Listing 2.

Listing 2
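A sketch of the rebuild step (the index name comes from my earlier sketch, not from the original listing):

ALTER INDEX test_table_ci ON dbo.test_table REBUILD;
-- no FILLFACTOR specified, so the value of 90 recorded when the index was created is applied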

Now, our query against sys.allocation_units reveals that the index has increased in size from 81 used pages to 89 used pages:

Why? Because of the fillfactor and the way it’s used. It’s easy to forget that the fillfactor applies only when you create or rebuild an index on existing data. Our index was based on an identity value, so each new value was larger than the previous one, and our cycle of inserts left each index page packed as full as it could get. When we rebuilt the index, SQL Server rebuilt it at only 90% full. If you check the arithmetic, you can see that 89 is very close to 81 * 100 / 90.

Oracle and the pctfree parameter
The behaviour observed here is the same as that seen in Oracle, where many people fail to realise that the pctfree parameter, which is analogous to 100 – fillfactor, only applies to indexes when creating or rebuilding them on existing data.

Investigating page splitting

So what happens if I now want to update a row and make it a little bit longer (I left a couple of columns empty when I generated the data in anticipation of this test)? If a row grows too big it’s not going to fit into the page anymore, but SQL Server has to maintain the correct index structure, so it’s going to have to do something with that row so that it “stays in logically the same place” but also “physically moves to where it can fit”. In effect, because the data on the page has grown, the page may have to split its contents across multiple leaf pages – possibly introducing a new page into the structure to do so.

There are two well-known strategies for such a split. The more complex strategy will consider the contents of the previous and next leaf pages and may redistribute rows forwards and backwards, with the option to introduce a new empty leaf page into the index structure if the other two pages are already close to full. The aim of this strategy is to ensure that any leaf page uses at least two-thirds of its space for data storage.

The less complex, and more commonly implemented, strategy simply links a new page into the index structure and transfers roughly half the data into it. An Oracle DBA would refer to this as a “50/50 leaf node split”.

SQL Server Books Online doesn’t tell us what strategies SQL Server uses; so it’s worth running up a few tests to see what patterns of behaviour appear.

After recreating my test data (omitting the index rebuild) and doing a couple of calls to dbcc ind and dbcc page, I found that most of my leaf pages held 64 rows, of 124 bytes each, leaving (after overheads) 32 bytes of free space. I picked the page holding ids 769 to 832 (file id 1, page id 165, in my case) and updated four rows in the page one after the other, setting the vc_small column to a string of 10 characters each time.

You’ll notice in my test I’ve tried to avoid rows that might be considered “special cases”, i.e. the first or last rows in the block, and the rows right in the middle of the block. On the fourth update, and not before, the number of used_pages and data_pages reported by my “table size” query (see previous article) grew by one.
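The commands involved look something like this. The database name and the specific id values are illustrative, and dbcc ind and dbcc page are undocumented commands.

DBCC TRACEON(3604);                        -- route dbcc page output to the client
DBCC IND('test_db', 'test_table', 1);      -- list the pages in index id 1 (the clustered index)
DBCC PAGE('test_db', 1, 165, 3);           -- dump file 1, page 165, at detail level 3

UPDATE dbo.test_table SET vc_small = 'xxxxxxxxxx' WHERE id = 780;
UPDATE dbo.test_table SET vc_small = 'xxxxxxxxxx' WHERE id = 785;
UPDATE dbo.test_table SET vc_small = 'xxxxxxxxxx' WHERE id = 810;
UPDATE dbo.test_table SET vc_small = 'xxxxxxxxxx' WHERE id = 815;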

A warning about “random” data
Oracle generates random data with a package called dbms_random, which includes a seed() function. The seed() function gives us a mechanism to generate the same set of “random” data when we want to repeat the test. I haven’t yet found a simple equivalent in SQL Server, so my tests that use random data are not 100% repeatable.

Re-running the dbcc ind command, I could see that the “logical” order of pages in the index had changed from … 164, 165, 166 … to … 164, 165, 232, 166 …; and using the dbcc page command I could see that page 165, which used to hold 64 rows, now held only 31 rows, ranging from id = 769 to id = 799, while page 232 held the 33 rows with ids ranging from 800 to 832. A 50/50 leaf node split had appeared after just a few small updates to rows in the clustered index – and SQL Server had not used the more sophisticated split to move rows forwards and backwards from the adjacent blocks.

I tried a couple of variations on this experiment to see if I could get SQL Server to do anything other than a 50/50 split, but it seemed to be the only strategy used. In fact one of the tests suggested that SQL Server would even do a 50/50 leaf node split to preserve “ghost” (deleted) rows when an update to a row needed more free space than was available in a block.

Side note: I need to review the effects of the different isolation levels on “ghost” data, and look for variations in the strategies for cleaning out data that has been marked as deleted. It’s possible that different isolation levels could result in dramatically different behaviour in index leaf blocks.

Heaps versus IDENTITY-clustered indexes

Let’s consider the implications of what we’ve seen so far. If we use a simple heap table, our data will be “naturally clustered” by time of insert. If we use single row inserts, each page will have, at most, 5% free space available to accommodate subsequent update to rows – and there is no option to vary this percentage. If we use array inserts the free space may be much smaller, even going all the way down to none, and we may see a lot of “row migration” as updated rows are moved to different pages and leave a “forwarding address” behind. The consequence of row migration is that data that used to be clustered by time of insertion may end up scattered across a large number of pages, thus increasing the work and time required for some queries.

If we use a unique clustered index based on an IDENTITY we will still be clustering our data by time of insert. Whether we use single row inserts or array inserts the amount of free space we will leave in each page will be less than the typical size of a row at time of insertion – and, just like heap tables, may be right down to zero bytes.

If subsequent updates cause rows to grow then we suffer the risk of leaf page splits. On the plus side, leaf page splits will still keep the data fairly well clustered, i.e. rows that belong together stay together, although they will be shared over two pages instead of being packed into just one. This is potentially much better than the scattering of individual rows that can take place with row migration from heap tables. On the minus side, a relatively small volume of updates could split most of the pages in the index and leave most of those pages using only 50% of the available space, hence doubling the storage space and cache required for the index.

Side note: It’s worth highlighting the fact that the effect of the leaf page split is to make the “physical” order of leaf pages different from the “logical” order of leaf pages. If you are doing large index range scans, i.e. ones that span several consecutive pages, then you have to wonder whether this could have any impact on SQL Server’s ability to optimise I/O with its extent-based read-ahead. For an index clustered on an IDENTITY column that’s been used as a meaningless key, this may be irrelevant. (After all, what’s the rationale for a query like “give me all the rows where a meaningless number is between X and Y”; aren’t you more likely to be running queries about some meaningful aspect of the data, using an alternative indexed access path?)

If you want to keep your data clustered by time of insert, then you need to think very carefully about how you achieve this aim. I don’t think I’ve found anyone arguing in favour of heap tables over clustered indexes in my searches on the Internet – but I’ve just shown that things like average row lengths, method of row insertion, and the timing and scale of updates can have a significant impact on the consequent efficiency of the data storage and clustering – which presumably creates an environment where constant housekeeping seems to be necessary.

Using a non-sequential clustering key

At this point, I have to admit to a bias: I have a strong aversion to creating keys that don’t have any corresponding meaning in the real world, so the idea of introducing an IDENTITY column just so that it can be the key for a clustered index is one that falls into the category of “probably wrong until you prove it right”.

This aversion transcends software, of course; I dislike meaningless keys on principle and many of the problems I see on large Oracle systems are side effects of giving every table a meaningless sequence-based key rather than using the “real-world” key properly. In much the same way, when I look at clustered indexes my thoughts automatically turn in the direction of a “real-world” reason for clustering.

Side note: It has crossed my mind that some of the behavior I’ve seen in my tests so far might be due to the fact that I’ve been using an identity column (and one that cannot be modified) as the basis for the clustered index. It’s possible that SQL Server has some algorithms that apply only to that special case. So I really do need to look at alternatives.

Consider, for example, an orders table, with customer_id and order_date. We could find good reasons for creating the table as a clustered index on (customer_id, order_date, sequencing_value), since an order is uniquely identified by the customer who placed the order, the date on which it was placed, and a value that tells us whether it was the first, second, and so on, order placed by that customer on that date.

As a simplified model of this scenario, we could modify our test table to use a clustered index on (random_data, id). If we do so, how does this affect data storage? The only change I make to my script is a modification to the “create index” statement, as shown in Listing 3.

Listing 3
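A sketch of the changed statement, again based on my earlier guess at the script rather than the original listing:

CREATE UNIQUE CLUSTERED INDEX test_table_ci ON dbo.test_table(random_data, id)
WITH FILLFACTOR = 90;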

After populating the table the sizing information looks like this:

The object is much larger than the one we got from the clustered index on IDENTITY, and we ought to discover what’s going on. Are we seeing the effects of 50/50 index page splits now that the data is being clustered on a column that is not time-based?

First we’ll do a little arithmetic with the data pages to compare this index with the identity-based index: 116 / 79 = 1.47. In other words, the index is 47% larger than the “perfectly packed” index that we got in the previous case. Or, reversing the arithmetic, we have 5,000 rows spread across 116 data pages, or 43 rows per page compared to 64 rows per page, which means the “average” page is roughly 67% used and 33% free space, which is pretty close to what you would normally expect from a B-tree index on randomly distributed data if it were doing “50/50” splits.

As a quick cross-check on space usage, and on my arithmetic, we can rebuild the index with a fillfactor of 100 to see how big it would be when perfectly packed. And it looks like Microsoft has as much trouble as Oracle with consistent use of brackets in its DDL! Why do I need brackets round the fillfactor on this statement, when I didn’t on the original statement to create the index?

Listing 4
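One way to do the 100% rebuild is shown below (a sketch, since the original listing isn’t reproduced); the alter index syntax insists on the brackets that the old-style create index syntax let me omit.

ALTER INDEX test_table_ci ON dbo.test_table REBUILD WITH (FILLFACTOR = 100);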

As expected, the value for data_pages after this “perfect” rebuild is 79, and the 67% utilisation estimate is correct.

But first impressions, especially high-level ones, can be very deceptive. That 67% utilisation looks about right for “random data”, but my data isn’t “random” – the leading column of the index is constrained to be one of just 50 values (which are arriving in a random fashion) but the second column is monotonically increasing. If SQL Server is doing 50/50 leaf page splits with this data then I would expect to see the space utilisation of this index approaching 50% as I insert increasing numbers of rows. With the first 5,000 rows, I have an average of 100 rows for each value of random_data and with a basic strategy of 50/50 leaf page splits I would have expected to see lots of pages in this index with about 32 rows in them.

In order to understand why we might expect 50% utilisation, imagine the simplified case of inserting somewhere in the middle of the index a large number of consecutive rows with the same value for random_data, but with steadily increasing values for id. At some point, the current leaf page will be full and split – and, for the moment, let’s assume it’s a 50/50 split. After a few more splits, we will end up with a leaf block where every row has exactly the same value for random_data. When we fill this page, a 50/50 split would move roughly 32 rows into a new page, leaving roughly 32 behind, and the empty space in the lower block can never be reused. From this point onwards the pattern would repeat, leaving behind a trail of half-empty blocks, for that value of random_data.

I repeated the test with 20,000 rows instead of 5,000. At 64 rows per page, a packed index would need about 20,000/64 = 313 pages for the data. If the index is heading towards the 50% utilisation dictated by a simple 50/50 split mechanism, this would lead to roughly 626 pages.

The actual results for the 20K-row test were as follows:

The index is much smaller than it should be if it were using a 50/50 split; it has an average of 54.2 rows per data page, which is even better than for the original 5,000-row data set. Just to check my arithmetic, I rebuilt the index with a fillfactor of 100 and got the 313 pages I would have expected. So, SQL Server is clearly not using a simple 50/50 split for this index, which means that we need to use dbcc ind() and dbcc page() to find out what’s going on.

I’ve deleted a lot of rows and columns from the output, but the sample output shows that:

· file 1, block 145 is the IAM page
· file 1, block 146 is the index root page
· file 1, block 144 is the first index leaf page (it has no previous page)
· file 1, block 239 is the second index leaf page (its previous page is 144)

By dumping the index root block I can see that the first seven pages of the index, in key value order, are: 144, 239, 301, 350, 408, 453, and 509. All these pages hold rows where random_data has the value zero, and the next page (201) holds rows where random_data has the value one. This makes 509 the most interesting page – it’s the one which may contain rows for both values of random_data, so I’m going to dump it to see how many rows it holds, and what they look like.
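For the record, the dumps here are simply further calls to dbcc page at detail level 3 (the database name is again illustrative):

DBCC TRACEON(3604);
DBCC PAGE('test_db', 1, 146, 3);    -- the index root page
DBCC PAGE('test_db', 1, 509, 3);    -- the leaf page that may span two values of random_data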

The level 3 dump held a total of 18 rows. It was still a lot of raw output so I’ve extracted just a tiny piece of it – the key values from the lowest and highest values stored in the block.

Notice how the lowest value and the highest value both have random_data = 0. Checking the first six blocks in the index, every one of them is 100% packed with 64 rows (no free space), for a total of 402 rows with random_data = 0. This, for reasons I’ll explain in a moment, is an interesting pattern – so I went through several other sections of the index checking to see if the pattern repeated.

Picking the index pages which held entries for random_data = 9, I found about 400 rows, six blocks which were 100% packed, and the seventh block with just a few entries. The same pattern appeared when I checked for random_data = 10. In fact, for just about every value I checked I found six packed blocks and a seventh block with just a few entries.

This pattern tells us quite clearly that SQL Server has not been using a simple-minded 50/50 block split as the data arrived. Somehow (perhaps as a side effect of a more generic strategy) it has managed to “recognise” a data stream where it ends up packing the data very efficiently, doing “100/0” block splits in the middle of the index.

As with the index on an identity column, this has a possible benefit, and introduces a possible threat. On the plus side, we may end up with very well packed data and a clustered index that is about as small as it could possibly be. On the minus side, we could get unlucky – if the way we use the data means we go “back in time” to blocks that are 100% packed and start updating them (in ways that make the rows longer) we could end up with a lot of 50/50 page splits appearing – and end up with a clustered index that is about twice the size it “needs” to be, possibly encouraging people to believe that they have to keep rebuilding the index to reduce the size and improve efficiency.

I’ve still only scratched the surface of what goes on inside clustered indexes, of course, and every result suggests new tests. I’ve found that a model emulating “lots of orders appearing for a few customers over time” gives me a clustered index that achieves a remarkable level of packing. What happens if I change the model to “a few orders appearing over time for a lot of customers”, i.e. what happens if the non-unique portion of the key is only “slightly” repetitive? To do this, I’m going to change the definition of the random_data column from:

to
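The two column definitions aren’t reproduced here but, in terms of my earlier sketch of the test script, the change amounts to something like the following; the exact expression is my assumption, the intent being to go from roughly 50 distinct values to roughly 1,000.

-- before: CAST(50 * RAND() AS int)     (about 400 rows per value at 20,000 rows)
-- after:  CAST(1000 * RAND() AS int)   (about 20 rows per value at 20,000 rows)
INSERT INTO dbo.test_table (random_data, padding)
VALUES (CAST(1000 * RAND() AS int), 'x');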

I then ran the test with 20,000 rows inserted so that I got approximately 20 rows per value. With this change in place, this is what we see when we compare the sizing information (the figures for 20,000 rows with 400 rows per value first, followed by the figures for 20 rows per value).

As you can see, although the volume of data is the same, the change in the pattern of distribution has given us an index that is significantly larger; running, in this case, at about 65% space usage per leaf page rather than 85%. That’s an interesting difference for an apparently marginal reason – and it raises the question of how much impact this increase in size (and scatter) could have on performance. This, perhaps, is where the biggest difference appears between an Oracle DBA and a SQL Server DBA. To an Oracle DBA an index running at a uniform 65% space utilisation is not usually seen as a threat – but Oracle DBAs tend to think in terms of B-tree indexes on heap tables, where the index is typically small compared to the table. If the index is a clustered index, though, then it’s carrying the table data, and that 35% of empty space could be an enormous amount of space, resulting in increased disk I/O and more thrashing of the cache. If that 35% is never going to be refilled, it makes much more sense for the SQL Server DBA to think about rebuilding the index.

Data distribution and index performance

It’s time to stop and take stock of what we’ve just demonstrated. When you create a clustered index that has an intended benefit of clustering the data on a “real” attribute, then the space utilisation and performance characteristics of queries may vary significantly with the pattern and timing of data arrival.

If each “cluster key” (such as the customer id) has a large number of rows, then the data packing can be very efficient because of a special 100/0 split algorithm – unless rows are updated so long after the initial insert that they cause lots of subsequent 50/50 page splits on older pages.

If each cluster key has a small number of rows, combined with random arrival, then it’s quite possible that the clustered index will eventually settle into something close to 50% space utilisation.

Different patterns of use can result in significant variations in the way that data is packed in clustered indexes – some cases will result in the index being much bigger than it needs to be, and this may have an impact on performance.

It would be nice to be able to give some guidelines on how to identify clustered indexes where “empty space” matters enough to make it worthwhile rebuilding the index from time to time – but I’m only a beginner with SQL Server, so I’m not the one to make such suggestions. One thing is clear from my tests, though: there are cases where you definitely shouldn’t be rebuilding indexes, because sometimes an index will get bigger and waste more space when it’s rebuilt.

Conclusion

There are many different ways one could poke around at clustered indexes and how they work, and this article has only scratched at the surface of a couple of cases. I haven’t even started to think about the consequences of deletes, or updates that change key values.

I hope, though, that the article has made one important point clear: clustered indexes are about putting the data in the right place, and we need to know something about how SQL Server manages to place data so that we can see how close to “perfect placement” we can get, and how far from perfection we can afford to be before it makes too much difference to performance.