Should you always use a clustered index?

In one of his recent editorials, Steve Jones of SQLServerCentral wondered whether every table should have a primary key, and in doing so mentioned the often-quoted advice that every table should also have a clustered index.

If you already know about the difference between a clustered and non-clustered index, you can safely skip the next couple of paragraphs. If not, here goes:

Take the example of a phone book. The actual data – that is, the name, address and phone number records – is ordered by the name. If you want to look up Joe Bloggs’s phone number, you open the book somewhere near the middle, maybe see the names there start with “M”, but “Bloggs” is before “M”, so you go a bit earlier in the book. You keep narrowing it down until you find the entry labelled Bloggs, and that’s it – all the data for that record is right there. That’s a bit like a clustered index.

On the other hand, a book might have a table of contents, sorted alphabetically. If you want to find out about llamas, you search the contents for llamas, which probably then gives you a page number, at which point you go to the page, and there’s the data about llamas. The difference here is that you’ve had to do an extra bit of indirection – following the page number pointer – in order to get to the data. You can probably now see that while you can have as many tables of contents as you like, ordered in any way you like, one set of data can only be physically arranged in one way. This means you can have many non-clustered indexes, but only one clustered index on a table.

OK, back to the original point of the post. In SQL Server, if you’ve only got a non-clustered index on a table, but no clustered index, then the “pointers” in the non-clustered index are row identifiers: they point to the physical location – the file, page and slot – where the data resides on disk. On the other hand, if you’ve also got a clustered index, then the non-clustered index instead contains the key into the clustered index. This means that whenever you access a record through the non-clustered index, you must first traverse that, then traverse the clustered index as well – more expensive than simply going to the row directly.
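
In T-SQL terms, the two situations look something like this (the table and index names here are made up purely for illustration):

```sql
-- Case 1: heap plus non-clustered index. The index leaf rows carry a
-- row identifier pointing straight at the row in the heap.
CREATE TABLE HeapTable (id INT, val INT);
CREATE NONCLUSTERED INDEX ix_heap_id ON HeapTable (id);

-- Case 2: clustered plus non-clustered index. The non-clustered leaf
-- rows carry the clustered key instead, so a seek on ix_clust_val has
-- to be followed by a second traversal of the clustered index.
CREATE TABLE ClusteredTable (id INT, val INT);
CREATE CLUSTERED INDEX ix_clust_id ON ClusteredTable (id);
CREATE NONCLUSTERED INDEX ix_clust_val ON ClusteredTable (val);
```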

So, I was wondering, how expensive is this extra indirection? In the true spirit of micro-benchmarks, the following data should be treated with an appropriate amount of suspicion.

Firstly, I created six tables, each with three integer columns named [one], [two] and [three] (logical!). The test would be to read five million rows from each of these tables, ordered by the value in column [one]. The data in columns [one] and [two] was random and uncorrelated, and the data in [three] was equal to that in [one]. Before inserting the data, I created one of the following sets of indexes on each table:

  • None at all!
  • A clustered index on the “wrong” column ([two])
  • A clustered index on the “right” column ([one])
  • A clustered index on [two] and a non-clustered index on [one]
  • A clustered index on [three] and a non-clustered index on [one]
  • A non-clustered index on [one]
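
As a sketch, the setup for the fourth case looked something like this (the index names are my own shorthand, not necessarily what I actually used):

```sql
CREATE TABLE ClusteredAndNonclustered ([one] INT, [two] INT, [three] INT);
CREATE CLUSTERED INDEX ix_two ON ClusteredAndNonclustered ([two]);
CREATE NONCLUSTERED INDEX ix_one ON ClusteredAndNonclustered ([one]);
```

The test query itself was then just a full read ordered by [one]:

```sql
SELECT [one], [two], [three]
FROM ClusteredAndNonclustered
ORDER BY [one];
```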

After doing this, I restarted the instance of SQL Server in order to clear any cache it might have, then read the contents of the tables using a quick C# application I knocked up. Some of the results really surprised me:
Each test read 5000000 rows:

  ClusteredWrongColumn:            15241.072ms
  ClusteredRightColumn:             4229.28ms
  ClusteredAndNonclustered:        19376.368ms
  NonclusteredAndSimilarClustered: 10259.92ms
  NonclusteredOnly:                11011.792ms
  NoIndexes:                       17856.96ms

Re-running the application without restarting SQL Server changed the results somewhat, presumably due to caching (the RAM footprint of the instance had gone from 400MB before the first run to 1.3GB afterwards):
Again, each test read 5000000 rows:

  ClusteredWrongColumn:            17715.984ms
  ClusteredRightColumn:             3837.68ms
  ClusteredAndNonclustered:        17120.752ms
  NonclusteredAndSimilarClustered: 14614.512ms
  NonclusteredOnly:                10588.864ms
  NoIndexes:                       13752.992ms

So, having the clustered index on the right column obviously beats having it on the wrong column, and having a non-clustered index beats having none at all. So far so good.

Having the non-clustered index on “similar” (OK, identical, but I bet it’d work with not-quite-identical-but-close data as well) data to the clustered index was also clearly better than having the two totally uncorrelated, and this also makes sense: the page cache gets thrashed much less when you’re reading the clustered index more-or-less sequentially than when you’re jumping around all over the place in it.

As I suspected, a non-clustered index without a clustered index underneath it performed significantly better than one layered on top of a clustered index: around ten seconds rather than fifteen to twenty.

However, there was one bit that really surprised me: the ClusteredAndNonclustered result was significantly worse than the ClusteredWrongColumn result in the first run, and merely similar to it in the second. I guess SQL Server must have done an in-memory sort of the table when the non-clustered index didn’t exist, which meant fewer disk reads than traversing the non-clustered index when it did exist. That’d also explain the much closer results in the second run.

What was the point of this post? Curiosity mainly, but it’s also worth realising that a clustered index isn’t always the best thing in the world for every table, especially if you’re searching over a wide variety of very different fields, rather than just doing lookups on one field the majority of the time. You need to look at each case on its own, and understand exactly what queries you’re running before making a decision.

Finally, I should mention that this isn’t the whole story. My tests used a load of data inserted into an empty database, then queried. Your databases probably have a whole lot more update and delete operations going on, and they can have a big impact on the way data ends up being laid out on disk – yet another variable to consider!