Pagination and ordering by large values

Comments 0

Share to social media

Pagination is a technique for limiting output. Think of Google search results, shopping the electronics category on Amazon, or browsing tagged questions on Stack Overflow. Nobody could consume all of the results in a single shot, and no site wants to spend the resources required to present them all to us, so we are offered a manageable set of results (a “page”) at a time.

In the most basic cases, we can implement this functionality in SQL Server by using OFFSET/FETCH. The problem is that anything that uses TOP or OFFSET will potentially have to scan everything in the index up until the page requested, which means that queries become slower and slower for higher page numbers. To achieve anything close to linear performance, you need to have a narrow, covering index for each sort option, or use columnstore as Erik Darling recommends here, or concede that some searches are just going to be slow. Throw in additional filtering, pulling data from other tables, and letting users dictate any sort order they want, and it becomes that much harder to tune for all cases.

I have a lot that I want to say about paging, and I will follow up with more content soon. Several years ago, I wrote about some ways to reduce the pain here, and it is long overdue for a refresh. For today’s post, though, I wanted to talk specifically about pagination when you have to order by large values. By “large” I mean any data type that can’t fit in an index key, like nvarchar(4000) or, really, anything that can’t lead in an index and/or would push the key past 1,700 bytes.

An Example

Let’s say we have a Users table on SQL Server 2022:

And let’s populate that table with a million rows (which I update after the insert to use an expression against the identity column):

Spot checking the first few rows:

Inspecting a few rows of sample data

Let’s also say that the user interface that lets you page through these users, 25 at a time, lets you sort by any of these columns – including AboutMe. If you’re already getting hives, I don’t blame you.

We’ll use dynamic SQL (a la the kitchen sink approach, also due a refresh) for the initial paging mechanism, supporting a different plan for each sort option. I’m going to leave out error handling for brevity, and stick to the simple case where we’re just browsing all users in some defined order (filtering will come later). We’ll skip right to the case where we use a CTE to first derive just the key values before we pull the rest of the information, e.g.:

For the procedure in my test rig, I’m adding two additional parameters that you wouldn’t use in production: @TestNumber for evaluating performance, and @Output to suppress rendering results (I’ll dump the rows into a #temp table, so I can run lots of tests without overwhelming SSMS). I also use MAXDOP to exaggerate the impact without parallelism. Here’s how the initial procedure looks:

Now, this code and this table and this data may produce slightly different plans on your machine, with or without MAXDOP, due to a variety of reasons. But the impact seen in the plans and runtime should be similar.

The good sort

This technique works great, of course, when we sort by the UserID column (in either direction):

Runtime metrics sorting by UserID

With a supporting index, even though it’s wide, no work is spent sorting. You can see how late in the plan the sort operator is introduced; after we’ve already filtered down to 25 rows:

Plan for sorting by UserID

The bad sort

But let’s pretend there is a very good reason to sort by the AboutMe column instead. When we put that in the ORDER BY, it gets ugly:

Runtime metrics sorting by AboutMe

The sort has to occur much earlier in order to locate the 25 rows we need:

Plan for sorting by AboutMe

What can we do when we have to sort by some column that can’t be sorted efficiently?

Saved by columnstore?

Often, a non-clustered columnstore index can compensate for the “I want to sort by any column but I don’t want to index every column” issue, as each column acts as its own little mini-index in a way. But it doesn’t help much in this case:

Runtime metrics for columnstore

The plans for both sort options are quite similar but, again, pay particular attention to where the sort happens (for UserID it’s after we’re down to 25 rows; for AboutMe it’s against all 1MM):

Plan for columnstore, sorting by UserID

While the sort may have been more efficient against fewer pages, it still has to sort all one million values – whether the 25 rows we’re after are at the beginning or end of the set.

Plan for columnstore, sorting by AboutMe

And while they both say they represent 84%, they’re not the same – those are relative (estimated) costs of two very differently-costed plans.

Even in the best case scenario, where the columnstore index does help drive down duration, you still need to account for the overhead involved, and it might not always turn out in your favor.

Another technique

A way I have improved this in a few scenarios is to index a computed column representing the first n characters of the longer string. Where to peg n depends on the nature of the data, and has to allow for unexpected sort order if there are two rows where the first n characters are identical. But let’s say we can be reasonably confident that the first 100 characters are “unique enough”:

In order to take advantage of this, our code needs to conditionally swap out the original column name with the computed column:

Now we get performance much more in line with sorting based on any other narrow index:

Runtime metrics for computed column

Plan for computed column, sorting by AboutMe

You can play with the PERSISTED option, but this may add unnecessary storage – it’s a common misconception that computed columns must be declared as PERSISTED to be indexed, which isn’t true, though there are cases where persisting the column is important.

Neither choice avoids maintenance overhead, of course. Like any index or computed column, you need to weigh the benefit it provides against the maintenance cost it creates.

At the end of the day, if people are changing their AboutMe content a lot, or if there just aren’t that many people browsing users and sorting by that column, then you may be better off just letting that specific paging operation continue to be slow.

What if I can’t change the schema?

If you can’t change the structure of the Users table, there are other ways to achieve a similar outcome. For example, you could have a separate table with just the AboutMePrefix and UserID columns, and maintain it manually (for real-time consistency, using triggers; or, if there can be a sync delay, through other scheduled means). When the sort is specified on that column, you dynamically swap in the new table instead of Users in the original CTE, e.g.:

You could also consider indexed views.

Of course, everything in this post violates pure relational principles, but that’s a problem for another day…

Conclusion

In a future post, I’ll talk about filtering, including trailing wildcard searches against LOB columns – which can use the computed column if the pattern is short enough. Sometimes, though, pagination is just destined to hurt, and for advanced searching of large string columns, it might be time to look outside SQL Server (e.g. Elasticsearch).