18 May 2018
18 May 2018

Finding code smells using SQL Prompt: TOP without ORDER BY in a SELECT statement

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

Using TOP in a SELECT statement without a subsequent ORDER BY clause is legal in SQL Server, but meaningless because asking for the TOP 10 rows implies that the data is guaranteed to be in a certain order, and tables have no implicit logical order. You must specify the order.

In a SELECT statement, you should always use an ORDER BY clause with the TOP clause, to specify which rows are affected by the TOP filter. If you need to implement a paging solution in an application widget, to send chunks or “pages” of data to the client so a user can scroll through data, it is better and easier to use the OFFSETFETCH subclause in the ORDER BY clause, instead of the TOP clause.

A recommendation to avoid use TOP in a SELECT statement, without an ORDER BY, is included as a “Best Practice” code analysis rule in SQL Prompt (BP006).

Limiting rows with TOP

TOP isn’t standard SQL, but it is intuitive. If you want just a few sample rows from a table source, then it is tempting to use the TOP keyword without the ORDER BY clause. A single table is likely to conform to the order of the clustered index but even that isn’t guaranteed due to parallelism.

If we go beyond the query for a single table and do a few joins, the ‘natural’ order is less obvious. Maybe you are in AdventureWorks and want just five customers, any five customers, and their addresses. It is perfectly legal to do this, but it is slightly dangerous if you subsequently forget why you are doing it.

Listing 1

You get what you expect, just the first five customers that are returned by the query. The order I get is that of the clustered index of the Sales.Customer table’s PK_Customer_CustomerID, starting with the customer with the lowest customer_id that is a person rather than a store. A different execution strategy could well change that. You are not guaranteed a deterministic result. This may be fine if you just want a sample during development, but in production systems what you really want are the top five customers and addresses according to their rank order, as determined by some attribute such as how much they spend. You really need that ORDER BY.

In short, SQL tables have no guaranteed consistency in its intrinsic order. You may have set a PRIMARY KEY that gives your table values some fundamental ordering, but this isn’t guaranteed to happen consistently. SQL Server reserves the right to introduce any optimisation it wants, in creating an execution plan that will return the results, even if it means delivering those results in a different order. In short, you can’t guarantee that the results will be returned in the order you expect, unless you make it explicit with an ORDER BY statement.

So, we return to the perfectly reasonable requirement that the developer has of just being able to eyeball a representative sample of rows from the query. How should this be done?

SET ROWCOUNT and TABLESAMPLE: do they help?

There was a time when we had to use the SET ROWCOUNT statement to put a limit on the number of rows returned. One drawback to that was that the query optimiser couldn’t create an effective plan in the light of the number of rows requested, because the ROWCOUNT is a session or procedure/trigger-wide setting that wasn’t visible to the Query Optimiser within the query.

Also, it was possible to forget that you’d set the ROWCOUNT and neglected to ‘unset’ it. Another disadvantage was that you couldn’t pass the value a variable. TOP is better, because it works at the statement level, and you can pass either a row value, or a percentage, as a variable or expression.

You might think that you can reliably get a limited number of rows from a table with the TABLESAMPLE clause. The only problem with it is that it doesn’t work as advertised, and even if it did work as advertised it only works on tables, rather than the whole rich variety of table sources.

Listing 2

This is supposed to limit the number of rows returned from a table in the FROM clause to a sample number or PERCENT of rows. A quick test will show you why nobody uses it.


Listing 3

Getting meaningful table samples with TOP…ORDER BY

So far, because TABLESAMPLE is broken, we are stuck with the slight awkwardness of this for getting a sample from a table.

Listing 4

Why is this awkward? Listing 4 will give you back five rows, but you cannot entirely rely on which rows are returned, though it is likely to be in the order of the PRIMARY KEY, as we are just accessing a single table. However, sales.customer is a bit of a ‘trick’ table in that it uses polymorphic association, and the first 700 of the 19820 rows represent stores, not people. So, Listing 4 is likely to give a very unrepresentative sample of this table because You could easily get the incorrect impression of the data in the table, thinking that customers are stores when the majority are people!

What most developers will want is to see a few rows of the table they are investigating, taken at random, but if you want the sample in random order, then you have to make that explicit.

Listing 5

This will return you five rows in a random order, but it takes more resources to return the result. If you aren’t dealing with a ‘trick’ table and just don’t care about order, and need to indicate that fact in the code, then SQL Server will accept any system function, such as @@version or host_name(), even ORDER BY (SELECT NULL). This trick is often seen where SQL Server refuses to accept code in a window function that requires an ORDER BY. It means ‘I know, I know but I’m doing it deliberately‘.

Listing 6

If you are happy with the records you get by using TOP without the ORDER BY, then it is best to be entirely explicit, and point out that you really do want it by the PRIMARY KEY field

Listing 7

Using TOP with ORDER BY for reporting queries

TOP comes into its own for reporting purposes. Managers love lists of top customers and top salesmen. At this point, the ORDER BY part becomes crucial.

Listing 8

This gives you the top ten performing salespeople.

We might decide that this report doesn’t really tell us who the salesperson was, so we’d tweak it.

Listing 9

Why would we need to do that second ORDER BY? The original SQL was an aggregate query and we needed the top 10 aggregate sales totals, so we had to impose an order on it. This was passed, with no fixed order, to the outer query that added the person’s name. To be certain of the order of this outer query, it will need an explicit ORDER BY clause too. This is sometimes referred to as a ‘presentation ORDER BY‘, or ‘presentation ordering’.

Using FETCH-OFFSET instead of TOP

The better way of doing this is, of course, to use the ORDER BY with the optional OFFSETFETCH clause, available in SQL Server 2012 onwards, instead of TOP. It is much more versatile and is standard ANSII SQL too. Here are the twenty longest-serving employees of AdventureWorks.


Listing 10

Now, with ORDER BY…OFFSET…ROWS FETCH FIRST…ROWS ONLY, you can provide a means of scrolling or paging through this hall of fame.

Using TOP with INSERT, UPDATE, MERGE, or DELETE

It seems odd that you are discouraged from using TOP without and ORDER BY, when in certain circumstances you are actively prevented from doing so. As well as the SELECT statement, the DELETE, INSERT, MERGE and UPDATE statements all have a TOP clause. In contrast to SELECT, you can’t have an associated ORDER BY clause. Let’s take this example.


Listing 11

Now try putting in an ORDER BY clause! It won’t allow it. As the documentation says:

“The rows referenced in the TOP expression used with INSERT, UPDATE, MERGE or DELETE are not arranged in any order”.

No, you must do something like this.

Listing 12

Likewise, an INSERT statement. We can’t use TOP to insert rows in a meaningful chronological order. As the book says:

“When TOP is used with INSERT the referenced rows are not arranged in any order and the ORDER BY clause cannot be directly specified in this statement.”

If you need to do this, you must use TOP together with an ORDER BY clause that is specified in a sub-select statement.

DELETE has a TOP clause, but we can’t use it either. What if you wanted to clean out old purchase order details? You’d want to be certain to clean out the oldest ones first. We can’t put an ORDER BY in the delete statement, but then we don’t have to.

Let’s set up the test.

Listing 13

We now delete the ten oldest purchase order details.

Listing 14

So, what is the point of having that TOP filter if it can’t be used to DELETE, INSERT, MERGE or UPDATE? Well, in fact, it can be used, in circumstances where a particular set of records don’t need to be deleted in a specific order, as long as they eventually get deleted.

The use of the TOP filter without the ORDER BY is a life-saver if you need, for example, to delete lots of rows, regularly, from an operational production system. Deletes are logged, and can also result in escalating locks. I once had to design a system that regularly cleared out a million rows from a SQL Server database. The optimum method involved taking a lot of successive bites when eating the elephant rather than attempting it in a mouthful.

We can illustrate this very easily, though you wouldn’t see the advantage until you have a working system, especially one that was needing to access the table while you were deleting, updating, inserting or merging. Again, we will illustrate the point with a temporary table so as not to interfere with the good working of AdventureWorks.

Listing 15

I’ve found, in the past, that large-scale operations like this often benefit from being done in chunks, and the size of the chunk is a matter of fine-tuning with the operational system to get it right. For this sort of work, the TOP clause without the ORDER BY in a DELETE, INSERT or UPDATE can be very valuable for doing large-scale changes, a short step at a time, in a hard-working transactional system

Summary

SQL Server’s TOP clause in a SELECT statement is very useful and intuitive, but it allows you to leave out the associated ORDER BY clause that clarifies what you had in mind: TOP by what aspect? After all, your TOP ten songs aren’t the ten loudest, or the ten sung in the highest voice. They are the ten most popular in terms of the records sold. You may, fortuitously get the right result in development work, but in production, the workload, the server and the data size could result in the queries being optimised in very different ways, resulting in a different result.

For a more versatile way of dealing with such thing, I’d suggest using the ORDER BYOFFSETFETCH syntax introduced in SQL Server 2012 because it is more versatile and conformant. It isn’t too much harder to remember than the TOP filter either.

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

Share this post.

Share on FacebookShare on Google+Share on LinkedInTweet about this on Twitter

Related posts

Also in Hub

Documenting your Database with SQL Change Automation

It would be wrong to portray SQL Change Automation (SCA) as being suitable only for epic project deployments, of the sort described in my previous article. It can do smaller tasks as well. To demonstr...

Also in SQL Prompt

SQL Prompt Code Analysis: INSERT INTO a permanent table with ORDER BY (PE020)

The SQL query that is used to produce the result that is inserted into the permanent table has its order specified by an ORDER BY statement. Relational tables are not ordered, so the ORDER BY is meani...

Also in Product learning

Spoofing Realistic Credit Card Data for your Test Systems using Data Masker

Data protection and privacy regulations, ranging from GDPR to HIPPAA to PCI, among many others, put strict compliance requirements on the storage and use of personal and sensitive data, in any of your...

Also about static code analysis

Improving legacy code using SQL Prompt code analysis

Code analysis is a useful technique for identifying potential issues in your code base, and ensuring that it adheres to your pre-defined standards, so that you can release your software product with m...