The single most common barrier to writing efficient database code, for a developer versed in a typical client language (such as C#), is an inability to leave behind the habit of working on one row at a time, in favor of working with a set (a collection of rows).
What is the difference? Let’s examine a very simple example of summing up 5 rows in a table. The following code creates a table variable (@Sample) and populates it with the rows of data.
1 2 3 4 5 6 7 8 9 10 11 12 |
DECLARE @Sample TABLE ( Data INT ) INSERT @Sample ( Data ) VALUES ( 1 ), ( 2 ), ( - 4 ), ( 5 ), ( 8 ) SELECT Data FROM @Sample |
For a .NET developer this pseudo-code would be the way to perform this simple sum:
1 2 3 4 5 6 7 8 |
Dim intS AS Integer = 0, 'Initialise the sum to zero objR AS Row For Each objR In coll.Rows intS += objR.Data Next Debug.Print intS |
This is the standard approach for a procedural language; we not only tell the compiler what we want (the sum of all objects) but also how we want to approach the objects when calculating the sum. In this example, the coll variable holds a collection with objects of type Row. The For-Next loop iterates through all the objects and, for each object, fetches the current value from the Data property and adds the value to the intS variable. In other words the code operates on one row at a time.
T-SQL is a declarative language which means that the programmer is supposed to tell the database server what is required but let SQL Server decide the best way to finalize the task. However, many programmers force the procedural approach on SQL Server, with code such as the following:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
DECLARE curSum CURSOR FOR SELECT Data FROM @Sample DECLARE @s INT , @Total INT = 0 OPEN curSum FETCH NEXT FROM curSum INTO @s WHILE @@FETCH_STATUS = 0 BEGIN SET @Total + = @s FETCH NEXT FROM curSum INTO @s END CLOSE curSum DEALLOCATE curSum SELECT @Total |
Even though this code is written in T-SQL, it still operates on one object (row) at any given time. This approach is commonly referred to as Row-by-Agonizing Row (RBAR). If we increase the number of rows tenfold, the time to complete the task would increase by a factor of ten; for a very large table (100 million rows or more) this sum could easily take hours to complete.
So how do you do this in a set-based way? Luckily for us, T-SQL has an aggregate function available to do the job for us.
1 2 |
SELECT SUM ( Data ) FROM @Sample |
The following table compares the performance of the cursor method, versus the set-based method, for an increasing numbers of rows.
Rows | CURSOR | SET BASED | Performance Factor | |||
DURATION (mS) | READS | DURATION | READS | DURATION | READS | |
1k | 47 | 4009 | 0 | 7 | – | 953 |
10k | 492 | 40004 | 3 | 42 | 164 | 953 |
100k | 5189 | 400389 | 28 | 388 | 179 | 1039 |
1000k | 51358 | 4003848 | 286 | 3847 | 180 | 1041 |
As you can see, both solutions scales linearly i.e. when we increase the number of rows by a factor of ten, we increase the query duration by roughly the same factor. However, there is one big difference, and that is that the set-based solution is about 180 times faster, and puts only about 1/1000th of the read pressure on the server!
So why is the performance difference so vast? We’ll discuss this in full detail in a later article in this series, but in essence, the set-based approach allows SQL Server to return the all the required rows very efficiently, fully exploiting the underlying data storage mechanisms. The internal structure where the data is stored in SQL Server is called a Page and it defines the minimum amount of data read or written to the database, regardless of whether you want all rows in that Page or only one row. So, with a set-based approach, we only need to fetch a page once to get all x rows stored within, whereas with the RBAR approach, we must re-fetch the page x times. We’ll also see in a later article how, with appropriate indexes in place, we can speed up this process still further.
This paradigm shift is all there is to it! SQL Server has some built-in aggregate functions to accomplish the task of summing a number of rows, and many others. The key difference is that you, in set-based thinking, instead of working with individual rows, we work with sets which can contain any number of rows.
To become a professional database developer you have to abandon the old habit of thinking what you want to do with each row, and start thinking instead about what you want to do with the set, by manipulating the columns. Think vertical instead of horizontal!
If we take this example a step further, considering the task of multiplying together all the rows, we will encounter the second biggest barrier to becoming a professional database developer, and that is math skills.
For a .NET developer, this pseudo code would be the way to go:
1 2 3 4 5 6 7 8 |
Dim intP AS Integer = 1, ' Initialise the product to 1 objR AS Row For Each objR In coll.Rows intP *= objR.Data Next Debug.Print intP |
However, we just established that we no longer want to work with single rows, right? Unfortunately, however, while SQL Server provides a handy SUM aggregate function, it does not provide a MULTIPLY aggregate function. So, how do we perform this task in a set-based fashion? This is where your math skills become handy.
1 |
A * B * C equals e^(ln(A)+ln(B)+ln(C)) |
Having transformed the multiplication into a sum, it’s easier to see that the final query should look something like this:
1 2 |
SELECT EXP ( SUM ( LOG ( Data ))) FROM @Sample |
But we still have an issue here; logarithmic operations are only allowed on positive values, and not including zero (logarithmic operations on zero and negative values are undefined in natural numbers and will give an Invalid Floating Point operation in T-SQL), and multiplication with zero, under any circumstances, is equal to zero. Also, remember that a multiplication with an odd number of negative numbers results in a negative product and multiplication with an even number of negative number results in a positive product. Here is one example of how to solve this problem:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
IF EXISTS( SELECT * FROM @Sample WHERE Data = 0 ) SELECT 0.0E -- If any value is zero, the total product is zero. ELSE SELECT CASE IsNegativeProduct WHEN 1 THEN - EXP ( theSum ) ELSE EXP ( theSum ) END FROM ( SELECT SUM ( LOG ( ABS ( Data ))), -- Sum all exponents -- Keep track if product is negative or positive SUM ( CASE WHEN Data < 0 THEN 1 ELSE 0 END ) % 2 FROM @Sample ) AS d ( theSum , IsNegativeProduct ) |
The main reason for the checks in the above code is the limitations of the LOG function. So, can we complete the task in another fashion and still produce the correct result? Yes, we can:
1 2 3 4 5 6 |
DECLARE @p INT = 1 UPDATE @Sample SET @p * = Data SELECT @p |
This code, which looks very much like the C# pseudo code, produces the right answer, a value of -320, but how did it work? The difference is that we tell T-SQL to multiply the values but we do not force T-SQL to retrieve the rows in any order. Another reason we can use the code above is due to the nature of a product. If we want the product of A and B (a * b), the reverse is also valid (b * a); it doesn’t matter which order we take the values in the multiplication, as long they are of same data type.
This leads us to the third ‘light bulb’ moment on the road to becoming a professional database developer: understanding that rows in a database have no order. This is very different to the way .NET developers think. A collection of objects is normally a double-linked list, where you can traverse forward and backward. Rows in a table can be logically ordered by assigning a sequence number or a sequence denoted by date and time, but you can never rely on the physical order of rows in a set.
Summary
- Think in sets and not in single rows
- Brush up your math skills as much as you possibly can
- Rows have no determined physical order
Coming Next….
Efficient storage and retrieval: Normalization rules.