Data Generation Strategies

Good, scalable, database performance is largely about putting the data in the right place, being able to access it efficiently, and avoiding unnecessary overheads. At the end of the previous article, I had given myself the task of building a large data set quickly so that I could start examining some of the strategies for data placement in SQL Server 2008 to identify their strengths and weaknesses.

Initially all I wanted to do was emulate the action of an SQL statement I often use in Oracle to generate a few million rows of data, and this article is the story of my wanderings through the SQL Server landscape to find a reasonable, though not necessarily perfect, way of generating the type of data I needed.

A Million Row Table: the Oracle way

Although this was discussed in some detail in Part 1, Listing 1 demonstrates, in Oracle code, the sort of data generation code I was aiming for.

execute dbms_random.seed(0)

create table t1

with generator as (

select rownum id

from dual

connect by

rownum <= 1000

)

select

rownum id,

trunc((rownum-1)/50) clustered,

mod(rownum,20000) scattered,

trunc(dbms_random.value(0,20000)) randomized,

trunc(sysdate) + dbms_random.value(-180, 180) random_date,

dbms_random.string('l',6) random_string,

lpad(rownum,10,0) vc_small,

rpad('x',100,'x') vc_padding

from

generator g1,

generator g2

where

rownum <= 1000000

;

alter table t1 add constraint t1_pk primary key(id);

create index t1_clu on t1(clustered);

create index t1_sca on t1(scattered);

create index t1_ran on t1(randomized);

create index t1_dat on t1(random_date);

create index t1_str on t1(random_string);

Listing 1: Generating a million row table in Oracle

This example is going to give me 1,000,000 rows of data in a simple heap table, with six B-tree indexes, one of which is built implicitly by the addition of the primary key constraint. I have a primary key that has been generated by the pseudo-column rownum, a column of randomly generated strings that will be upper case (‘U’) and six characters in length, a column of randomly generated dates (with time to the nearest second) covering 360 days and centered on “today”, and three numeric columns that are going to hold 20,000 distinct integer values with 50 rows of each value (although that’s an approximation for the column called “randomized”). I needed numeric columns with different patterns in their data distribution because data location has a big impact on the work needed to access the data. I’ve also included a couple of simple character columns, one small, one larger, to make the row lengths look as if they might represent some realistic data and to give me something that isn’t indexed that I can select or update.

Running this code on a small 2.0 GHz laptop, it takes about one minute to create the table, and about eight seconds for each index (mostly time spent re-reading table blocks). If I remove all the columns that make use of the use dbms_random package the time to create the table drops to about 7 seconds, so in practice I make sure that I don’t generate any columns or indexes that I don’t really need, for any given test.

Finally, it’s worth noting that Oracle, unlike SQL Server, has to be explicitly instructed to collect statistics about a table after it has been created, although it does collect statistics on the indexes as you create them. As such, I’ve missed out a step in my code that would be critical to Oracle’s Cost Based Optimizer.

In part 1 I outlined most of the problems with translating this code to SQL Server so I won’t cover them again here. Many of them were little variations in syntax, function names and features and were relatively easy to overcome by diligent searching of the manuals. However, three of the issues were especially relevant to the strategies I attempted:

SQL Server doesn’t have an equivalent of the dbms_random procedure, to generate random data
I’ve yet to overcome this one; essentially I’ll have to find one or write my own. However, I decided to worry about that later when I really need some randomized data. In this article, I’ll just work on ideas for generating the scattered and clustered data patterns. (Update: since writing the article I’ve discovered that using the newid() function is a popular way to generate random data).
SQL Server doesn’t recognize the create as select… syntax
I have to do “create table”, “insert as select …” instead, which is a minor detail and easy to address. In addition, in the discussion on the previous article a couple of readers pointed out the option for a simple “select…into…from” to create and populate a table. This mechanism copies not null constraints but doesn’t handle other constraints, and doesn’t allow you to define a key as you create the table. However, that’s okay with me because at present I’m not too worried about looking closely at indexes.
SQL Server doesn’t have a connect by clause
I had to find a convenient, alternative way of generating a rowsource to play with. As you can see, my strategy with Oracle is to use common table expression (subquery factoring in Oracle speak) to generate a fairly small volume of data, then introduce a Cartesian join (cross product) to expand that to a much larger volume; I’ll explain why later.

The following sections describe the various approaches that I attempted. Unless stated otherwise, the code is run using SQLCMD.

Strategy 1: Row-by-Row

I started with a strategy that I knew was probably a very bad idea (it certainly would be in Oracle): create an empty table and use a program loop to insert one row at a time. This is an easy way to get the counter that I need for my mod() and trunc() games.

I got a little surprise here in that the only loop construct in T-SQL seems to be WHILE…END. I also had a little trouble finding an assignment operator; initially, it seemed as if the only option was to SELECT into variables, but eventually I found the SET operator. Running a few timing tests, the two options seem to perform identically, so maybe they’re effectively the same piece of code behind the scenes, anyway.

Stripped to the minimum, Listing 2 shows what I came up with.

CREATE TABLE big_table

(

id INT ,

clustered_data INT ,

scattered_data INT ,

vc_small VARCHAR(10) ,

vc_padding VARCHAR(100)

) ;

DECLARE @div INT = 50

DECLARE @mod INT = 200

DECLARE @loop INT = 1

DECLARE @limit INT = @div * @mod

BEGIN

WHILE @loop <= @limit

BEGIN

INSERT INTO big_table

( id ,

clustered_data ,

scattered_data ,

vc_small ,

vc_padding

)

VALUES ( @loop ,

FLOOR(( @loop - 1 ) / @div) , -- trunc()

( @loop - 1 ) % @mod , -- mod()

STR(@loop, 10, 0) , -- to_char()

REPLICATE('x', 100) -- rpad()

)

SET @loop = @loop + 1

END

Listing 2: The row-by-row approach

The functions I’ve used are the closest equivalents to the Oracle functions that I’ve referenced in the comments, but they’re not exactly identical. I’ve set up a few variables, rather than fixing constants into the script; after all, I’m hoping to create a template that I can tweak easily to generate different volumes and patterns.

In this case, I restricted myself to 50 distinct values, 200 rows of each, for a total of just 10,000 rows. Performance was, as expected, dire: the code took 19.5 seconds to complete. Regardless of which RDBMS you’re using, doing a big job in lots of little pieces is almost certain to be a very bad idea.

Interestingly, my code above is suffering from the overheads of handling some messaging for every single insert in the loop. In my reading, I stumbled across the NOCOUNT option, and by adding the SET NOCOUNT ON command to the start of Listing 2, I reduced the run time to 5.5 seconds. This is still slow, of course, but gives an interesting insight into how many little details you need to learn before you can criticize a product!

A note on NOCOUNT option: SQLCMD vs. SSMS
When running my code from SSMS, rather than SQLCMD, the run time was still 5.5 seconds regardless of whether or not I set the NOCOUNT option, and I did check that the “one rows processed” messages still appeared in the message window when NOCOUNT was off. So SSMS must be using a different method to SQLCMD?

Strategy 2: Sys.objects, a CTE and row_number

The sys.all_objects view is always going to hold a reasonable number of rows, so why not use that, doing a cross join to itself, to generate a lot of data? I could put the driving query into a common table expression (CTE) before doing the cross join, to make the code look a little simpler. I need a sequential number, so I can play around with the row_number() analytic function inside the CTE and then use a little arithmetic in the cross join.

Listing 3 puts this approach into practice. It seems to work quite well and generated 1,000,000 rows in about 13.5 seconds, running at close to 100% CPU. The definition of big_table is the same as it was in Listing 2.

DECLARE @div INT = 50 ;

DECLARE @mod INT = 20000 ;

DECLARE @limit INT = @div * @mod ;

DECLARE @driver INT = 1000 ;

WITH generator

AS ( SELECT TOP ( @driver )

row_number() OVER ( ORDER BY schema_id ) id

FROM sys.all_objects

)

INSERT INTO big_table

SELECT TOP ( @limit )

@driver * ( g1.id - 1 ) + g2.id ,

FLOOR(( @driver * ( g1.id - 1 ) + g2.id - 1 ) / @div) ,

( @driver * ( g1.id - 1 ) + g2.id - 1 ) % @mod ,

STR(@driver * ( g1.id - 1 ) + g2.id, 10, 0) ,

REPLICATE('x', 100)

FROM generator g1

CROSS JOIN generator g2

OPTION ( FORCE ORDER ) ;

Listing 3: Cross joining sys.all_objects into a CTE

There are a couple of things I don’t like about this code, though. First, it depends on a SYS object and it’s possible that I’m able to see that at present purely because I’m operating as a privileged user. There’s also the potential problem that the number of rows in the view may be smaller than I need for bigger tests; in my current test database it’s less than 2,000 rows.

Another thing I’m not keen on is putting in that FORCE ORDER hint, which I’ve included to make sure that the optimizer joined the two copies of generator in the right order to allow my arithmetic to produce the expected results). I don’t really know how the optimizer is going to deal with the CTE; the execution plan happens to work out the way I want at present but that might just be luck, and could change if I ask for fewer or more rows.

For a little extra clarity I like the idea of removing the repetition of the messy bit of arithmetic, involving the two ids, by putting in an inline view. I’ll demonstrate that in the next example.

A note on “insert…with…select …” instead of “with…insert…select…”
My first attempt at using a CTE to create some data ended in failure(“Msg 102, Level 15, State 1, Server HPBASE, Line 2, Incorrect syntax near ‘generator’“. I had used the Oracle pattern, “insert…with…select …”, where the CTE is part of the select statement, rather than “with…insert…select…”.

Strategy 3: A Recursive CTE

The CTE looks like a good starting point for generating a lot of data, but I’d like to avoid dependencies on anything that may be outside the permission set of the current user.

So how about a recursive CTE (a feature that didn’t appear in Oracle until 11g) that doesn’t query any objects at all?

DECLARE @div INT = 50 ;

DECLARE @mod INT = 20000 ;

DECLARE @limit INT = @div * @mod ;

DECLARE @driver INT = 1000 ;

WITH generator

AS ( SELECT 1 AS id

UNION ALL

SELECT id + 1

FROM generator

WHERE id < @driver

)

INSERT INTO big_table

SELECT id ,

FLOOR(( id - 1 ) / @div) ,

( id - 1 ) % @mod ,

STR(id, 10, 0) ,

REPLICATE('x', 100)

FROM ( SELECT TOP ( @limit )

@driver * ( g1.id - 1 ) + g2.id id

FROM generator g1

CROSS JOIN generator g2

) iv

OPTION ( MAXRECURSION 0, FORCE ORDER ) ;

Listing 4: The recursive CTE approach

This solution performs reasonably well, generating a million rows in about 13 seconds, which is the best yet. However, I had many false starts before reaching this solution, and in fact temporarily discarded the idea completely because I couldn’t make it work. On the first couple of very small test runs the code seemed to work, but it crashed when I finally set @driver to 1,000 to generate my million rows, producing the error message:

1 2	Msg 530, Level 16, State 1, Line 8 The statement terminated. The maximum recursion 100 has been exhausted before statement completion.

At the time, I didn’t consider the possibility that the recursion level was a soft limit that could be set by a hint and it wasn’t until a few weeks after my first attempt that I heard about the hint, during a conversation with a SQL Server expert. The recursion level can be set to any value up to 32,767, after which a setting of zero means “no limit”.

The recursion level wasn’t the only little problem I had with the query: on my first attempt to use the inline view I ran into this error:

1 2	Msg 156, Level 15, State 1, Line 1 Incorrect syntax near the keyword 'option'.

When I removed the option() clause, the error changed to:

1 2	Msg 102, Level 15, State 1, Line 32 Incorrect syntax near ';'

It took me 15 minutes staring at the text before I noticed that I hadn’t given the inline view an alias; this is not a requirement in Oracle and it actually causes an Oracle error to be reported in some cases if you do include one. It always seems to be the smallest things that waste the most time when you’re in an unknown environment!

A note on ANSI join support in SQL Server vs. Oracle
T SQL Server syntax for ANSI joins is far better than Oracle’s. It was a nasty surprise when I first started to write some ANSI code and found that things like the aliases for inline joins and columns, which SQL Server allows, simply didn’t work in Oracle.

Before we move on to the next strategy, you may be wondering why Listing 4 (and my original Oracle source) uses a small recursive CTE with a cross join rather than just using a single large CTE in a statement like that shown in Listing 5 (which gets rid of the FORCE ORDER hint).

WITH generator

AS ( SELECT 1 AS id

UNION ALL

SELECT id + 1

FROM generator

WHERE id < @driver -- set to something BIG

)

INSERT INTO big_table

SELECT id ,

FLOOR(( id - 1 ) / @div) ,

( id - 1 ) % @mod ,

STR(id, 10, 0) ,

REPLICATE('x', 100)

FROM generator g1

OPTION ( MAXRECURSION 0 ) ;

Listing 5: A simpler, but much slower, CTE approach

The answer is that recursion is expensive in most coding environments, so I had simply assumed it probably would be in SQL Server 2008. In my Oracle example, the basic cost of generating 1,000,000 rows jumped from 7 seconds to 11 seconds when I switched from a small join to a single use of the CTE; the SQL Server 2008 example jumped from 13 seconds to 48 seconds.

Strategy 4: A Table-Valued Function

After temporarily abandoning the recursive CTE approach, I thought I’d try another “data-free” strategy, namely a table function (known in Oracle-speak as a pipelined function). I’ll start with defining the table function, as shown in Listing 6, and then explain the rationale.

CREATE FUNCTION table_function ( @i_in AS INTEGER )

RETURNS @table_result TABLE ( id INT )

BEGIN

DECLARE @loop INT = 1 ;

WHILE @loop <= @i_in

BEGIN

INSERT INTO @table_result

VALUES ( @loop ) ;

SET @loop = @loop + 1 ;

END ;

RETURN

END

Listing 6: The TVF approach

The table function has a return type which, as its name suggests, is a table and I have, in effect, declared a variable of type table and then written a simple loop that inserts rows into this table (in a variant of this test, I used the recursive CTE mechanism to generate the result set for the table function). The table name is @table_result, which indicates, through the @ symbol, that the table is a form of temporary table.

A note on temporary tables
One of the comments on the first article in this series pointed out that there are three different types of temporary table: #table_name for local scope, ##table_name for global scope, and @table_name for inner-local scope, so I’ve got plenty of work to do on checking how they work,what they cost , and whether I’ve used the right type here.

To use this function, I simply need to tweak the definition of my CTE, as shown in Listing 7.

WITH generator

AS ( SELECT id

FROM table_function(@driver)

)

INSERT INTO big_table

SELECT id ,

FLOOR(( id - 1 ) / @div) ,

( id - 1 ) % @mod ,

STR(id, 10, 0) ,

REPLICATE('x', 100)

FROM ( SELECT TOP ( @limit )

@driver * ( g1.id - 1 ) + g2.id id

FROM generator g1

CROSS JOIN generator g2

) iv

OPTION ( FORCE ORDER ) ;

Listing 7: Using the TVF with the generator CTE

This code took 15 seconds to complete, which is 2 seconds longer than the recursive CTE. I had assumed that the difference would be virtually non-existent, but perhaps the 2 seconds is the impact of the temporary table, or a difference due to the memory calls made for a table function. If this wasn’t just for a quick and dirty data builder I’d need to find out why the difference was as large as it was. In the meantime, though, I’ll just put this anomaly on my to-do list.

Again it’s worth a quick check to see what the impact would be of changing the cross-join into a single call to the table function for 1,000,000 rows: the answer is similar to the example with the recursive CTE; the run time changed from 15 seconds to 48 seconds. Again, we see that building a large object in memory can be very CPU intensive, while the SQL approach, in effect generating and consuming in a continuous stream, is much more efficient.

Strategy 5: Using a Temp Table with the CTE

At this point, two things occurred to me. First, I really needed to know how SQL Server was generating the data, so an investigation into generating and reading execution plans had become very important. Secondly, it struck me that I had been thinking too much in terms of Oracle-style solutions.

I had to try to think in terms of what might be a natural “SQL Server way” of attacking this problem, and probably there were a lot of SQL Server-style things I should have done before I got to this point (I’m sure you’ll let me know!).

Perhaps if I stopped playing around with recursive CTEs and table functions I could do something more efficient. The code in Listing 8 looks very similar to Listing 4 but there is an important intermediate step that I thought might make a difference.

WITH generator

AS ( SELECT 1 AS id

UNION ALL

SELECT id + 1

FROM generator

WHERE id < @driver

)

SELECT *

INTO #generator

FROM generator

OPTION ( MAXRECURSION 0 ) ;

INSERT INTO big_table

SELECT id ,

FLOOR(( id - 1 ) / @div) ,

( id - 1 ) % @mod ,

STR(id, 10, 0) ,

REPLICATE('x', 100)

FROM ( SELECT TOP ( @limit )

@driver * ( g1.id - 1 ) + g2.id id

FROM #generator g1

CROSS JOIN #generator g2

) iv

OPTION ( FORCE ORDER ) ;

Listing 8: Using a temp table with the CTE

The difference is that I’ve created a temporary table to hold my driving data set. One of the mantras of the Oracle enthusiast is that “you don’t need temporary tables” (it’s not true of course, but we usually hide them behind CTEs and non-mergeable views). In contrast, I’d noticed that in SQL Server they are quite openly and commonly used. With this temporary table in place I can be confident (I think!) that I won’t be re-generating a recursive CTE 1,000 times, or calling a table function 1,000 times. So, if any of my strategies so far was going to be noticeably quicker than the rest, my money was on this one. I was wrong. In fact, this version of the code took slightly longer than all the other versions. I have no idea why, of course, but maybe it’s just something to do with a slight difference in the way the temporary table is used in this case; time to start learning about XEvents, perhaps?

Concluding notes

After a few hours of playing around, I’ve got a way of generating useful volumes of data. I haven’t found a mechanism that is quite as fast as Oracle, but the timing is in the same ballpark, and when I got around to adding the PRIMARY KEY constraint and other indexes, the timing advantage swung the other way.

An interesting point here is that in Oracle, I added the PRIMARY KEY constraint with a single command but in SQL Server I had to add a NOT NULL constraint first. Is this a significant difference? When you know where to look (and I do when I’m using Oracle) you can uncover all sorts of interesting details. In fact, behind the scenes, adding a PRIMARY KEY constraint in Oracle requires two tablescans, one to add the NOT NULL constraint and one to create the index. So the database goes through the same amount of work as SQL Server, but Oracle is hiding two actions behind one command.

Of the various mechanisms I’ve tried, I’m going to use the recursive CTE of strategy 3 in the future, simply because it’s the most self-contained of the options I’ve considered.

I still need to sort out a reasonable method of generating pseudo-random data though, and I am curious about the join method that SQL Server has selected for the cross join that I’ve been using in all my non-procedural examples, so it’s nearly time to start looking closely at execution plans.

I’ll finish off with a couple of points about differences and similarities between the million row tables built by Oracle and SQL Server:

In Oracle my table spanned 18,261 blocks (pages) of 8KB while SQL Server spanned only 16,950 pages
Note to self: by default Oracle leaves 10% free space in a block for row lengths to increase on update, does SQL Server do something similar by default? (It’s got a fillfactor for indexes)
In Oracle, the primary key index spanned 2,087 blocks (again with 10% free space; the equivalent of “fillfactor 90 pad_index off” in SQL Server) while SQL Server spanned 2,228 pages. Note to self: is this because of a different storage strategy, or because of a different free space allocation?

In the next couple of articles I will be looking more closely at where my data is, and how SQL Server can choose to access it – because without this type of information it’s not possible to design an efficient database.

If all you want to do is to generate test data in a table incredibly quickly ….
“I generated better data in only seconds…” Michael Gaertner of Quintech was impressed with SQL Data Generator, a tool for populating databases with realistic test data. Download your free trial of SQL Data Generator.

Register for Simple Talk

Oracle to SQL Server: Crossing the Great Divide, Part 2

Data Generation Strategies

A Million Row Table: the Oracle way

Strategy 1: Row-by-Row

Strategy 2: Sys.objects, a CTE and row_number

Strategy 3: A Recursive CTE

Strategy 4: A Table-Valued Function

Strategy 5: Using a Temp Table with the CTE

Concluding notes

Article tags

About the author

Jonathan Lewis

Jonathan's contributions

Articles

Books

Top topics

Jonathan's latest contributions:

Oracle optimizer Or Expansion Transformations

Oracle optimizer removing or coalescing subqueries

Oracle subquery caching and subquery pushing

Data Generation Strategies

A Million Row Table: the Oracle way

Strategy 1: Row-by-Row

Strategy 2: Sys.objects, a CTE and row_number

Strategy 3: A Recursive CTE

Strategy 4: A Table-Valued Function

Strategy 5: Using a Temp Table with the CTE

Concluding notes

Article tags

Recommended

About the author

Jonathan Lewis

Jonathan's contributions

Articles

Books

Top topics

Jonathan's latest contributions:

Oracle optimizer Or Expansion Transformations

Oracle optimizer removing or coalescing subqueries

Oracle subquery caching and subquery pushing