Generating Repeatable Sets Of Test Rows

In order to test graph structures, I needed a large set of random data. In some ways, this data will resemble the IMDB database I will include later in this chapter, but to make it one, controllable in size and two, random, I created this random dataset. I loaded a set of values for an account into a table and a set of interests. I wanted then to be able to load a set of random data into edge, related one account to another (one follows the other), and then an account to a random set of interests.

In this article, I will discuss a few techniques I used, starting with the simplest method using ordering by NEWID(), then using RAND() to allow the code to generate a fixed set of data.

The Setup

As the base data, I am going to create a table named: Demo.FetchRandomDataFrom that will serve as my stand-in for any table you want to randomly fetch data from.

USE Tempdb;

CREATE SCHEMA Demo;

SET NOCOUNT ON;

DROP TABLE IF EXISTS Demo.FetchRandomDataFrom

CREATE TABLE Demo.FetchRandomDataFrom

(

--this value should be the primary key of your table,

--ideally using the clustered index key for best

--performance, but definitelu unique.

FetchRandomDataFromId INT IDENTITY(2,4)

CONSTRAINT PKFetchRandomDataFrom PRIMARY KEY,

SomeOtherData CHAR(1000) NOT NULL

)

--Make the rows of a consequential size. Small row size often

--is an issue with random sets of data you want to use because

--it performs different than real tables.

INSERT INTO Demo.FetchRandomDataFrom

( SomeOtherData )

VALUES (REPLICATE('a',1000));

GO 100000 --create 100000 rows

Let’s take a look at a few rows in my table. Note that I made the surrogate key start at 2 and increment by 4. the goal was to make good and sure that the data was obviously not a sequential number. If your keys are pretty much sequential, you may not need to do some steps I have included, but unless it is a perfect set without gaps, that may come into play when trying to make repeatable outcomes from the same data.

SELECT TOP 10 *

FROM Demo.FetchRandomDataFrom

ORDER BY FetchRandomDataFromId

This returns:

FetchRandomDataFromId SomeOtherData

--------------------- ----------------------------------

2 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

6 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

10 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

14 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

18 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

22 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

26 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

30 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

34 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

38 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

This will be my stand in for a more realistic set of data where SomeOtherData would actually have been number, dates, values, etc.

Using NEWID()

If you simply need a random set of rows from a table, you can simply do something like this:

SELECT TOP 10 FetchRandomDataFromId

FROM Demo.FetchRandomDataFrom

ORDER BY NEWID();

This just sorts the data in a random order based on the NEWID() GUID values being generated. It is a wonderful technique you can do with almost any set of data, Executing the previous statement will return a random set of rows each time you execute it, such as:

FetchRandomDataFromId

---------------------

177818

138446

272154

171854

195266

105330

314422

203530

395666

365046

There are data sampling code types in SQL Server syntax, which you can read about here in an article by Derek Colley, but like this article also quotes, if you want a random set, most basic methods use NEWID() to get truly random outputs.

The biggest pro with that method is that it is simple and very much random. However, there are two downfalls to this method. First, no duplicated values will be found in that set. For most cases, that is great, but in other cases, you may wish to have the chance of the same value being generated multiple times (in my real case, I absolutely want the same node to end up connected to more than one node. Second, it is non repeatable. If you want to send this code to someone else, they will always get a different answer. Obviously, this may or may not be an issue, but I can’t count how many times I have generated a set, found something interesting about it and then lost the set.

Repeatable Random values

In order to make a random enough set that was repeatable, I then decided to take advantage of the RAND() function and one of its most interesting capabilities. This being that it isn’t really technical random as you can pass any number into it as a seed, and then all the next values generated are technically deterministic.

For example, execute the following code on any SQL Server:

SELECT RAND(1)

UNION ALL

SELECT RAND()

UNION ALL

SELECT RAND()

UNION ALL

SELECT RAND()

UNION ALL

SELECT RAND()

UNION ALL

SELECT RAND()

The output will be:

----------------------

0.713591993212924

0.472241415009636

0.943996112912269

0.0088524383111658

0.343443392823765

0.210903950196124

I got the same output on SQL Server 2019 and 2022 RC0. Even if you execute RAND(1) and then an hour later, the rest of the code, the “random” value that is output will be the same. So, in the following code, I will use this by letting the user set a seed (or just take a random seed when exploring data sets, but then telling the user the seed in case they wish to use it again). The following code builds on this concept using RAND() and loading a temp table of random data:

It is a lot of code, but I included quite a bit of comments to make it clear:

DROP TABLE IF EXISTS #HoldFetchRandomDataFrom

CREATE TABLE #HoldFetchRandomDataFrom

(

FetchRandomDataFromId INT PRIMARY KEY

)

DECLARE @FetchRandomDataFromId INT, --hold the fetched value

@RowsLoopCounter INT = 0, --number of rows fetched

@LoopCounter INT = 0, --number of loops made,

--to stop if too many

@RowsInTableCount INT, --Used to be able to set the

--RAND value

@SafetyFactor NUMERIC(3,1) = 2, --this times the loop

--counter for max loops

@DesiredRowCount INT = 100, --number of rows you want

--in output

@Seed INT = 1; --set specific value if you want

--repeatability

--default the seed (doing this, I give it a seed that

--is repeatable and will override the one that was set

--in the next execution unless you reconnect.

SET @seed = COALESCE(@seed,2147483647 * RAND())

--sets the seed to start generating

DECLARE @SetSeed INT = RAND(@seed)

SELECT CONCAT('Seed used: ',@seed);

--load all of the data into this table. The PrepId column will

--hold a sequential value to make random number easier

DROP TABLE IF EXISTS #PrepFetchRandomDataFrom

CREATE TABLE #PrepFetchRandomDataFrom

(

--Start at 0 because RAND can produce 0

PrepId INT IDENTITY(0,1),

FetchRandomDataFromId INT NOT NULL

)

INSERT INTO #PrepFetchRandomDataFrom

(

FetchRandomDataFromId

)

SELECT FetchRandomDataFromId

FROM Demo.FetchRandomDataFrom;

--the random number will be this times RAND. (it has to be

--captured in a generic bit of code because you could choose a

--value > the number of rows

SET @RowsInTableCount = (SELECT COUNT(*)

FROM #PrepFetchRandomDataFrom);

WHILE 1=1

BEGIN

--this tells you the number of loops done.

SET @Loopcounter = @LoopCounter + 1;

--this way, if some error keeps happening, you will

--eventually stop

IF @LoopCounter >= @DesiredRowCount * @SafetyFactor

BEGIN

RAISERROR('Loop stopped due to safety counter',10,1);

BREAK;

END;

--fetch one random row

SELECT @FetchRandomDataFromId = RAND() * @RowsInTableCount

--do the insert in a try. You may have criteria in your

--table that needs to be met (uniqueness, for example)

BEGIN TRY

--join to get the real value from the table

INSERT INTO #HoldFetchRandomDataFrom(FetchRandomDataFromId)

SELECT FetchRandomDataFromId

FROM #PrepFetchRandomDataFrom

WHERE PrepId = @FetchRandomDataFromId

--this is set inside the try, because the row has been

--added

SET @RowsLoopCounter = @RowsLoopCounter + 1;

--this is then end if the rows inserted match the desired

--rowcount

IF @RowsLoopCounter = @DesiredRowCount

BREAK;

END TRY

BEGIN CATCH

--most likely error is a duplicated value

DECLARE @msg NVARCHAR(1000);

--display the message in case you do something such

--that the loop can never complete

SET @msg = CONCAT('Some error occurred:'

,ERROR_MESSAGE())

RAISERROR(@msg,10,1) WITH NOWAIT; --immediate message

--Note, we keep going. This is what the safety factor

--was for, because an error could cause infinite loop

END CATCH;

END;

Output the rows:

SELECT *

FROM #HoldFetchRandomDataFrom

ORDER BY FetchRandomDataFromId;

Now since I set the seed value to 1, you will see the same output:

-----------------------

Seed used: 1

FetchRandomDataFromId

---------------------

1670

2870

3542

4606

6982

10626

12846

Clearly this is a LOT more code than simply TOP 1 ORDER BY NEWID(). Am I saying this is always necessary? Not at all. But it is a technique that you can use to generate a repeatable random set of data (as far as you care it to be random).

For example, you could test some code with a random set of data that you find an error. Instead of having to save that data off for the next person, you could just use the script. My reason for doing this is for a book I am working on. I want to be able to send a set of edges to the reader without making this huge script. So I needed two correlated random values. This method made that pretty easy.

What made me first try this was I wanted to generate edges, which need a from value and a to value. When I tried to do this with the NEWID() method in a loop 300000 times, it took close to forever to complete (I gave up at 150000 rows after that executed all night).

That is how this next piece of code was written:

100

101

102

103

104

105

106

DROP TABLE IF EXISTS #HoldFetchRandomDataFrom

CREATE TABLE #HoldFetchRandomDataFrom

(

FetchRandomDataFromId1 INT NOT NULL,

FetchRandomDataFromId2 INT NOT NULL,

--no duplicate rows

PRIMARY KEY (FetchRandomDataFromId1,

FetchRandomDataFromId2),

--no self relationship. Needs to be different values

CHECK (FetchRandomDataFromId1 <> FetchRandomDataFromId2)

)

DECLARE @FetchRandomDataFromId1 INT, --hold the first fetched

--value

@FetchRandomDataFromId2 INT, --the second fetched value

@RowsLoopCounter INT = 0, --number of rows fetched

@LoopCounter INT = 0, --number of loops made, to stop if

--too many

@RowsInTableCount INT, --Used to be able to set the RAND

--value

@SafetyFactor NUMERIC(3,1) = 2, --this times the loop

--counter for max loops

@DesiredRowCount INT = 100, -- number of rows you want

--in output

@Seed INT = 1; --set specific value if you want

--repeatability

--default the seeds (doing this, I give it a seed that

--is repeatable and will override the one that was set

--in the next execution unless you reconnect.

SET @seed = COALESCE(@seed,2147483647 * RAND())

--sets the seed to start generating

DECLARE @SetSeed INT = RAND(@seed)

SELECT CONCAT('Seed used:',@seed);

--load all of the data into this table. The PrepId column will

--hold a sequential value to make random number easier'

DROP TABLE IF EXISTS #PrepFetchRandomDataFrom

CREATE TABLE #PrepFetchRandomDataFrom

(

--Start at 0 because RAND can produce 0

PrepId INT IDENTITY(0,1),

FetchRandomDataFromId INT NOT NULL

)

INSERT INTO #PrepFetchRandomDataFrom

(

FetchRandomDataFromId

)

SELECT FetchRandomDataFromId

FROM Demo.FetchRandomDataFrom;

--the random number will be this times RAND. (it has to be

--captured in a generic bit of code because you could choose a

--value > the number of rows

SET @RowsInTableCount = (SELECT COUNT(*)

FROM #PrepFetchRandomDataFrom);

WHILE 1=1

BEGIN

--this tells you the number of loops done.

SET @Loopcounter = @LoopCounter + 1;

--this way, if some error keeps happening, you will

--eventually stop

IF @LoopCounter >= @DesiredRowCount * @SafetyFactor

BEGIN

RAISERROR('Loop stopped due to safety counter',10,1);

BREAK;

END;

--fetch one random row

SELECT @FetchRandomDataFromId1 = RAND() * @RowsInTableCount;

SELECT @FetchRandomDataFromId2 = RAND() * @RowsInTableCount;

--do the insert in a try. You may have criteria in your

--table that needs to be met (uniqueness, for example)

BEGIN TRY

--join to get the real value from the table

INSERT INTO #HoldFetchRandomDataFrom

(FetchRandomDataFromId1,FetchRandomDataFromId2)

VALUES((SELECT FetchRandomDataFromId

FROM #PrepFetchRandomDataFrom

WHERE PrepId = @FetchRandomDataFromId1),

(SELECT FetchRandomDataFromId

FROM #PrepFetchRandomDataFrom

WHERE PrepId =

@FetchRandomDataFromId2));

--this is set inside the try, because the row has

--been added

SET @RowsLoopCounter = @RowsLoopCounter + 1;

--this is then end if the rows inserted match the desired

--rowcount

IF @RowsLoopCounter = @DesiredRowCount

BREAK;

END TRY

BEGIN CATCH

--most likely error is a duplicated value

DECLARE @msg NVARCHAR(1000);

--display the message in case you do something such

--that the loop can never complete

SET @msg = CONCAT('Some error occurred:'

,ERROR_MESSAGE())

RAISERROR(@msg,10,1) WITH NOWAIT; --immediate message

--Note, we keep going. This is what the safety factor

--was for, because an error could cause infinite loop

END CATCH

END;

--Output the rows:

SELECT *

FROM #HoldFetchRandomDataFrom

ORDER BY FetchRandomDataFromId1;

Executing this returned:

---------------------

Seed used:1

FetchRandomDataFromId1 FetchRandomDataFromId2

---------------------- ----------------------

3542 137378

4606 301534

9346 31374

10626 110734

14702 106954

14978 279338

If you do a little digging, you can see the values from the previous generation in here since I used the same seed:

SELECT FetchRandomDataFromId1

FROM #HoldFetchRandomDataFrom

UNION ALL

SELECT FetchRandomDataFromId2

FROM #HoldFetchRandomDataFrom

ORDER BY FetchRandomDataFromId1

This returns:

FetchRandomDataFromId1

----------------------

1670

2870

3542

4606

6982

9346

10626

And the first, single row query returned (copied from earlier in the document:

FetchRandomDataFromId

---------------------

1670

2870

3542

4606

6982

10626

12846

You can see a new value in the sequence (which makes sense because here are 2X the number of random values generated), but still the same values.

Register for Simple Talk

Generating Repeatable Sets Of Test Rows

The Setup

Using NEWID()

Repeatable Random values

Article tags

About the author

Louis Davidson

Louis's contributions

Articles

Books

Top topics

Louis's latest contributions:

The good, the bad, and the awful taste muscle memory can have

Sponsoring Scenic City Summit

What about Loyalty?

The Setup

Using NEWID()

Repeatable Random values

Multiple Related Repeatable Random values

Article tags

Recommended

About the author

Louis Davidson

Louis's contributions

Articles

Books

Top topics

Louis's latest contributions:

The good, the bad, and the awful taste muscle memory can have

Sponsoring Scenic City Summit

What about Loyalty?