PASS Data Community Summit logo

PASS Data Community Summit 2023

Get all the latest announcements direct to your inbox

Generating Repeatable Sets Of Test Rows

In order to test graph structures, I needed a large set of random data. In some ways, this data will resemble the IMDB database I will include later in this chapter, but to make it one, controllable in size and two, random, I created this random dataset. I loaded a set of values for an account into a table and a set of interests. I wanted then to be able to load a set of random data into edge, related one account to another (one follows the other), and then an account to a random set of interests.

In this article, I will discuss a few techniques I used, starting with the simplest method using ordering by NEWID(), then using RAND() to allow the code to generate a fixed set of data.

The Setup

As the base data, I am going to create a table named: Demo.FetchRandomDataFrom that will serve as my stand-in for any table you want to randomly fetch data from.

Let’s take a look at a few rows in my table. Note that I made the surrogate key start at 2 and increment by 4. the goal was to make good and sure that the data was obviously not a sequential number. If your keys are pretty much sequential, you may not need to do some steps I have included, but unless it is a perfect set without gaps, that may come into play when trying to make repeatable outcomes from the same data.

This returns:

This will be my stand in for a more realistic set of data where SomeOtherData would actually have been number, dates, values, etc.

Using NEWID()

If you simply need a random set of rows from a table, you can simply do something like this:

This just sorts the data in a random order based on the NEWID() GUID values being generated. It is a wonderful technique you can do with almost any set of data, Executing the previous statement will return a random set of rows each time you execute it, such as:

There are data sampling code types in SQL Server syntax, which you can read about here in an article by Derek Colley, but like this article also quotes, if you want a random set, most basic methods use NEWID() to get truly random outputs.

The biggest pro with that method is that it is simple and very much random. However, there are two downfalls to this method. First, no duplicated values will be found in that set. For most cases, that is great, but in other cases, you may wish to have the chance of the same value being generated multiple times (in my real case, I absolutely want the same node to end up connected to more than one node. Second, it is non repeatable. If you want to send this code to someone else, they will always get a different answer. Obviously, this may or may not be an issue, but I can’t count how many times I have generated a set, found something interesting about it and then lost the set.

Repeatable Random values

In order to make a random enough set that was repeatable, I then decided to take advantage of the RAND() function and one of its most interesting capabilities. This being that it isn’t really technical random as you can pass any number into it as a seed, and then all the next values generated are technically deterministic.

For example, execute the following code on any SQL Server:

The output will be:

I got the same output on SQL Server 2019 and 2022 RC0. Even if you execute RAND(1) and then an hour later, the rest of the code, the “random” value that is output will be the same. So, in the following code, I will use this by letting the user set a seed (or just take a random seed when exploring data sets, but then telling the user the seed in case they wish to use it again). The following code builds on this concept using RAND() and loading a temp table of random data:

It is a lot of code, but I included quite a bit of comments to make it clear:

Output the rows:

Now since I set the seed value to 1, you will see the same output:

Clearly this is a LOT more code than simply TOP 1 ORDER BY NEWID(). Am I saying this is always necessary? Not at all. But it is a technique that you can use to generate a repeatable random set of data (as far as you care it to be random).

For example, you could test some code with a random set of data that you find an error. Instead of having to save that data off for the next person, you could just use the script. My reason for doing this is for a book I am working on. I want to be able to send a set of edges to the reader without making this huge script. So I needed two correlated random values. This method made that pretty easy.

Multiple Related Repeatable Random values

What made me first try this was I wanted to generate edges, which need a from value and a to value. When I tried to do this with the NEWID() method in a loop 300000 times, it took close to forever to complete (I gave up at 150000 rows after that executed all night).

That is how this next piece of code was written:

Executing this returned:

If you do a little digging, you can see the values from the previous generation in here since I used the same seed:

This returns:

FetchRandomDataFromId1

And the first, single row query returned (copied from earlier in the document:

You can see a new value in the sequence (which makes sense because here are 2X the number of random values generated), but still the same values.