{"id":94825,"date":"2022-09-30T16:43:35","date_gmt":"2022-09-30T16:43:35","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=94825"},"modified":"2022-10-18T18:07:04","modified_gmt":"2022-10-18T18:07:04","slug":"generating-repeatable-sets-of-test-rows","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/generating-repeatable-sets-of-test-rows\/","title":{"rendered":"Generating Repeatable Sets Of Test Rows"},"content":{"rendered":"<p>In order to test graph structures, I needed a large set of random data. In some ways, this data will resemble the IMDB database I will include later in this chapter, but to make it one, controllable in size and two, random, I created this random dataset. I loaded a set of values for an account into a table and a set of interests. I wanted then to be able to load a set of random data into edge, related one account to another (one follows the other), and then an account to a random set of interests.<\/p>\n<p>In this article, I will discuss a few techniques I used, starting with the simplest method using ordering by <code>NEWID()<\/code>, then using <code>RAND()<\/code> to allow the code to generate a fixed set of data.<\/p>\n<h2>The Setup<\/h2>\n<p>As the base data, I am going to create a table named: <code>Demo.FetchRandomDataFrom<\/code> that will serve as my stand-in for any table you want to randomly fetch data from.<\/p>\n<pre class=\"lang:none theme:none\">USE Tempdb; \r\nGO\r\nCREATE SCHEMA Demo;\r\nGO\r\nSET NOCOUNT ON;\r\nDROP TABLE IF EXISTS  Demo.FetchRandomDataFrom\r\nGO\r\nCREATE TABLE Demo.FetchRandomDataFrom\r\n(\r\n    --this value should be the primary key of your table, \r\n    --ideally using the clustered index key for best \r\n    --performance, but definitelu unique.\r\n    FetchRandomDataFromId INT IDENTITY(2,4) \r\n  \tCONSTRAINT PKFetchRandomDataFrom PRIMARY KEY,\r\n    SomeOtherData CHAR(1000) NOT NULL\r\n)\r\nGO\r\n--Make the rows of a consequential size. Small row size often\r\n--is an issue with random sets of data you want to use because \r\n--it performs different than real tables.\r\nINSERT INTO Demo.FetchRandomDataFrom\r\n( SomeOtherData )\r\nVALUES (REPLICATE('a',1000));\r\nGO 100000 --create 100000 rows<\/pre>\n<p>Let&#8217;s take a look at a few rows in my table. Note that I made the surrogate key start at 2 and increment by 4. the goal was to make good and sure that the data was obviously not a sequential number. If your keys are pretty much sequential, you may not need to do some steps I have included, but unless it is a perfect set without gaps, that may come into play when trying to make repeatable outcomes from the same data.<\/p>\n<pre class=\"lang:none theme:none\">SELECT TOP 10 *\r\nFROM   Demo.FetchRandomDataFrom\r\nORDER BY FetchRandomDataFromId<\/pre>\n<p>This returns:<\/p>\n<pre class=\"lang:none theme:none \">FetchRandomDataFromId SomeOtherData\r\n--------------------- ----------------------------------\r\n2                     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\r\n6                     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\r\n10                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\r\n14                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\r\n18                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\r\n22                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\r\n26                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\r\n30                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\r\n34                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\r\n38                    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa<\/pre>\n<p>This will be my stand in for a more realistic set of data where <code>SomeOtherData<\/code> would actually have been number, dates, values, etc.<\/p>\n<h2>Using NEWID()<\/h2>\n<p>If you simply need a random set of rows from a table, you can simply do something like this:<\/p>\n<pre class=\"lang:none theme:none\">SELECT TOP 10 FetchRandomDataFromId\r\nFROM   Demo.FetchRandomDataFrom\r\nORDER BY NEWID();<\/pre>\n<p>This just sorts the data in a random order based on the <code>NEWID()<\/code> GUID values being generated. It is a wonderful technique you can do with almost any set of data, Executing the previous statement will return a random set of rows each time you execute it, such as:<\/p>\n<pre class=\"lang:none theme:none\">FetchRandomDataFromId\r\n---------------------\r\n177818\r\n138446\r\n272154\r\n171854\r\n195266\r\n105330\r\n314422\r\n203530\r\n395666\r\n365046<\/pre>\n<p>There are data sampling code types in SQL Server syntax, which you can read about here in <a href=\"https:\/\/www.mssqltips.com\/sqlservertip\/3157\/different-ways-to-get-random-data-for-sql-server-data-sampling\/\">an article<\/a> by <a href=\"https:\/\/www.mssqltips.com\/sqlserverauthor\/100\/derek-colley\/\">Derek Colley<\/a>, but like this article also quotes, if you want a random set, most basic methods use <code>NEWID()<\/code> to get truly random outputs.<\/p>\n<p>The biggest pro with that method is that it is simple and very much random. However, there are two downfalls to this method. First, no duplicated values will be found in that set. For most cases, that is great, but in other cases, you may wish to have the chance of the same value being generated multiple times (in my real case, I absolutely want the same node to end up connected to more than one node. Second, it is non repeatable. If you want to send this code to someone else, they will always get a different answer. Obviously, this may or may not be an issue, but I can\u2019t count how many times I have generated a set, found something interesting about it and then lost the set.<\/p>\n<h2>Repeatable Random values<\/h2>\n<p>In order to make a random enough set that was repeatable, I then decided to take advantage of the <code>RAND()<\/code> function and one of its most interesting capabilities. This being that it isn&#8217;t really technical random as you can pass any number into it as a seed, and then all the next values generated are technically deterministic.<\/p>\n<p>For example, execute the following code on any SQL Server:<\/p>\n<pre class=\"lang:none theme:none\">SELECT RAND(1)\r\nUNION ALL\r\nSELECT RAND()\r\nUNION ALL\r\nSELECT RAND()\r\nUNION ALL\r\nSELECT RAND()\r\nUNION ALL\r\nSELECT RAND()\r\nUNION ALL\r\nSELECT RAND()<\/pre>\n<p>The output will be:<\/p>\n<pre class=\"lang:none theme:none\">----------------------\r\n0.713591993212924\r\n0.472241415009636\r\n0.943996112912269\r\n0.0088524383111658\r\n0.343443392823765\r\n0.210903950196124<\/pre>\n<p>I got the same output on SQL Server 2019 and 2022 RC0. Even if you execute <code>RAND(1)<\/code> and then an hour later, the rest of the code, the &#8220;random&#8221; value that is output will be the same. So, in the following code, I will use this by letting the user set a seed (or just take a random seed when exploring data sets, but then telling the user the seed in case they wish to use it again). The following code builds on this concept using <code>RAND()<\/code> and loading a temp table of random data:<\/p>\n<p>It is a lot of code, but I included quite a bit of comments to make it clear:<\/p>\n<pre class=\"lang:none theme:none \">DROP TABLE IF EXISTS #HoldFetchRandomDataFrom\r\nGO\r\nCREATE TABLE #HoldFetchRandomDataFrom\r\n(\r\n    FetchRandomDataFromId INT PRIMARY KEY\r\n)\r\nDECLARE\t@FetchRandomDataFromId INT, --hold the fetched value\r\n\t\t@RowsLoopCounter INT = 0, --number of rows fetched\r\n\t\t@LoopCounter INT = 0, --number of loops made, \r\n                                --to stop if too many\r\n\t\t@RowsInTableCount INT, --Used to be able to set the \r\n                                 --RAND value\r\n\t\t@SafetyFactor NUMERIC(3,1) = 2, --this times the loop \r\n                                         --counter for max loops\r\n\t\t\r\n\t\t@DesiredRowCount INT = 100, --number of rows you want \r\n                                            --in output\r\n\t\t@Seed INT = 1; --set specific value if you want \r\n                              --repeatability\r\n\t\t--default the seed (doing this, I give it a seed that \r\n                --is repeatable and will override the one that was set \r\n                --in the next execution unless you reconnect.\r\n\t\tSET @seed = COALESCE(@seed,2147483647 * RAND())\r\n\t\t--sets the seed to start generating\r\n\r\n\t\tDECLARE @SetSeed INT = RAND(@seed)\r\n\r\n\t\tSELECT CONCAT('Seed used: ',@seed);\r\n\r\n--load all of the data into this table. The PrepId column will \r\n--hold a sequential value to make random number easier\r\nDROP TABLE IF EXISTS #PrepFetchRandomDataFrom \r\nCREATE TABLE #PrepFetchRandomDataFrom\r\n(\r\n     --Start at 0 because RAND can produce 0\r\n\tPrepId INT IDENTITY(0,1), \r\n\tFetchRandomDataFromId INT NOT NULL\r\n)\r\nINSERT INTO #PrepFetchRandomDataFrom\r\n(\r\n    FetchRandomDataFromId\r\n)\r\nSELECT FetchRandomDataFromId\r\nFROM  Demo.FetchRandomDataFrom;\r\n--the random number will be this times RAND. (it has to be \r\n--captured in a generic bit of code because you could choose a \r\n--value &gt; the number of rows\r\nSET @RowsInTableCount = (SELECT COUNT(*) \r\n                         FROM #PrepFetchRandomDataFrom);\r\nWHILE 1=1\r\n BEGIN\r\n\t--this tells you the number of loops done.\r\n\tSET @Loopcounter = @LoopCounter + 1;\r\n\t--this way, if some error keeps happening, you will \r\n     --eventually stop\r\n\tIF @LoopCounter &gt;= @DesiredRowCount * @SafetyFactor\r\n\t  BEGIN\r\n\t\t RAISERROR('Loop stopped due to safety counter',10,1); \r\n\t\t BREAK;\r\n\t  END;\r\n\t--fetch one random row\r\n\tSELECT @FetchRandomDataFromId = RAND() * @RowsInTableCount\r\n\t--do the insert in a try. You may have criteria in your \r\n     --table that needs to be met (uniqueness, for example)\r\n\tBEGIN TRY\r\n\t--join to get the real value from the table\r\n\tINSERT INTO #HoldFetchRandomDataFrom(FetchRandomDataFromId)\r\n\tSELECT FetchRandomDataFromId\r\n\tFROM    #PrepFetchRandomDataFrom\r\n\tWHERE  PrepId = @FetchRandomDataFromId\r\n\t\r\n\t--this is set inside the try, because the row has been \r\n     --added\r\n\tSET @RowsLoopCounter = @RowsLoopCounter + 1;\r\n\t--this is then end if the rows inserted match the desired \r\n     --rowcount\r\n\tIF @RowsLoopCounter = @DesiredRowCount\r\n\t BREAK;\r\n\tEND TRY\r\n\tBEGIN CATCH\r\n\t   --most likely error is a duplicated value\r\n\t   DECLARE @msg NVARCHAR(1000);\r\n\t   --display the message in case you do something such \r\n           --that the loop can never complete\r\n\t   SET @msg = CONCAT('Some error occurred:' \r\n                                             ,ERROR_MESSAGE())\r\n\t   RAISERROR(@msg,10,1) WITH NOWAIT; --immediate message \r\n\t   --Note, we keep going. This is what the safety factor \r\n          --was for, because an error could cause infinite loop\r\n\tEND CATCH;\r\nEND;<\/pre>\n<p>Output the rows:<\/p>\n<pre class=\"lang:none theme:none\">SELECT *\r\nFROM   #HoldFetchRandomDataFrom\r\nORDER BY FetchRandomDataFromId;<\/pre>\n<p>Now since I set the seed value to 1, you will see the same output:<\/p>\n<pre class=\"lang:none theme:none\">-----------------------\r\nSeed used: 1\r\n\r\nFetchRandomDataFromId\r\n---------------------\r\n1670\r\n2870\r\n3542\r\n4606\r\n6982\r\n10626\r\n12846<\/pre>\n<p>Clearly this is a LOT more code than simply <code>TOP 1 ORDER BY NEWID()<\/code>. Am I saying this is always necessary? Not at all. But it is a technique that you can use to generate a repeatable random set of data (as far as you care it to be random).<\/p>\n<p>For example, you could test some code with a random set of data that you find an error. Instead of having to save that data off for the next person, you could just use the script. My reason for doing this is for a book I am working on. I want to be able to send a set of edges to the reader without making this huge script. So I needed two correlated random values. This method made that pretty easy.<\/p>\n<h2>Multiple Related Repeatable Random values<\/h2>\n<p>What made me first try this was I wanted to generate edges, which need a from value and a to value. When I tried to do this with the NEWID() method in a loop 300000 times, it took close to forever to complete (I gave up at 150000 rows after that executed all night).<\/p>\n<p>That is how this next piece of code was written:<\/p>\n<pre class=\"lang:none theme:none\">DROP TABLE IF EXISTS #HoldFetchRandomDataFrom\r\nGO\r\nCREATE TABLE #HoldFetchRandomDataFrom\r\n(\r\n\t FetchRandomDataFromId1 INT NOT NULL,\r\n\t FetchRandomDataFromId2 INT NOT NULL,\r\n\t --no duplicate rows\r\n\t PRIMARY KEY (FetchRandomDataFromId1,\r\n                                 FetchRandomDataFromId2),\r\n\t --no self relationship. Needs to be different values\r\n\t CHECK (FetchRandomDataFromId1 &lt;&gt; FetchRandomDataFromId2)\r\n)\r\nDECLARE @FetchRandomDataFromId1 INT, --hold the first fetched \r\n                                     --value\r\n\t   @FetchRandomDataFromId2 INT, --the second fetched value\r\n\t   @RowsLoopCounter INT = 0, --number of rows fetched\r\n\t   @LoopCounter INT = 0, --number of loops made, to stop if \r\n                              --too many    \t        \r\n        @RowsInTableCount INT, --Used to be able to set the RAND \r\n                               --value\r\n        @SafetyFactor NUMERIC(3,1) = 2, --this times the loop  \r\n                               --counter for max loops\r\n\t\t\r\n        @DesiredRowCount INT = 100, -- number of rows you want \r\n                                    --in output\r\n        @Seed INT = 1; --set specific value if you want \r\n                        --repeatability\r\n\t\t--default the seeds (doing this, I give it a seed that \r\n          --is repeatable and will override the one that was set \r\n          --in the next execution unless you reconnect.\r\n\t\tSET @seed = COALESCE(@seed,2147483647 * RAND())\r\n\t\t--sets the seed to start generating\r\n\t\tDECLARE @SetSeed INT = RAND(@seed)\r\n\t\tSELECT CONCAT('Seed used:',@seed);\r\n--load all of the data into this table. The PrepId column will \r\n--hold a sequential value to make random number easier'\r\nDROP TABLE IF EXISTS #PrepFetchRandomDataFrom \r\nCREATE TABLE #PrepFetchRandomDataFrom\r\n(\r\n      --Start at 0 because RAND can produce 0\r\n\tPrepId INT IDENTITY(0,1), \r\n\tFetchRandomDataFromId INT NOT NULL\r\n)\r\nINSERT INTO #PrepFetchRandomDataFrom\r\n(\r\n    FetchRandomDataFromId\r\n)\r\nSELECT FetchRandomDataFromId\r\nFROM  Demo.FetchRandomDataFrom;\r\n--the random number will be this times RAND. (it has to be \r\n--captured in a generic bit of code because you could choose a \r\n--value &gt; the number of rows\r\nSET @RowsInTableCount = (SELECT COUNT(*) \r\n                         FROM #PrepFetchRandomDataFrom);\r\nWHILE 1=1\r\n BEGIN\r\n\t--this tells you the number of loops done.\r\n\tSET @Loopcounter = @LoopCounter + 1;\r\n\t--this way, if some error keeps happening, you will \r\n     --eventually stop\r\n\tIF @LoopCounter &gt;= @DesiredRowCount * @SafetyFactor\r\n\t  BEGIN\r\n\t\t RAISERROR('Loop stopped due to safety counter',10,1); \r\n\t\t BREAK;\r\n\t  END;\r\n    --fetch one random row\r\n    SELECT @FetchRandomDataFromId1 = RAND() * @RowsInTableCount;\r\n    SELECT @FetchRandomDataFromId2 = RAND() * @RowsInTableCount;\r\n\t--do the insert in a try. You may have criteria in your \r\n     --table that needs to be met (uniqueness, for example)\r\n\tBEGIN TRY\r\n\t--join to get the real value from the table\r\n\tINSERT INTO #HoldFetchRandomDataFrom\r\n              (FetchRandomDataFromId1,FetchRandomDataFromId2)\r\n\tVALUES((SELECT FetchRandomDataFromId\r\n\t\t\tFROM    #PrepFetchRandomDataFrom\r\n\t\t\tWHERE  PrepId = @FetchRandomDataFromId1),\r\n\t\t\t\t\t(SELECT FetchRandomDataFromId\r\n\t\t\t\t\tFROM    #PrepFetchRandomDataFrom\r\n\t\t\t\t\tWHERE  PrepId = \r\n                                   @FetchRandomDataFromId2));\r\n\t\r\n\t--this is set inside the try, because the row has \r\n     --been added\r\n\tSET @RowsLoopCounter = @RowsLoopCounter + 1;\r\n\t--this is then end if the rows inserted match the desired \r\n     --rowcount\r\n\tIF @RowsLoopCounter = @DesiredRowCount\r\n\t BREAK;\r\n\tEND TRY\r\n\tBEGIN CATCH\r\n\t\t--most likely error is a duplicated value\r\n\t\tDECLARE @msg NVARCHAR(1000);\r\n\t\t--display the message in case you do something such \r\n          --that the loop can never complete\r\n\t\tSET @msg = CONCAT('Some error occurred:' \r\n                                             ,ERROR_MESSAGE())\r\n\t\tRAISERROR(@msg,10,1) WITH NOWAIT; --immediate message \r\n\t\t--Note, we keep going. This is what the safety factor \r\n          --was for, because an error could cause infinite loop\r\n\tEND CATCH\r\nEND;\r\n--Output the rows:\r\nSELECT *\r\nFROM   #HoldFetchRandomDataFrom\r\nORDER BY FetchRandomDataFromId1;<\/pre>\n<p>Executing this returned:<\/p>\n<pre class=\"lang:none theme:none\">---------------------\r\nSeed used:1\r\nFetchRandomDataFromId1 FetchRandomDataFromId2\r\n---------------------- ----------------------\r\n3542                   137378\r\n4606                   301534\r\n9346                   31374\r\n10626                  110734\r\n14702                  106954\r\n14978                  279338<\/pre>\n<p>If you do a little digging, you can see the values from the previous generation in here since I used the same seed:<\/p>\n<pre class=\"lang:none theme:none\">SELECT FetchRandomDataFromId1\r\nFROM #HoldFetchRandomDataFrom\r\nUNION ALL\r\nSELECT FetchRandomDataFromId2\r\nFROM #HoldFetchRandomDataFrom\r\nORDER BY FetchRandomDataFromId1<\/pre>\n<p>This returns:<\/p>\n<p>FetchRandomDataFromId1<\/p>\n<pre class=\"lang:none theme:none\">----------------------\r\n1670\r\n2870\r\n3542\r\n4606\r\n6982\r\n9346\r\n10626<\/pre>\n<p>And the first, single row query returned (copied from earlier in the document:<\/p>\n<pre class=\"lang:none theme:none\">FetchRandomDataFromId\r\n---------------------\r\n1670\r\n2870\r\n3542\r\n4606\r\n6982\r\n10626\r\n12846<\/pre>\n<p>You can see a new value in the sequence (which makes sense because here are 2X the number of random values generated), but still the same values.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In order to test graph structures, I needed a large set of random data. In some ways, this data will resemble the IMDB database I will include later in this chapter, but to make it one, controllable in size and two, random, I created this random dataset. I loaded a set of values for an&#8230;&hellip;<\/p>\n","protected":false},"author":56085,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2,143531],"tags":[147173,147172,4151,4183,4745],"coauthors":[19684],"class_list":["post-94825","post","type-post","status-publish","format-standard","hentry","category-blogs","category-t-sql-programming-sql-server","tag-generating-data","tag-randomness","tag-sql-server","tag-t-sql","tag-transact-sql"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/94825","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/56085"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=94825"}],"version-history":[{"count":4,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/94825\/revisions"}],"predecessor-version":[{"id":94829,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/94825\/revisions\/94829"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=94825"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=94825"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=94825"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=94825"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}