PASS Data Community Summit logo

PASS Data Community Summit 2023

Get all the latest announcements direct to your inbox

Weighted Randomization In T-SQL

When I am not working on or writing about SQL, I am posting theme park pictures on Twitter of DisneyWorld (@DisneyPicADay) and Dollywood (@DollywoodP). The fun part of it is taking the pictures. The HARD part is deciding what to post every day. So instead of picking things from my head, I decided to harness the power of SQL Server and build a random picture tweet generator using SQL Server (I installed an Express Edition server on my machine to house my “production” data :)). I loaded all of my prepared picture files in a filetable in SQL Server, and created tables for an inventory of all of the locations and attractions at the US Disney parks (Dollywood will come later), and tagged the pictures. The entire process is something I may post about later. Mostly to show some other cool SQL Server features, but when the code is clean enough, I will put it all out on github.

Making a random number generator is pretty easy in SQL Server, just pick the top and bottom values and use the RAND() function:

Run that code a few million times, it will give you a value between 1 and 10 every time, each approximately 10 percent of the time.

The thing is, while I have 77 pictures of my favorite roller coaster, Expedition Everest at Disney’s Animal Kingdom, and 70 of The Tower of Terror at Hollywood Studios, I only have 2 of the Flame Tree Restaurant at Animal Kingdom and many other things that aren’t as exciting to post about. If I randomly choose attractions using a non-weighted random number generator, it would be just as likely to get the lesser items as the same frequency as the greater items. Hence, I want my popular items to come up most frequently, but every once in a while, I want to be surprised by something different.

This is where I needed to build a weighted randomized value. (There is more than weighting to the process of choosing what to post, naturally, like picture usage, seasons, holidays, etc. This may come up in a later post.) So, for my example, I will create an object to store the list of items that I have to choose from. To keep it simple, I will create a very simple Item table, and give each item row a simple name. Just 10 items will be enough to demonstrate the concept, but this will work for a large number of items easily:

Let’s start by showing the basic process. To store test data, I will create a table to hold the values that are randomly chosen. It is always important to test out your random number generators to make sure you get a reasonable distribution, at least most of the times you execute it.

Then I will insert 10000 items using GO <repeat count>:

Take a look at the distribution of values, and you should see something very similar to the following. It is a RANDOM number generator, so you might actually get the same values over and over. But most of the time, the distribution will be close to what you expect, with some random value having a few more values, and others having slightly less:

ItemId
----------- -----------
4           1071
10          1062
1           1019
9           1007
8           1006
5           1003
3           1003
7           952
6           949
2           928

That method of using RAND() directly works fine, if you have no gaps in your sequence, but that is not ideal in most cases. An alternative is to use NEWID() to sort the output randomly, and insert the values in the table.

There will generally be no difference in this output from method, other than performance since instead of a quick random number, you have to generate a guid for every row, sort the values, and grab a row:

ItemId
----------- -----------
5           1066
9           1061
3           1029
8           1003
4           1002
6           990
1           976
7           969
2           955
10          949

But to the problem at hand. I will want to make sure my best pictures get chosen more often that the others. Let’s say 9 and 10 are the ones that we want to see more often, so let’s make sure that they get 10 times the attention as everyone else. I will do this by adding a column to the Item table, like this:

Checking out the data:

Pretty clear, 9 and 10 are set to be 10 times more likely to be chosen.

ItemId      Name                 ProbabilityFactor
----------- -------------------- -----------------
1           One                  1
2           Two                  1
3           Three                1
4           Four                 1
5           Five                 1
6           Six                  1
7           Seven                1
8           Eight                1
9           Nine                 10
10          Ten                  10

But how to implement this? I took my inspiration from lottery tickets and a post I can’t find again that was built for a different language. After I had finished, I found this post on StackOverflow with the base workings of a similar approach (https://stackoverflow.com/questions/26971198/create-a-random-selection-weighted-on-number-of-points-sql). The idea is, if 9 and 10 need to be 10 times more likely to be chosen, then I need to give them 10 times more tickets. I will do this using the following VIEW object:

Query this view, and you will see the following, which we can use to do a between on the start and end values with a random value:

ItemId      BaseTicketsStart BaseTicketsEnd
----------- ---------------- --------------
1           0                0
2           1                1
3           2                2
4           3                3
5           4                4
6           5                5
7           6                6
8           7                7
9           8                17
10          18               27

So let’s truncate the temp table and run this test again (if you have a lot of values, and a non-volatile list, you could store this to a permanent table and index the values too):

If you look at the data from the values you have generated:

Now you can see that 9 and 10 are indeed picked ~10 times more often:

ItemId
----------- -----------
9           3614
10          3598
2           378
1           366
7           364
3           347
4           345
8           341
6           339
5           308