{"id":88061,"date":"2020-09-02T03:22:22","date_gmt":"2020-09-02T03:22:22","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=88061"},"modified":"2020-09-03T10:33:27","modified_gmt":"2020-09-03T10:33:27","slug":"weighted-randomization-in-t-sql","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/weighted-randomization-in-t-sql\/","title":{"rendered":"Weighted Randomization In T-SQL"},"content":{"rendered":"<p>When I am not working on or writing about SQL, I am posting theme park pictures on Twitter of DisneyWorld (<a href=\"https:\/\/twitter.com\/disneypicaday\">@DisneyPicADay<\/a>) and Dollywood (<a href=\"https:\/\/twitter.com\/dollywoodp\">@DollywoodP<\/a>). The fun part of it is taking the pictures. The HARD part is deciding what to post every day. So instead of picking things from my head, I decided to harness the power of SQL Server and build a random picture tweet generator using SQL Server (I installed an Express Edition server on my machine to house my &#8220;production&#8221; data :)). I loaded all of my prepared picture files in a filetable in SQL Server, and created tables for an inventory of all of the locations and attractions at the US Disney parks (Dollywood will come later), and tagged the pictures. The entire process is something I may post about later. Mostly to show some other cool SQL Server features, but when the code is clean enough, I will put it all out on github.<\/p>\n<p>Making a random number generator is pretty easy in SQL Server, just pick the top and bottom values and use the RAND() function:<\/p>\n<pre class=\"\"><code>DECLARE @MinValue int = 1, @MaxValue int = 10;<\/code> <code>\r\n\r\nSELECT FLOOR(RAND()*(@MaxValue-@MinValue+1))+@MinValue;<\/code><\/pre>\n<p>Run that code a few million times, it will give you a value between 1 and 10 every time, each approximately 10 percent of the time.<\/p>\n<p>The thing is, while I have 77 pictures of my favorite roller coaster, Expedition Everest at Disney&#8217;s Animal Kingdom, and 70 of The Tower of Terror at Hollywood Studios, I only have 2 of the Flame Tree Restaurant at Animal Kingdom and many other things that aren&#8217;t as exciting to post about. If I randomly choose attractions using a non-weighted random number generator, it would be just as likely to get the lesser items as the same frequency as the greater items. Hence, I want my popular items to come up most frequently, but every once in a while, I want to be surprised by something different.<\/p>\n<p>This is where I needed to build a weighted randomized value. (There is more than weighting to the process of choosing what to post, naturally, like picture usage, seasons, holidays, etc. This may come up in a later post.) So, for my example, I will create an object to store the list of items that I have to choose from. To keep it simple, I will create a very simple Item table, and give each item row a simple name. Just 10 items will be enough to demonstrate the concept, but this will work for a large number of items easily:<\/p>\n<pre class=\"\">CREATE SCHEMA Assets;\r\nGO\r\nCREATE TABLE Assets.Item\r\n(\r\n   ItemId int NOT NULL CONSTRAINT PKItem PRIMARY KEY,\r\n   Name varchar(20) NOT NULL CONSTRAINT AKItem UNIQUE\r\n);\r\nGO\r\nINSERT INTO Assets.Item(ItemId, Name)\r\nVALUES(1,'One'),(2,'Two'),(3,'Three'),(4,'Four'),\r\n      (5,'Five'),(6,'Six'),(7,'Seven'),(8,'Eight'),\r\n      (9,'Nine'),(10,'Ten');<\/pre>\n<p>Let&#8217;s start by showing the basic process. To store test data, I will create a table to hold the values that are randomly chosen. It is always important to test out your random number generators to make sure you get a reasonable distribution, at least most of the times you execute it.<\/p>\n<pre class=\"\">CREATE TABLE #HoldItemPicks(ItemId int, RandomNumber int);<\/pre>\n<p>Then I will insert 10000 items using GO &lt;repeat count&gt;:<\/p>\n<pre class=\"\">SET NOCOUNT ON; --This is essential when doing GO &lt;repeat&gt;\r\nGO\r\nTRUNCATE TABLE #HoldItemPicks; --for repeated testing\r\nGO\r\nDECLARE @MinRand int = 1, @MaxRand int = 10; --we know the id values\r\n--get the random number\r\nDECLARE @RandomNumber int = FLOOR(RAND()*(@MaxRand-@MinRand+1))+@MinRand;\r\n\r\n--insert the rows\r\nINSERT INTO #HoldItemPicks(ItemId, RandomNumber)\r\nVALUES(@RandomNumber,@RandomNumber);\r\nGO 10000\r\n\r\n--output the rows\r\nSELECT ItemId, COUNT(*)\r\nFROM #HoldItemPicks\r\nGROUP BY ItemId\r\nORDER BY COUNT(*) DESC;<\/pre>\n<p>Take a look at the distribution of values, and you should see something very similar to the following. It is a <em>RANDOM<\/em> number generator, so you might actually get the same values over and over. But most of the time, the distribution will be close to what you expect, with some random value having a few more values, and others having slightly less:<\/p>\n<p><code>ItemId <\/code><br \/>\n<code>----------- -----------<\/code><br \/>\n<code>4\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a01071<\/code><br \/>\n<code>10\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 1062<\/code><br \/>\n<code>1\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a01019<\/code><br \/>\n<code>9\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a01007<\/code><br \/>\n<code>8\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a01006<\/code><br \/>\n<code>5\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a01003<\/code><br \/>\n<code>3\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a01003<\/code><br \/>\n<code>7\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0952<\/code><br \/>\n<code>6\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0949<\/code><br \/>\n<code>2\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0928<\/code><\/p>\n<p>That method of using RAND() directly works fine, if you have no gaps in your sequence, but that is not ideal in most cases. An alternative is to use NEWID() to sort the output randomly, and insert the values in the table.<\/p>\n<pre class=\"\">TRUNCATE TABLE #HoldItemPicks;\r\nGO\r\n\r\n--Alternative:\r\nINSERT INTO #HoldItemPicks(ItemId)\r\nSELECT TOP (1) ItemId\r\nFROM Assets.Item\r\nORDER BY NEWID();\r\nGO 10000\r\n\r\nSELECT ItemId, COUNT(*)\r\nFROM #HoldItemPicks\r\nGROUP BY ItemId\r\nORDER BY COUNT(*) DESC<\/pre>\n<p>There will generally be no difference in this output from method, other than performance since instead of a quick random number, you have to generate a guid for every row, sort the values, and grab a row:<\/p>\n<p><code>ItemId <\/code><br \/>\n<code>----------- -----------<\/code><br \/>\n<code>5\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a01066<\/code><br \/>\n<code>9\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a01061<\/code><br \/>\n<code>3\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a01029<\/code><br \/>\n<code>8\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a01003<\/code><br \/>\n<code>4\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a01002<\/code><br \/>\n<code>6\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0990<\/code><br \/>\n<code>1\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0976<\/code><br \/>\n<code>7\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0969<\/code><br \/>\n<code>2\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0955<\/code><br \/>\n<code>10\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 949<\/code><\/p>\n<p>But to the problem at hand. I will want to make sure my best pictures get chosen more often that the others. Let&#8217;s say 9 and 10 are the ones that we want to see more often, so let&#8217;s make sure that they get 10 times the attention as everyone else. I will do this by adding a column to the Item table, like this:<\/p>\n<pre class=\"\">ALTER TABLE Assets.Item\r\n\u00a0 \u00a0ADD ProbabilityFactor tinyint NOT NULL\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0CONSTRAINT DFLTItem_ProbabilityFactor DEFAULT (1) --DEFAULT to lowest\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0CONSTRAINT CHKItem__ProbabilityFactorDomain \r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 CHECK (ProbabilityFactor BETWEEN 1 AND 100); --can increase up to 100 X\r\nGO\r\n--now set 9 and 10 to 10X so they can be 10 times more likely to show up.\r\nUPDATE Item\r\nSET ProbabilityFactor =ProbabilityFactor * 10\r\nFROM Assets.Item\r\nWHERE itemId IN (9,10);<\/pre>\n<p>Checking out the data:<\/p>\n<pre class=\"\">SELECT *\r\nFROM Assets.Item;<\/pre>\n<p>Pretty clear, 9 and 10 are set to be 10 times more likely to be chosen.<\/p>\n<p><code>ItemId\u00a0 \u00a0 \u00a0 Name\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0ProbabilityFactor<\/code><br \/>\n<code>----------- -------------------- -----------------<\/code><br \/>\n<code>1\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0One\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 1<\/code><br \/>\n<code>2\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0Two\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 1<\/code><br \/>\n<code>3\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0Three\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 1<\/code><br \/>\n<code>4\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0Four\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a01<\/code><br \/>\n<code>5\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0Five\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a01<\/code><br \/>\n<code>6\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0Six\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 1<\/code><br \/>\n<code>7\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0Seven\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 1<\/code><br \/>\n<code>8\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0Eight\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 1<\/code><br \/>\n<code>9\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0Nine\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a010<\/code><br \/>\n<code>10\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 Ten\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 10<\/code><\/p>\n<p>But how to implement this? I took my inspiration from lottery tickets and a post I can&#8217;t find again that was built for a different language. After I had finished, I found this post on StackOverflow with the base workings of a similar approach (<a href=\"https:\/\/stackoverflow.com\/questions\/26971198\/create-a-random-selection-weighted-on-number-of-points-sql\">https:\/\/stackoverflow.com\/questions\/26971198\/create-a-random-selection-weighted-on-number-of-points-sql<\/a>). The idea is, if 9 and 10 need to be 10 times more likely to be chosen, then I need to give them 10 times more tickets. I will do this using the following VIEW object:<\/p>\n<pre class=\"\">CREATE OR ALTER VIEW Assets.ItemLotterySetup\r\nAS\r\n   SELECT ItemId, \r\n   \r\n   --calculate ticket start the RunningTotal from the previous item\r\n           LAG(BaseRows.RunningTotal,1,0) OVER (ORDER BY ItemId) AS TicketStart,\r\n\r\n   --then the end is the running total + current probability (or # of tickets for the lottery)\r\n           BaseRows.EndingProbabilityFactorFactorBase + \r\n                LAG(BaseRows.RunningTotal,1,0) OVER (ORDER BY ItemId) -1 AS TicketEnd\r\nFROM (\r\n       SELECT ItemId, ProbabilityFactor AS EndingProbabilityFactorFactorBase,\r\n              --this provides the value for the start and end seperated \r\n              SUM(Item.ProbabilityFactor) OVER (ORDER BY ItemId) AS RunningTotal\r\nFROM Assets.Item ) AS BaseRows;<\/pre>\n<p>Query this view, and you will see the following, which we can use to do a between on the start and end values with a random value:<\/p>\n<p><code>ItemId\u00a0 \u00a0 \u00a0 BaseTicketsStart BaseTicketsEnd<\/code><br \/>\n<code>----------- ---------------- --------------<\/code><br \/>\n<code>1\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a00\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 0<\/code><br \/>\n<code>2\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a01\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 1<\/code><br \/>\n<code>3\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a02\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 2<\/code><br \/>\n<code>4\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a03\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 3<\/code><br \/>\n<code>5\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a04\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 4<\/code><br \/>\n<code>6\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a05\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 5<\/code><br \/>\n<code>7\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a06\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 6<\/code><br \/>\n<code>8\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a07\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 7<\/code><br \/>\n<code>9\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a08\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 17<\/code><br \/>\n<code>10\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 18\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a027<\/code><\/p>\n<p>So let&#8217;s truncate the temp table and run this test again (if you have a lot of values, and a non-volatile list, you could store this to a permanent table and index the values too):<\/p>\n<pre class=\"\">TRUNCATE TABLE #HoldItemPicks;\r\nGO\r\n\r\nDECLARE @MinTicket int, @MaxTicket int;\r\n\r\n--We need the start and end values. I calculate them, because you \r\n--could also add a filter to the process so the start and end\r\n--change\r\nSELECT @MinTicket = MIN(ItemLotterySetup.TicketStart),\r\n       @MaxTicket = MAX(ItemLotterySetup.TicketEnd)\r\nFROM Assets.ItemLotterySetup;\r\n\r\n--grab a random number\r\nDECLARE @RandomNumber int = FLOOR(RAND()*(@MaxTicket-@MinTicket+1))+@MinTicket;\r\n\r\n--fetch the value\r\nINSERT INTO #HoldItemPicks(ItemId, RandomNumber)\r\nSELECT ItemId, @RandomNumber\r\nFROM Assets.ItemLotterySetup\r\nWHERE @RandomNumber BETWEEN ItemLotterySetup.TicketStart AND ItemLotterySetup.TicketEnd;\r\nGO 10000<\/pre>\n<p>If you look at the data from the values you have generated:<\/p>\n<pre class=\"\">SELECT ItemId, COUNT(*)\r\nFROM #HoldItemPicks\r\nGROUP BY ItemId\r\nORDER BY COUNT(*) DESC;<\/pre>\n<p>Now you can see that 9 and 10 are indeed picked ~10 times more often:<\/p>\n<p><code>ItemId <\/code><br \/>\n<code>----------- -----------<\/code><br \/>\n<code>9\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a03614<\/code><br \/>\n<code>10\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 3598<\/code><br \/>\n<code>2\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0378<\/code><br \/>\n<code>1\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0366<\/code><br \/>\n<code>7\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0364<\/code><br \/>\n<code>3\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0347<\/code><br \/>\n<code>4\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0345<\/code><br \/>\n<code>8\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0341<\/code><br \/>\n<code>6\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0339<\/code><br \/>\n<code>5\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0308<\/code><\/p>\n","protected":false},"excerpt":{"rendered":"<p>When I am not working on or writing about SQL, I am posting theme park pictures on Twitter of DisneyWorld (@DisneyPicADay) and Dollywood (@DollywoodP). The fun part of it is taking the pictures. The HARD part is deciding what to post every day. So instead of picking things from my head, I decided to harness&#8230;&hellip;<\/p>\n","protected":false},"author":56085,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2],"tags":[],"coauthors":[19684],"class_list":["post-88061","post","type-post","status-publish","format-standard","hentry","category-blogs"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/88061","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/56085"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=88061"}],"version-history":[{"count":5,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/88061\/revisions"}],"predecessor-version":[{"id":88066,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/88061\/revisions\/88066"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=88061"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=88061"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=88061"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=88061"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}