{"id":2535,"date":"2007-09-07T04:08:00","date_gmt":"2007-09-07T04:08:00","guid":{"rendered":"https:\/\/test.simple-talk.com\/uncategorized\/random-names-and-the-birthday-paradox\/"},"modified":"2016-07-28T10:49:09","modified_gmt":"2016-07-28T10:49:09","slug":"random-names-and-the-birthday-paradox","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/random-names-and-the-birthday-paradox\/","title":{"rendered":"Random names and the Birthday Paradox"},"content":{"rendered":"<p>I&#8217;m currently in the middle of giving Red Gate&#8217;s sample &#8220;Widgets&#8221; database a bit of a face lift, in preparation for the new version of <a href=\"https:\/\/www.simple-talk.com\/community\/blogs\/richard\/archive\/2007\/08\/01\/34301.aspx\">SQL Data Compare 6<\/a> that&#8217;s coming soon. As part of this, I wanted to populate a pretty generic &#8220;Contacts&#8221; table with some sample names.<\/p>\n<p>So, off I go to Mr Google, and find some lists of the <i>Top {n} Baby Names<\/i>, <i>Top {n} Surname<\/i>s and so on, and make myself a spreadsheet with both of those lists. Then combine that with a few RAND() calls and VLOOKUP()s, and I have my list of names. All good.<\/p>\n<p>However, we have testers here, and they&#8217;re rather <i>good<\/i>. Out of my 60 odd names that I generated, there were quite a few duplicates &#8211; not just duplicated surnames, but duplicated firstname AND surname combinations. This surprised me quite a bit &#8211; I had about 50 surnames, and 25 first names, which I figured should give me 1,250 possible combinations. So why so many duplicates in a set of only 60 results?<\/p>\n<p>Being rather early on a Friday morning, I opted for the brute force method of attacking the problem, and just added some more names &#8211; now I&#8217;ve got 100 surnames and 70 first names. But still, the duplicates were there. At this point I added a COUNTIF() function to the end of each result row so I could see the scale of the problem at a glance, and it was quite scary.<\/p>\n<p>Upping the number of generated names to 199, I got the following distribution of appearances:<\/p>\n<ul>\n<li>Once: 26<\/li>\n<li>Twice: 66<\/li>\n<li>Three times: 63<\/li>\n<li>Four times: 24<\/li>\n<li>Five times: 20<\/li>\n<\/ul>\n<p>To put it another way, I only had 90 unique results out the 199 I&#8217;d generated. Ouch.<\/p>\n<p>I&#8217;ve been well and truly bitten by the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Birthday_paradox\">Birthday Paradox<\/a> here (the thing that states that if you have more than 23 people in a room, the chances of two of them sharing the same birthday are more than 50%). A quick back-of-the envelope calculation suggests that with the size of my data sets, I&#8217;ll reach the 50% probability point after about 25 results.<\/p>\n<p>So, what to do?<\/p>\n<p>If I was coding this in C#, my first instinct would be to generate a combination, check if it was already in the set of results, and throw it out if it was, but the results I&#8217;ve got above should demonstrate that this is probably a lot less efficient than you might think.<\/p>\n<p>A more interesting option is this: you have <i>n<\/i> possible combinations, and want <i>m<\/i> results. For result <i>i<\/i>, you generate a random number between <i>(n\/m)*i<\/i> and <i>(n\/m)*(i+1)<\/i>. This guarantees you won&#8217;t get duplicates, and still gives you a pretty good random distribution, albeit not perfect.You could probably improve this with more complex distributions if you want truly random data, but that&#8217;s beyond my knowledge of statistics.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;m currently in the middle of giving Red Gate&#8217;s sample &#8220;Widgets&#8221; database a bit of a face lift, in preparation for the new version of SQL Data Compare 6 that&#8217;s coming soon. As part of this, I wanted to populate a pretty generic &#8220;Contacts&#8221; table with some sample names. So, off I go to Mr&#8230;&hellip;<\/p>\n","protected":false},"author":169277,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2],"tags":[],"coauthors":[],"class_list":["post-2535","post","type-post","status-publish","format-standard","hentry","category-blogs"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/2535","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/169277"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=2535"}],"version-history":[{"count":2,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/2535\/revisions"}],"predecessor-version":[{"id":41554,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/2535\/revisions\/41554"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=2535"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=2535"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=2535"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=2535"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}