Random names and the Birthday Paradox

I’m currently in the middle of giving Red Gate’s sample “Widgets” database a bit of a face lift, in preparation for the new version of SQL Data Compare 6 that’s coming soon. As part of this, I wanted to populate a pretty generic “Contacts” table with some sample names.

So, off I go to Mr Google, and find some lists of the Top {n} Baby Names, Top {n} Surnames and so on, and make myself a spreadsheet with both of those lists. Then combine that with a few RAND() calls and VLOOKUP()s, and I have my list of names. All good.

However, we have testers here, and they’re rather good. Out of my 60 odd names that I generated, there were quite a few duplicates – not just duplicated surnames, but duplicated firstname AND surname combinations. This surprised me quite a bit – I had about 50 surnames, and 25 first names, which I figured should give me 1,250 possible combinations. So why so many duplicates in a set of only 60 results?

Being rather early on a Friday morning, I opted for the brute force method of attacking the problem, and just added some more names – now I’ve got 100 surnames and 70 first names. But still, the duplicates were there. At this point I added a COUNTIF() function to the end of each result row so I could see the scale of the problem at a glance, and it was quite scary.

Upping the number of generated names to 199, I got the following distribution of appearances:

  • Once: 26
  • Twice: 66
  • Three times: 63
  • Four times: 24
  • Five times: 20

To put it another way, I only had 90 unique results out the 199 I’d generated. Ouch.

I’ve been well and truly bitten by the Birthday Paradox here (the thing that states that if you have more than 23 people in a room, the chances of two of them sharing the same birthday are more than 50%). A quick back-of-the envelope calculation suggests that with the size of my data sets, I’ll reach the 50% probability point after about 25 results.

So, what to do?

If I was coding this in C#, my first instinct would be to generate a combination, check if it was already in the set of results, and throw it out if it was, but the results I’ve got above should demonstrate that this is probably a lot less efficient than you might think.

A more interesting option is this: you have n possible combinations, and want m results. For result i, you generate a random number between (n/m)*i and (n/m)*(i+1). This guarantees you won’t get duplicates, and still gives you a pretty good random distribution, albeit not perfect.You could probably improve this with more complex distributions if you want truly random data, but that’s beyond my knowledge of statistics.