Simple SQL: Random Thoughts

How does one get a truly random sample of data of a certain size from a SQL Server database table. Well, there are simple non-portable tricks one can use, such as the NewID() function, but then refining those can be tricky. Take the Rand() function for a start. Can it really provide you with a truly random number? Why doesn't the TABLESAMPLE clause give you a set number of rows? Joe Celko scratches his head a bit, explains some of the issues and invites some suggestions and tricks from readers.

“Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin.” – John von Neumann

John von Neumann was not against the use of pseudo-random numbers, but he thought that you will be very, very careful about using them (“Various Techniques Used in Connection with Random Digits” by John von Neumann in Monte Carlo Method (1951) edited by A. S. Householder, G. E. Forsythe, and H. H. Germond).

Randomness is a good topic in itself, but I want to talk instead about getting a random sample out of a table in a database. Think of this as another one of my “aggregation” articles. Most SQL server programmers only know about the RAND() function and never think further about it; after all, Microsoft wrote it: Right?

There’s a fundamental mistake in just accepting any pseudo-random number generators as being truly random. First of all, there are two kinds of random samples; sampling with replacement and sampling without replacement. These terms come from the rather classic model of drawing numbered balls from an urn. This analogy pops up in every book on probability or introduction to statistics. When I draw a ball out do I keep it, or put it back in the urn? Since I seldom have a bunch of balls in an urn, I like to refer to these as “shooting dice” or “cutting cards” instead. If I have replacement, then I can get the same value in different drawings; if there is no replacement, once I have picked a value it does not occur again in the sample.

If a pseudo random number generator (PRNG to use the usual abbreviation), generates a prior value, then you’re hung in a cycle and will repeat the same series of pseudo-random numbers again. That is the nature of deterministic arithmetic expressions. PRNG algorithms are usually written at the machine level for speed. Here’s a simple one for 31 bit numbers, implemented in C.

There are various algorithms for picking the values of the parameters in the formula to give a complete cycle in their range. Unfortunately, this family of algorithms has some problems. One of the horror stories from the late 1960s was the discovery that RANDU (), one of the standard random number generators from IBM, was flawed.

The definition of randomness was that every number in the range of the function would occur at least once in the cycle. This one’s very obvious and easy to test for; but it’s not the only test. The next important factor is that the number should be evenly distributed over the range. That means no lumps or clusters. Unfortunately, RANDU() has a little problem creating triplets in its number space

(http://physics.ucsc.edu/~peter/115/randu.pdf). The upshot of all of this was at a lot of PhD theses were made questionable along with a lot of research.

Do not feel too smug about that, because Microsoft has the same sort of problems (Statistical tests of the IBM PC pseudorandom number generator.). The Microsoft generator has exhibited a number of flaws including, for some seeds, a lack of uniformity of generated sequences of numbers, and serial correlation within such sequences. If you pick the right seeds for the generator, then your generated sequences can at least pass the basic test for randomness.

Physical Random Number Generators

Before you get too depressed, it’s probably worth pointing out that when Alan Turing was doing his research with the Enigma machines, he had a physical random number generator. It was basically a vacuum tube (that’s what we had before transistors, for you young people). The tube counted electrical fluctuations. Later, the RAND Corporation created a similar machine, which they used to create a book entitled “A Million Random Digits with 100,000 Normal Deviates” (ISBN 978-0833030474), which was a research classic for decades. We actually used the book by opening it at random and reading off numbers. This actually wasn’t as good as we thought it was, because physical books tend to get breaks in their spine and flop open to certain sections over time. And before you ask, yes, there was a sequel – “A Million Random Digits THE SEQUEL: with Perfectly Uniform Distribution” (ISBN 978-146100250).

A similar machine, ERNIE, designed by the Bletchley Park codebreaking team in the 1940s, was used to generate random numbers for the UK Premium Bond lottery. The British Post Office made a great documentary called “The Importance of Being E.R.N.I.E.” which you can find on YouTube. Please don’t let the clothing, hairstyles and IT equipment make you laugh too hard.

In late 1960s, DEC (Digital Equipment Corporation) offered a circuit card that would plug in the bus of their PDP series of computers which had a little speck of some kind of radioactive material and a circuit to detect it – basically a Geiger counter on a card! In theory, if all our theories of quantum physics are right, radioactive decay is truly unpredictable and would meet the qualifications of a true random number generator. The bad news is that the circuit card had to also have that little three-bladed black & yellow propeller thingy to tell people there was radioactive material. That symbol, along with the three-bladed bio-hazard symbol are really scary for a lot of people. And since there really wasn’t much use for it outside of laboratories, such devices disappeared when Digital Equipment Corporation disappeared.

Nowadays, we use a variety of random natural events such as atmospheric noise. (see https://www.random.org/ ) to get a true random number generator (TRNG)

Quick Fix Trick

If you are simply trying to retrieve an exact number or percentage of random rows, use:

The NEWID() function is probably random enough for most applications. This, of course, is highly proprietary, but quick and easy. If you Google around a bit, you’ll find some other code for doing “good enough or quick work” sampling. Just don’t to expect to get your PhD with this kind of code

.

TABLESAMPLE Clause

TABLESAMPLE() is a feature that appears in SQL Server and other products. It is based on the SQL:2003 Standard.

An example is:

It will return a sample from the Personnel table, based on data pages, which has a size based on then number specified in the parameter. This sample table is the passed to the WHERE and SELECT clauses way. SQL Server, Postgres and DB2 have implemented this clause. The goal was to allow the discovery of general trends and patterns in data.

You’re supposed to record request a number of rows or percentage from a table as a sample. The big problem is that Microsoft has implemented this feature based on data pages, not rows (that’s what the SYSTEM keyword option means; we will talk about the BERNOULLI option shortly). This means that it cannot be guaranteed to return the number of rows you specified; it returns all of the rows from the data pages it found. Unless your table only has fixed width columns, the pages pulled out based on a percentage or row count could contain very different numbers of rows.

But a bigger problem from a theoretical viewpoint is by returning pages you get clustering. Think about when you insert data into a table, you do it in a cluster, usually sorted from an outside source. Thus, one page might have data from mostly County A, the next page might have data mostly from County B, etc. the physical order of data entry affects what is assigned to pages; the rows are not randomly distributed over the entire table. This gets even worse when you have a clustered index, which will cause the table to be sorted. This is essentially an electronic version of wearing a space in the spine of our book of random digits that we had in the old days!

REPEATABLE Option

You can make this more consistent using the REPEATABLE option, but that still won’t make it return the desired number of rows. This clause says that the pages used to get the sampling we use the same random seed, specified as a parameter in this option, each time you go to the base table. The hope is that the data will stay the same. Compare this to the REPEATABLE READ option in transactions.

Remember that TABLESAMPLE cannot be applied to derived tables, tables from linked servers, and tables derived from table-valued functions, rowset functions, or OPENXML. Nor can you use it in the definition of a view or an inline table-valued function. Essentially, it needs to go to a base table where the data physically exist in pages on the disk.

BERNOULLI Option

This sampling method will scan the whole table and randomly pick individual rows. It basically does “coin flip” for each row, so that each row has an equal likelihood of being in the sample. This algorithm gives much better random distribution but will be slower for small percentages; it is a table scan.

Microsoft has not implemented this yet. However, generalizing this to various sample distributions is going to be harder. SQL is being moved from a database language to some kind of statistical package, and I don’t feel this is good idea.

If anyone would like to post some of their tricks in the responses, please do so.