Celko’s Summer SQL Stumpers: Prime Numbers

Joe Celko kicks off our series of Summer SQL Stumpers with a challenge to improve on his solution to calculating the prime numbers between 1 and 10000. Once the various solutions have been contributed and judged, the winner will be announced. The competition will be run on Simple-Talk and SQL Server Central together.

A PRIME SQL PUZZLE

I was teaching SQL classes for YAPC-10 (“Yet Another PERL Conference” #10) at Carnegie Mellon University at the end of June 2009.  For the record, I have never used PERL and had to Google up an overview before I went; it is a very different creature from SQL. 

One of my students asked if you could write an SQL statement to generate the prime numbers less than 1000 (or any other limit) that scales well.  He was bothered by the lack of loops in SQL and a Prime Number sieve is a common PERL programming exercise.  You can Google it and see an animation at Eratosthenes’ sieve and some PERL code at Sieve of Eratosthenes with closures

My immediate answer was “sure, but you might have to use a recursive CTE to replace the loop.  Later I realized that was a really bad answer; you don’t need recursion, just a little math.  There are two useful facts from Number Theory:

  1. The prime factors of a given number (n) cannot be greater than ceiling (ân).  Think about it; by definition (ân * ân)) = n, and by definition, ceiling (ân) >= floor(ân) so integer rounding up will be safe.  This says that if I look at (a * b = c) where (a < b), then I don’t have to look at (b * a = c), so I can start searching for prime factors with small values. 
  2. All primes are of the form (6 * n ± 1), but not all number of that form are Primes.  For example (n = 1) gives us {5, 7} and they are both primes.  But for (n = 4) gives us {23, 25} where (25 = 5 * 5).  What this does is remove the multiples of 2 and 3 from consideration.

Let’s get all of that into SQL statements.  Let’s start with a table for the primes:

Now, your puzzle is to fill the table up to some limit, say 1000 just to keep it simple. 

 ANSWERS:

Let’s assume we have a table named Sequence with integers from 1 to (n) that we can use.  This is a common SQL programming idiom, so you don’t have to feel bad about using it.

There are lots of ways of filling this table, but here is one I like:

This template is easy to extend and the “.. + 1” gets rid of the zero. 

ANSWER #1

For the first attempt, let’s load the Primes table with candidate numbers using math fact #2 from above. 

An improvement which gets rid of the UNION ALL uses a table constant:

Now we have too many rows in Primes and need to remove the non-primes.  Now math fact #1 can come into play; test the set of numbers less than the square root to see if there is a factor among them. 

ANSWER #2

Another way to load the candidates into Primes is to have the first few known primes hardwired into a query.  This is a generalization of the math fact #2, which dealt with multiples of only 2 and 3. 

The idea is that if we can limit the candidate set for Primes, performance will improve.  At the extreme, if the list of "MOD (seq, <prime>)" expressions goes to a value equal or higher than the upper limit we are looking at, we get the answer immediately.

This is a good trick; many SQL programmers think that an IN() list can only be constants.  You might also want to look at how many values it can hold -It is larger than you think. 

Another candidate pruning trick is based on the math fact that integers with final digits {2, 4, 6, 8, 0} are even numbers; those with final digits {5, 0} are multiples of five. Let’s not look at them when we build a candidate table.

ANSWER #3

Another approach is to generate all of the non-primes and remove them from the Sequence table.

Obviously, the Sequence table in the left hand clause could be anyone of the trimmed candidate tables we previously constructed.  

What answers to do you have? As a hint, there are faster but more complicated algorithms, like the Sieve of Atkin and the various Wheel Sieves.

We have attached files containing code that runs in SQL 200 SQL 2005 and SQL 2008. Joe’s code in the article  has been fixed to run in SQL 2008.

The best answer to each stumper will be given a prize of a $100 Amazon voucher. The stumper will be run simultaneously on  SQL Server Central and Simple-Talk. To see all the comments so far, you will need to visit both sites. We will take entries for a week after the first Monday of publication,  posted as comments to the articles. Two weeks after the challenge is sent out, the judge’s decision and comments will be sent out in the newsletter, and published on the site.

Joe Celko and Phil Factor will judge the answers to this puzzle. Your answer should :
   1) Solve the problem — Duh!
   2) Avoid proprietary features in SQL Server that will not port or be good across
        all releases, present and future.
   3) Use Standard features in SQL Server that will be good across all releases,
        present and future. Extra points for porting code.
   4) Be clever but not obscure.
   5) Explain how you got your answer.