This is the third in a series of articles (see Part 1 and Part 2) aiming to dissect and analyze the cleverest and fastest SQL solutions to the challenges posed by the Speed Phreak competitions, expose and explain the core set-based techniques on which they rely, and so make these techniques more widely-accessible to developers who need a faster and more scalable solution than the cursor can offer.
This time, Phil Factor has engaged some of the best minds in the T-SQL world to tackle the SSN matching SQL Problem, where entrants had to provide an SQL solution to establish the validity or otherwise of user-entered, 9-digit Social Security Numbers (SSNs), compared against a table of validated numbers, and according to a set of established matching rules.
The existing cursor-based solution, referred to affectionately as “The Fu Bar Code” worked, but took over 70 pain-filled minutes to execute. Astonishingly, the vast majority of entrants produced alternative set-based solutions that executed in well under one second, proving just how much time and processing power can be saved by a bit of brain power coupled with some solid set-based techniques. The winning SQL solution was submitted by Matt Whitfield which, on the sample data set, ran in under 0.5 s.
SSSIS and CLR-based solutions
There was very little in it, but on this occasion the fastest entry of all was based on a very clever CLR algorithm, submitted by Johan Ãhlén (320 ms) and second placed entry exploited the multi-threading power of SSIS. These solutions will be covered in separate articles.
Although I didn’t enter the competition, I decided to join in the fun. After investigating and explaining the supplied cursor solution, I will present the “Aunt Kathi” SQL solution, devised before I examined any of the competition entries, which attempts to walk the fine line between outright speed and ease-of-understanding (and therefore ease-of-maintenance), and performs comparably with many of the set-based solutions submitted to the competition.
Finally, I’ll dissect the winning SQL solution from Matt, and point out where and how he managed to squeeze out a bit of extra speed.
The SSN Matching Rules
The sample data for the problem consists of two tables: one holding a set of user entered US social security numbers (UserEntered), and one holding a set of validated social security numbers (Validated). The goal is to produce a list of the user SSNs that exactly match the validated entries, along with those that are closely matched, but differ by one or two digits (due to user error during entry).
If a user-entered SSN has two non-matching digits, compared to a validated SSN, then it is only considered a close match if the non-matching digits are not adjacent. In other words, the non-matching digits must both be in odd positions or must both be in even positions. If a user-entered SSN matches a validated SSN exactly, then it cannot be used with any other validated SSNs. In other words, if there is an exact match then other validated SSNS that are a close match for that user-entered SSN should be excluded from the results. Conversely, a non-matching user SSN may closely match more than one validated SSNs.
The final results must contain the user SSN, the matching or closely matching validated SSN and a status. The status will be 0 for an exact match, 1 for one incorrect digit, or 2 for two incorrect digits. Any user SSNs that are not matches, or close matches, will not be part of the results. The final results must be sorted by validated SSN, user-submitted SSN, and status.
To illustrate how these rules work in practice, let’s look at some sample data and results. Table 1 contains a few validated SSNs from the Validated table and table 2 contains some UserEntered data; both display the odd and even position digits to aid in comparisons. Table 3 gives the results and an explanation for each entry.
Validated SSN |
Odd Digits |
Even Digits |
007309486 |
07046 |
0398 |
007101415 |
07045 |
0111 |
007201415 |
07045 |
0211 |
User Entered SSN |
Odd Digits |
Even Digits |
007309486 |
07046 |
0398 |
007309488 |
07048 |
0398 |
017201415 |
07045 |
1211 |
Table 3 gives the results of a validation check between the entries in each table, and an explanation for each result.
User Entered SSN |
Validated SSN |
Status |
Explanation |
007309486 |
007309486 |
0 |
Exact match. |
007309486 |
007101415 007201417 |
N/A |
Don’t try to match additional entries if an exact match is found. |
007309488 |
007309486 |
1 |
The even digits match, and there is one incorrect digit in the odd positions. |
007309488 |
007101415 007201417 |
N/A |
Neither the even positions or odd are an exact match. |
017201415 |
007309486 |
N/A |
Neither the even nor odd positions are an exact match |
017201415 |
007101415 |
2 |
The odd positions match, and there are two incorrect digits in the even positions. |
017201415 |
007201417 |
1 |
The odd positions match, and there is one Incorrect digit in the even positions. |
Set up
To set up the challenge, follow these steps:
- Create a database on a non-production SQL Server 2005 or later instance.
- Download and unzip the Challenge5.zip file from here http://challenges.developerworkshop.net/challenge5.zip
- Run the “Setup.sql” script in your new database to create two tables.
- Run the “Populate.sql” script in your new database to populate the tables.
The Cursor-Based Solution
The cursor-based “Fu Bar Code”, provided as part of the challenge, gives the correct results but takes a very long time, about 70 minutes, to return those results, using the sample data. To save space, I am not going to provide the complete solution here, but you can find it in its entirety in the “Fu Bar.sql” script included with the Challenge5.zip file.
The opening section of the Fu Bar code (Listing 1) creates a temp table (#CheckedSSN) to hold the results and an accompanying non-clustered index. It then declares and opens the first cursor, to hold and examine each user-entered SSN. The current user-entered SSN value is retrieved using the FETCH statement and is stored in a variable (@SSN).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
SET NOCOUNT ON CREATE TABLE #CheckedSSN ( vSSN CHAR(9) NOT NULL, ueSSN CHAR(9) NOT NULL, [Status] TINYINT NOT NULL ) CREATE NONCLUSTERED INDEX IX_CheckedSSN ON #CheckedSSN ( ueSSN ASC ) INCLUDE ( [Status] ) DECLARE curUserEntered CURSOR FOR SELECT SSN FROM dbo.UserEntered DECLARE @SSN CHAR(9) OPEN curUserEntered FETCH NEXT FROM curUserEntered INTO @SSN WHILE @@FETCH_STATUS = 0 BEGIN |
Once inside the WHILE loop, the code inserts one or more rows into the #CheckedSSN table for each SSN value. The SELECT clause of the INSERT statement is based on a derived table – that is a sub-query within the FROM clause that takes the place of a table or view name. The sub-query compares each position (1 through 9) from the variable to the position of the values in the Validated table. The code uses nine CASE expressions, assigning a 0 for a match or a 1 for a non-match. Inside each CASE expression, the SUBSTRING function is used to parse out one position at a time. The benefit of using the derived table, in this case, is that the code can refer to the results by the aliases pos1, pos2, and so on, in other parts of the statement, instead of repeating the CASE expressions.
If we add together the values held in each odd position, and each even position, then an exact match will be indicated by the result “Odd 0 and Even 0”. Recall that, for a close match, at least the odd positions or even positions must match and that a maximum of two non-matched positions are allowed, which leads to four other possible outcomes: “Odd 2 and Even 0”; “Odd 0 and Even 2”; “Odd 1 and Even 0”; “Odd 0 and Even 1”. Any other outcome indicates a non-match and the results should not be included. The rather-complicated WHERE clause makes sure that only the results for matches and close matches are inserted. The abbreviated code can be found in Listing 2.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 |
INSERT #CheckedSSN ( vSSN, ueSSN, [Status] ) SELECT SSN AS vSSN, @SSN AS ueSSN, pos1 + pos2 + pos3 + pos4 + pos5 + pos6 + pos7 + pos8 + pos9 AS [Status] FROM ( SELECT SSN, CASE SUBSTRING(SSN, 1, 1) WHEN SUBSTRING(@SSN, 1, 1) THEN 0 ELSE 1 END AS pos1, CASE SUBSTRING(SSN, 2, 1) WHEN SUBSTRING(@SSN, 2, 1) THEN 0 ELSE 1 END AS pos2, FROM dbo.Validated ) AS d WHERE ( pos1 + pos3 + pos5 + pos7 + pos9 = 2 AND pos2 + pos4 + pos6 + pos8 = 0 ) OR ( pos1 + pos3 + pos5 + pos7 + pos9 = 0 AND pos2 + pos4 + pos6 + pos8 = 2 ) OR ( pos1 + pos3 + pos5 + pos7 + pos9 = 1 AND pos2 + pos4 + pos6 + pos8 = 0 ) OR ( pos1 + pos3 + pos5 + pos7 + pos9 = 0 AND pos2 + pos4 + pos6 + pos8 = 1 ) OR ( pos1 + pos3 + pos5 + pos7 + pos9 = 0 AND pos2 + pos4 + pos6 + pos8 = 0 ) |
At the bottom of the loop, the next value from the cursor is stored in @SSN as would be expected. After the loop, the code closes the curUserEntered cursor…and then declares another cursor (curMatch)!
At this point, the temp table should have all the required rows. The only problem is that any exactly matching user-entered values may also have some closely matching rows which should not be included. The second cursor eliminates those unneeded rows by deleting any rows with a status greater than 0 if the user-entered value also has a matching row with a status of 0. Unfortunately, the deletions are accomplished one SSN at a time. Sadly, even though it is very easy to delete all the invalid rows with a single DELETE statement, it is fairly common to see code in production applications and scripts that avoid such simplicity in favor of the brute-force cursor.
Finally, the results are displayed. Listing 3 shows the closing section of the script.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
DECLARE curMatch CURSOR FOR SELECT ueSSN FROM CheckedSSN WHERE [Status] = 0 OPEN curMatch FETCH NEXT FROM curMatch INTO @SSN WHILE @@FETCH_STATUS = 0 BEGIN DELETE FROM CheckedSSN WHERE ueSSN = @SSN AND [Status] > 0 FETCH NEXT FROM curMatch INTO @SSN END CLOSE curMatch DEALLOCATE curMatch SELECT vSSN, ueSSN, [Status] FROM CheckedSSN ORDER BY vSSN, ueSSN, [Status] |
The “Aunt Kathi” Solution
Before delving into the winning solution I decided, this time, to devise my own solution first and see how well it performed. Listing 4 shows the “Aunt Kathi” solution, with not a cursor in sight.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 |
--Create a temp table to hold the results CREATE TABLE #Results(ueSSN CHAR(9), vSSN CHAR(9), Status int ) --Insert the exact matches INSERT INTO #Results(ueSSN, vSSN, Status) SELECT ue.SSN, v.SSN, 0 FROM UserEntered AS ue JOIN Validated AS v ON ue.SSN = v.SSN --Create a temp table to hold remainder of SSN --with columns for each position CREATE TABLE #ueSSN(SSN CHAR(9), Pos1 CHAR(1), Pos2 CHAR(1), Pos3 CHAR(1), Pos4 CHAR(1), Pos5 CHAR(1), Pos6 CHAR(1), Pos7 CHAR(1), Pos8 CHAR(1), Pos9 CHAR(1)) --Populate the #ueSSN table. Parse out the SSN. --Leave out the exact matches! INSERT INTO #ueSSN(SSN, Pos1, Pos2, Pos3, Pos4, Pos5, Pos6, Pos7, Pos8, Pos9) SELECT SSN, SUBSTRING(SSN, 1, 1), SUBSTRING(SSN, 2, 1), SUBSTRING(SSN, 3, 1), SUBSTRING(SSN, 4, 1), SUBSTRING(SSN, 5, 1), SUBSTRING(SSN, 6, 1), SUBSTRING(SSN, 7, 1), SUBSTRING(SSN, 8, 1), SUBSTRING(SSN, 9, 1) FROM UserEntered WHERE SSN NOT IN (SELECT ueSSN FROM #Results) --Create a temp table for the validated SSNs along --with a column for each position CREATE TABLE #vSSN(SSN CHAR(9), Pos1 CHAR(1), Pos2 CHAR(1), Pos3 CHAR(1), Pos4 CHAR(1), Pos5 CHAR(1), Pos6 CHAR(1), Pos7 CHAR(1), Pos8 CHAR(1), Pos9 CHAR(1)) --Insert all validated SSN into #vSSN --Parse out the SSN INSERT INTO #vSSN(SSN, Pos1, Pos2, Pos3, Pos4, Pos5, Pos6, Pos7, Pos8, Pos9) SELECT SSN, SUBSTRING(SSN, 1, 1), SUBSTRING(SSN, 2, 1), SUBSTRING(SSN, 3, 1), SUBSTRING(SSN, 4, 1), SUBSTRING(SSN, 5, 1), SUBSTRING(SSN, 6, 1), SUBSTRING(SSN, 7, 1), SUBSTRING(SSN, 8, 1), SUBSTRING(SSN, 9, 1) FROM Validated --For those that match on the EVEN positions, --compare the odd positions. --Insert the rows where the odd positions add up to < 3 ;WITH Evens AS ( SELECT ue.SSN, v.SSN AS Validated, CASE WHEN ue.Pos1 = v.Pos1 THEN 0 ELSE 1 END + CASE WHEN ue.Pos3 = v.Pos3 THEN 0 ELSE 1 END + CASE WHEN ue.Pos5 = v.Pos5 THEN 0 ELSE 1 END + CASE WHEN ue.Pos7 = v.Pos7 THEN 0 ELSE 1 END + CASE WHEN ue.Pos9 = v.Pos9 THEN 0 ELSE 1 END AS Status FROM #ueSSN AS ue join #vSSN AS V ON ue.Pos2 = v.Pos2 and ue.Pos4 = v.Pos4 and ue.Pos6 = v.Pos6 and ue.Pos8 = v.Pos8) INSERT INTO #Results(ueSSN, vSSN, Status) SELECT SSN, Validated, Status FROM Evens WHERE Status < 3; --For those that match on the ODD positions, --compare the EVEN positions. --Insert the rows where the EVEN positions add up to < 3 ;WITH Odds AS ( SELECT ue.SSN, v.SSN AS Validated, CASE WHEN ue.Pos2 = v.Pos2 THEN 0 ELSE 1 END + CASE WHEN ue.Pos4 = v.Pos4 THEN 0 ELSE 1 END + CASE WHEN ue.Pos6 = v.Pos6 THEN 0 ELSE 1 END + CASE WHEN ue.Pos8 = v.Pos8 THEN 0 ELSE 1 END AS Status FROM #ueSSN AS ue join #vSSN AS V ON ue.Pos1 = v.Pos1 and ue.Pos3 = v.Pos3 AND ue.Pos5 = v.Pos5 AND ue.Pos7 = v.Pos7 AND ue.Pos9 = v.Pos9) INSERT INTO #Results(ueSSN, vSSN, Status) SELECT SSN, Validated, Status FROM Odds WHERE Status < 3 --Display results SELECT ueSSN, vSSN, Status FROM #Results ORDER BY ueSSN, vSSN, Status |
1 |
Listing 4: The "Aunt Kathi" code |
Just like the Fu Bar code, I decided to create a temp table to hold the results. Since the exact matches need no further processing, I decided to insert all of them up front. I knew that I was going to be comparing single positions for the rest of the entries, so I thought it would be more efficient to parse the digits out once and save the parsed-out values. I created temp tables for both the remaining user entered SSNs (#ueSSN) and the validated SSNs (#vSSN) with a column for each position. I inserted the user entered values, minus the exact matches, and used the SUBSTRING function to populate all the individual digits. I used the same technique to insert all of the validated entries into #vSSN.
A close match must match exactly on the even positions or the odd positions. When a column from one table must equal a value in a column from another table, you can join those tables on those columns. I could have created ODD and EVEN columns in my temp tables, but I decided to just join on multiple position columns. If the even positions matched, I just needed to figure out how many odd positions didn’t match and vice versa. In the SELECT clause, I used a CASE expression to assign 0 for matching digits or 1 for non-matching digits and added up the values aliasing the sum as Status. I wrapped this query in a common table expression (CTE) to avoid including the CASE expressions in my WHERE clause. I then only had to filter out any rows with Status values greather than two in the main query. The results of the two queries (one for even and one for odd) were inserted into the temp #Results table. Now the only thing left to do was display the results.
While my solution won’t win any awards, it is quite fast while still being easy to understand. I arrived at the solution by just thinking through what needed to be done and breaking the problem down to its components.
Matt Whitfield’s Winning SQL Solution
Not surprisingly, the winner Matt broke the problem down in a similar way as the “Aunt Kathi” code, but without the use of temp tables. The solution is actually only one complex statement which utilizes CTEs, derived tables, and UNION queries. The code creates two CTEs, one for the user entered values (UserEnteredTmp) and one for the validated values (ValidatedTmp). Not only do the CTEs have the digits parsed out, they also have ODD and EVEN columns. To create the user entered CTE, he used nested derived tables and a NOT IN sub-query. The matching values were eliminated from the user entered CTE at this point. The second CTE is similar, but there is no need to eliminate anything. They both parse the positions with SUBSTRING and concatenate back together to create the ODD and EVEN columns. Listing 5 shows the CTE definitions.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 |
WITH UserEnteredTmp ([u1], [u2], [u3], [u4], [u5], [u6], [u7], [u8], [u9], [odds], [evens], [ssn]) AS ( SELECT [u1], [u2], [u3], [u4], [u5], [u6], [u7], [u8], [u9], [u1] + [u3] + [u5] + [u7] + [u9], [u2] + [u4] + [u6] + [u8], [SSN] FROM ( SELECT SUBSTRING(SSN, 1, 1) AS u1, SUBSTRING(SSN, 2, 1) AS u2, SUBSTRING(SSN, 3, 1) AS u3, SUBSTRING(SSN, 4, 1) AS u4, SUBSTRING(SSN, 5, 1) AS u5, SUBSTRING(SSN, 6, 1) AS u6, SUBSTRING(SSN, 7, 1) AS u7, SUBSTRING(SSN, 8, 1) AS u8, SUBSTRING(SSN, 9, 1) AS u9, [SSN] FROM (SELECT [SSN] AS SSN FROM [dbo].[UserEntered] WHERE [SSN] NOT IN (SELECT [SSN] FROM [dbo].[Validated]) ) idat ) odat ), ValidatedTmp ([u1], [u2], [u3], [u4], [u5], [u6], [u7], [u8], [u9], [odds], [evens], [ssn]) AS ( SELECT [u1], [u2], [u3], [u4], [u5], [u6], [u7], [u8], [u9], [u1] + [u3] + [u5] + [u7] + [u9], [u2] + [u4] + [u6] + [u8], [ssn] FROM ( SELECT SUBSTRING(SSN, 1, 1) AS u1, SUBSTRING(SSN, 2, 1) AS u2, SUBSTRING(SSN, 3, 1) AS u3, SUBSTRING(SSN, 4, 1) AS u4, SUBSTRING(SSN, 5, 1) AS u5, SUBSTRING(SSN, 6, 1) AS u6, SUBSTRING(SSN, 7, 1) AS u7, SUBSTRING(SSN, 8, 1) AS u8, SUBSTRING(SSN, 9, 1) AS u9, [SSN] FROM [dbo].[Validated] ) odat ) |
After the CTE definitions comes the main query, which implements the matching logic. The actual meat of the main query is a derived table composed of a UNION SSN query. The first component of the union returns the exact matches. Matt chose to retrieve the exact matches using a sub-query of the validated SSNs, rather than by joining the two tables. Listing 6 shows the “exact matches” portion of the query.
1 2 3 4 5 6 |
SELECT [SSN] AS ueSSN, [SSN] AS vSSN, 0 AS [Status] FROM (SELECT [dbo].[UserEntered].[SSN] AS SSN FROM [dbo].[UserEntered] WHERE [SSN] IN (SELECT [SSN] FROM [dbo].[Validated])) AS idat UNION ALL |
Following the exact match query, Matt has a query for the even position matches and a query for the odd position matches. Since he created Odd and Even columns, it was easy to join UserEnteredTmp to ValidatedTmp. Listing 7 shows the query used to find the odd position matches. Just as in the Aunt Kathi solution, he used CASE to determine the number of matches in even positions and added them up, for the Status column. The subsequent rows were then filtered so that only rows with a status less than three are included.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
SELECT [ueSSN], [vSSN], [Status] FROM ( SELECT [UserEnteredTmp].[ssn] AS ueSSN, [ValidatedTmp].[ssn] AS vSSN, CASE WHEN [ValidatedTmp].[u2] = [UserEnteredTmp].[u2] THEN 0 ELSE 1 END + CASE WHEN [ValidatedTmp].[u4] = [UserEnteredTmp].[u4] THEN 0 ELSE 1 END + CASE WHEN [ValidatedTmp].[u6] = [UserEnteredTmp].[u6] THEN 0 ELSE 1 END + CASE WHEN [ValidatedTmp].[u8] = [UserEnteredTmp].[u8] THEN 0 ELSE 1 END AS [Status] FROM UserEnteredTmp INNER JOIN ValidatedTmp ON [UserEnteredTmp].[odds] = [ValidatedTmp].[odds]) AS idat WHERE [Status] < 3 UNION ALL |
The final section of code performs the even position matches, and the necessary ordering, as shown in Listing 8.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
SELECT [ueSSN], [vSSN], [Status] FROM ( SELECT [UserEnteredTmp].[ssn] AS ueSSN, [ValidatedTmp].[ssn] AS vSSN, CASE WHEN [ValidatedTmp].[u1] = [UserEnteredTmp].[u1] THEN 0 ELSE 1 END + CASE WHEN [ValidatedTmp].[u3] = [UserEnteredTmp].[u3] THEN 0 ELSE 1 END + CASE WHEN [ValidatedTmp].[u5] = [UserEnteredTmp].[u5] THEN 0 ELSE 1 END + CASE WHEN [ValidatedTmp].[u7] = [UserEnteredTmp].[u7] THEN 0 ELSE 1 END + CASE WHEN [ValidatedTmp].[u9] = [UserEnteredTmp].[u9] THEN 0 ELSE 1 END AS [Status] FROM UserEnteredTmp INNER JOIN ValidatedTmp ON [UserEnteredTmp].[evens] = [ValidatedTmp].[evens]) AS idat WHERE [Status] < 3 ) AS idd ORDER BY [vSSN], [ueSSN], [Status]; |
An interesting general point to note about the code is the use of UNION ALL instead of just UNION. Use of UNION ALL ensures that any duplicate rows remain in the results. In this case, no actual duplicates should exist; SQL Server just had to combine the results of the three queries and didn’t need to perform the work of finding duplicates and removing them. As such, one may be tempted to use UNION but If you take a look at the execution plans between otherwise identical UNION queries, you will see that the UNION query uses a Merge Join to pull the individual results together, while a UNION ALL uses a more-efficient Concatenation, and so allows Matt to squeeze out just a little bit more speed.
Table 4 shows the results when the three methods were applied to the sample data.
Solution |
Results |
Fu Bar code (cursor) |
70 to 75 minutes |
Aunt Kathi code (temp tables) |
< 1 second |
Winning code (CTE) |
391 milliseconds |
Conclusion
Once again, a set based CTE solution outperformed both cursor and temp table solutions. The keys to this puzzle were to break it down into the three components and to parse out the 9 digit SSNs into their individual digits. Here are a few more points to be learned:
- Spend some time thinking about the problem before you start writing queries.
- Use some kind of virtual table (CTE or derived table) to avoid using expressions in the WHERE clause.
- Avoid unneeded processing. In this example, don’t parse the matching rows.
- Use UNION ALL if possible instead of just UNION.
- Use CTEs to avoid temp tables where possible.
Load comments