SSN Matching Speed Phreakery

On Ask.SQLServerCentral.com, a group of people interested in experimenting with heavily optimised SQL techniques try them out on a problem, using reasonbly large amounts of data. They aren't so interested in explaining the techniques, so Kathi continues on her mission to explain the lessons learned and the tips that can be derived.

This is the third in a series of articles (see Part 1 and Part 2) aiming to dissect and analyze the cleverest and fastest SQL solutions to the challenges posed by the Speed Phreak competitions, expose and explain the core set-based techniques on which they rely, and so make these techniques more widely-accessible to developers who need a faster and more scalable solution than the cursor can offer.

This time, Phil Factor has engaged some of the best minds in the T-SQL world to tackle the SSN matching SQL Problem, where entrants had to provide an SQL solution to establish the validity or otherwise of user-entered, 9-digit Social Security Numbers (SSNs), compared against a table of validated numbers, and according to a set of established matching rules.

The existing cursor-based solution, referred to affectionately as “The Fu Bar Code” worked, but took over 70 pain-filled minutes to execute. Astonishingly, the vast majority of entrants produced alternative set-based solutions that executed in well under one second, proving just how much time and processing power can be saved by a bit of brain power coupled with some solid set-based techniques. The winning SQL solution was submitted by Matt Whitfield which, on the sample data set, ran in under 0.5 s.

SSSIS and CLR-based solutions
There was very little in it, but on this occasion the fastest entry of all was based on a very clever CLR algorithm, submitted by Johan Ãhlén (320 ms) and second placed entry exploited the multi-threading power of SSIS. These solutions will be covered in separate articles.

Although I didn’t enter the competition, I decided to join in the fun. After investigating and explaining the supplied cursor solution, I will present the “Aunt Kathi” SQL solution, devised before I examined any of the competition entries, which attempts to walk the fine line between outright speed and ease-of-understanding (and therefore ease-of-maintenance), and performs comparably with many of the set-based solutions submitted to the competition.

Finally, I’ll dissect the winning SQL solution from Matt, and point out where and how he managed to squeeze out a bit of extra speed.

The SSN Matching Rules

The sample data for the problem consists of two tables: one holding a set of user entered US social security numbers (UserEntered), and one holding a set of validated social security numbers (Validated). The goal is to produce a list of the user SSNs that exactly match the validated entries, along with those that are closely matched, but differ by one or two digits (due to user error during entry).

If a user-entered SSN has two non-matching digits, compared to a validated SSN, then it is only considered a close match if the non-matching digits are not adjacent. In other words, the non-matching digits must both be in odd positions or must both be in even positions. If a user-entered SSN matches a validated SSN exactly, then it cannot be used with any other validated SSNs. In other words, if there is an exact match then other validated SSNS that are a close match for that user-entered SSN should be excluded from the results. Conversely, a non-matching user SSN may closely match more than one validated SSNs.

The final results must contain the user SSN, the matching or closely matching validated SSN and a status. The status will be 0 for an exact match, 1 for one incorrect digit, or 2 for two incorrect digits. Any user SSNs that are not matches, or close matches, will not be part of the results. The final results must be sorted by validated SSN, user-submitted SSN, and status.

To illustrate how these rules work in practice, let’s look at some sample data and results. Table 1 contains a few validated SSNs from the Validated table and table 2 contains some UserEntered data; both display the odd and even position digits to aid in comparisons. Table 3 gives the results and an explanation for each entry.

Validated SSN

Odd Digits

Even Digits

007309486

07046

0398

007101415

07045

0111

007201415

07045

0211

Table 1: Validated SSN

User Entered SSN

Odd Digits

Even Digits

007309486

07046

0398

007309488

07048

0398

017201415

07045

1211

Table 2: User Entered SSN

Table 3 gives the results of a validation check between the entries in each table, and an explanation for each result.

User Entered SSN

Validated SSN

Status

Explanation

007309486

007309486

0

Exact match.

007309486

007101415

007201417

N/A

Don’t try to match additional entries if an exact match is found.

007309488

007309486

1

The even digits match, and there is

one incorrect digit in the odd positions.

007309488

007101415

007201417

N/A

Neither the even positions or odd

are an exact match.

017201415

007309486

N/A

Neither the even nor odd positions are an

exact match

017201415

007101415

2

The odd positions match, and there are

two incorrect digits in the even positions.

017201415

007201417

1

The odd positions match, and there is one

Incorrect digit in the even positions.

Table 3: The Results

Set up

To set up the challenge, follow these steps:

  • Create a database on a non-production SQL Server 2005 or later instance.
  • Download and unzip the Challenge5.zip file from here http://challenges.developerworkshop.net/challenge5.zip
  • Run the “Setup.sql” script in your new database to create two tables.
  • Run the “Populate.sql” script in your new database to populate the tables.

The Cursor-Based Solution

The cursor-based “Fu Bar Code”, provided as part of the challenge, gives the correct results but takes a very long time, about 70 minutes, to return those results, using the sample data. To save space, I am not going to provide the complete solution here, but you can find it in its entirety in the “Fu Bar.sql” script included with the Challenge5.zip file.

The opening section of the Fu Bar code (Listing 1) creates a temp table (#CheckedSSN) to hold the results and an accompanying non-clustered index. It then declares and opens the first cursor, to hold and examine each user-entered SSN. The current user-entered SSN value is retrieved using the FETCH statement and is stored in a variable (@SSN).

Listing 1: A temp table and a cursor

Once inside the WHILE loop, the code inserts one or more rows into the #CheckedSSN table for each SSN value. The SELECT clause of the INSERT statement is based on a derived table – that is a sub-query within the FROM clause that takes the place of a table or view name. The sub-query compares each position (1 through 9) from the variable to the position of the values in the Validated table. The code uses nine CASE expressions, assigning a 0 for a match or a 1 for a non-match. Inside each CASE expression, the SUBSTRING function is used to parse out one position at a time. The benefit of using the derived table, in this case, is that the code can refer to the results by the aliases pos1, pos2, and so on, in other parts of the statement, instead of repeating the CASE expressions.

If we add together the values held in each odd position, and each even position, then an exact match will be indicated by the result “Odd 0 and Even 0”. Recall that, for a close match, at least the odd positions or even positions must match and that a maximum of two non-matched positions are allowed, which leads to four other possible outcomes: “Odd 2 and Even 0”; “Odd 0 and Even 2”; “Odd 1 and Even 0”; “Odd 0 and Even 1”. Any other outcome indicates a non-match and the results should not be included. The rather-complicated WHERE clause makes sure that only the results for matches and close matches are inserted. The abbreviated code can be found in Listing 2.

Listing 2: The abbreviated INSERT statement

At the bottom of the loop, the next value from the cursor is stored in @SSN as would be expected. After the loop, the code closes the curUserEntered cursor…and then declares another cursor (curMatch)!

At this point, the temp table should have all the required rows. The only problem is that any exactly matching user-entered values may also have some closely matching rows which should not be included. The second cursor eliminates those unneeded rows by deleting any rows with a status greater than 0 if the user-entered value also has a matching row with a status of 0. Unfortunately, the deletions are accomplished one SSN at a time. Sadly, even though it is very easy to delete all the invalid rows with a single DELETE statement, it is fairly common to see code in production applications and scripts that avoid such simplicity in favor of the brute-force cursor.

Finally, the results are displayed. Listing 3 shows the closing section of the script.

Listing 3: The second cursor and the SELECT

The “Aunt Kathi” Solution

Before delving into the winning solution I decided, this time, to devise my own solution first and see how well it performed. Listing 4 shows the “Aunt Kathi” solution, with not a cursor in sight.

Just like the Fu Bar code, I decided to create a temp table to hold the results. Since the exact matches need no further processing, I decided to insert all of them up front. I knew that I was going to be comparing single positions for the rest of the entries, so I thought it would be more efficient to parse the digits out once and save the parsed-out values. I created temp tables for both the remaining user entered SSNs (#ueSSN) and the validated SSNs (#vSSN) with a column for each position. I inserted the user entered values, minus the exact matches, and used the SUBSTRING function to populate all the individual digits. I used the same technique to insert all of the validated entries into #vSSN.

A close match must match exactly on the even positions or the odd positions. When a column from one table must equal a value in a column from another table, you can join those tables on those columns. I could have created ODD and EVEN columns in my temp tables, but I decided to just join on multiple position columns. If the even positions matched, I just needed to figure out how many odd positions didn’t match and vice versa. In the SELECT clause, I used a CASE expression to assign 0 for matching digits or 1 for non-matching digits and added up the values aliasing the sum as Status. I wrapped this query in a common table expression (CTE) to avoid including the CASE expressions in my WHERE clause. I then only had to filter out any rows with Status values greather than two in the main query. The results of the two queries (one for even and one for odd) were inserted into the temp #Results table. Now the only thing left to do was display the results.

While my solution won’t win any awards, it is quite fast while still being easy to understand. I arrived at the solution by just thinking through what needed to be done and breaking the problem down to its components.

Matt Whitfield’s Winning SQL Solution

Not surprisingly, the winner Matt broke the problem down in a similar way as the “Aunt Kathi” code, but without the use of temp tables. The solution is actually only one complex statement which utilizes CTEs, derived tables, and UNION queries. The code creates two CTEs, one for the user entered values (UserEnteredTmp) and one for the validated values (ValidatedTmp). Not only do the CTEs have the digits parsed out, they also have ODD and EVEN columns. To create the user entered CTE, he used nested derived tables and a NOT IN sub-query. The matching values were eliminated from the user entered CTE at this point. The second CTE is similar, but there is no need to eliminate anything. They both parse the positions with SUBSTRING and concatenate back together to create the ODD and EVEN columns. Listing 5 shows the CTE definitions.

Listing 5: The CTE definitions

After the CTE definitions comes the main query, which implements the matching logic. The actual meat of the main query is a derived table composed of a UNION SSN query. The first component of the union returns the exact matches. Matt chose to retrieve the exact matches using a sub-query of the validated SSNs, rather than by joining the two tables. Listing 6 shows the “exact matches” portion of the query.

Listing 6: The exact matching rows

Following the exact match query, Matt has a query for the even position matches and a query for the odd position matches. Since he created Odd and Even columns, it was easy to join UserEnteredTmp to ValidatedTmp. Listing 7 shows the query used to find the odd position matches. Just as in the Aunt Kathi solution, he used CASE to determine the number of matches in even positions and added them up, for the Status column. The subsequent rows were then filtered so that only rows with a status less than three are included.

Listing 7: The odd position matches

The final section of code performs the even position matches, and the necessary ordering, as shown in Listing 8.

Listing 8: The even position matches

An interesting general point to note about the code is the use of UNION ALL instead of just UNION. Use of UNION ALL ensures that any duplicate rows remain in the results. In this case, no actual duplicates should exist; SQL Server just had to combine the results of the three queries and didn’t need to perform the work of finding duplicates and removing them. As such, one may be tempted to use UNION but If you take a look at the execution plans between otherwise identical UNION queries, you will see that the UNION query uses a Merge Join to pull the individual results together, while a UNION ALL uses a more-efficient Concatenation, and so allows Matt to squeeze out just a little bit more speed.

Table 4 shows the results when the three methods were applied to the sample data.

Solution

Results

Fu Bar code (cursor)

70 to 75 minutes

Aunt Kathi code (temp tables)

< 1 second

Winning code (CTE)

391 milliseconds

Table 4: Compare the methods

Conclusion

Once again, a set based CTE solution outperformed both cursor and temp table solutions. The keys to this puzzle were to break it down into the three components and to parse out the 9 digit SSNs into their individual digits. Here are a few more points to be learned:

  • Spend some time thinking about the problem before you start writing queries.
  • Use some kind of virtual table (CTE or derived table) to avoid using expressions in the WHERE clause.
  • Avoid unneeded processing. In this example, don’t parse the matching rows.
  • Use UNION ALL if possible instead of just UNION.
  • Use CTEs to avoid temp tables where possible.