But the Database Worked in Development! Checking for Duplicates


Like all grownup database developers, you will do your best to ensure that there are unique constraints and unique indexes in place to prevent duplicates from ever getting into the database. These checks are essential, and the RDBMS needs very little effort to make them when data is inserted. Unless you are working with a very good business analyst, you'll always miss a few when you first create a database, so the addition or alteration of these duplicate checks is an essential part of database maintenance.

When you were able to do development work on the actual production database, you could check for and remove existing duplicates before you added these constraints, and then provide a data-migration script to go with the deployment. Nowadays, it is more awkward. You'll probably be required to work on masked or entirely generated data. You cannot check the data for duplicates if it isn't exactly what is on production, and it is difficult to predict what could be lurking in the production data.

Now, you have to do all your development work on the constraints and rely on someone else to run the code to remove the duplicates. This is potentially awkward, unless you change your habits to allow easier cooperation between teams. You can, however, very easily generate the code to do these checks, and, even if someone else runs the code, have enough detail about the offending rows to be able to correct the data in a migration script, which can be tested by Ops people on staging.

The same applies to an automated build process. When you’re building a database and then loading in data, it’s much better to check a dataset before constraints are enabled. Once again, what you need is a test script that runs the checks and reports on all the offending rows and only enables constraints if there is no data that would cause an error.

The first article in the series went over the principles of how you might do this as part of a deployment process, with check constraints. Now we move on to checking for duplicates, as defined by your unique indexes.

Reporting on what must be unique in a database

To determine what is defined as being unique in a database, you need to look at both unique constraints that cause a unique index to be created automatically, and unique indexes that are declared directly and explicitly. (All code samples in this series can be found on GitHub.)
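As an illustration, here is a minimal sketch of such a query (the version on GitHub is more complete). It lists every unique index, whether or not it backs a unique constraint, together with its key columns, and returns the result as JSON:

```sql
-- A minimal sketch: list every unique index or unique constraint on a
-- user table, with its key columns, as a JSON document (SQL Server 2016+).
SELECT Object_Schema_Name(i.object_id) AS [schema],
       Object_Name(i.object_id) AS [table],
       i.name AS [index],
       i.is_unique_constraint AS [isConstraint],
       Stuff( -- assemble the key columns into a comma-delimited list
         (SELECT ', ' + c.name
            FROM sys.index_columns ic
              INNER JOIN sys.columns c
                ON c.object_id = ic.object_id
               AND c.column_id = ic.column_id
           WHERE ic.object_id = i.object_id
             AND ic.index_id = i.index_id
             AND ic.is_included_column = 0
           ORDER BY ic.key_ordinal
             FOR XML PATH('')), 1, 2, '') AS [columns]
  FROM sys.indexes i
 WHERE i.is_unique = 1
   AND Objectproperty(i.object_id, 'IsUserTable') = 1
   --AND i.is_disabled = 0 -- uncomment to deliver just the enabled indexes
   FOR JSON PATH;
```
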

Note that we’ve chosen to comment out the line that delivers just the enabled constraints. It is a matter of choice, but if an index is there and disabled, I regard it with a certain suspicion.

This will produce a JSON report like this (I’ve just included a few rows):
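The exact shape of the report depends on the query you use; for the two unique indexes on Sales.SalesTaxRate that we'll meet later, the entries might look something like this (the values shown are illustrative):

```json
[
  {
    "schema": "Sales", "table": "SalesTaxRate",
    "index": "AK_SalesTaxRate_rowguid",
    "isConstraint": false, "columns": "rowguid"
  },
  {
    "schema": "Sales", "table": "SalesTaxRate",
    "index": "AK_SalesTaxRate_StateProvinceID_TaxType",
    "isConstraint": false, "columns": "StateProvinceID, TaxType"
  }
]
```
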

This is one of the reports that I do routinely after a successful build and before data is loaded. This JSON file will usually become part of the build package during deployment. In a sense, it is part of the documentation of the database.

We need to check the data and report on all the tables and their duplicate rows. We do this after the data is loaded and before the constraints are enabled. If we are doing a release by synchronizing with an existing version of the data, we run the check against the target database while its constraints are disabled, before a post-deployment script enables them.

We can, of course, generate the report and then use it immediately. We do this if we have built the database, ensured that constraints are disabled and then loaded the data.

Here is a batch that produces a JSON report. The batch not only checks for duplicates in the data but also checks that the table and columns mentioned in the constraint exist. If it can’t run the test because the columns aren’t there, then it reports the fact. It makes as few assumptions as possible.
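To give the flavour of it, here is a minimal sketch of that check for just one unique index, AK_SalesTaxRate_StateProvinceID_TaxType; the full batch on GitHub generates the equivalent dynamically for every unique index in the JSON report:

```sql
-- A minimal sketch of the duplicate check for one unique index.
-- It first verifies that the table's key columns exist, and only
-- then looks for sets of values that occur more than once.
DECLARE @TheErrors nvarchar(MAX), @TheDuplicates nvarchar(MAX);
IF (SELECT Count(*)
      FROM sys.columns
     WHERE object_id = Object_Id('Sales.SalesTaxRate')
       AND name IN ('StateProvinceID', 'TaxType')) < 2
  SELECT @TheErrors =
    'Sales.SalesTaxRate or one of its key columns is missing';
ELSE -- report every combination of key values that appears twice or more
  SELECT @TheDuplicates =
    (SELECT StateProvinceID, TaxType, Count(*) AS occurrences
       FROM Sales.SalesTaxRate
      GROUP BY StateProvinceID, TaxType
     HAVING Count(*) > 1
        FOR JSON PATH);
SELECT Coalesce(@TheDuplicates, '[]') AS duplicates,
       Coalesce(@TheErrors, '')       AS errors;
```
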

Note that this script must be checked with your security team, because actual data from the relevant columns of the duplicate rows is stored in the resulting report. There is no way around this, but the risk of 'leakage' is low, and it allows you to keep almost all sensitive data at arm's length.

I’ve now got this all wrapped up in a temporary stored procedure, TestAllUniqueIndexes.sql, which is available on GitHub.

Testing it out: finding duplicate rows in AdventureWorks

Well, this is all pretty straightforward, but we need to be able to test it. We’ll take poor old AdventureWorks2016, which has some tables that are very handy for the purpose. The Sales.SalesTaxRate table has two unique indexes on top of its surrogate primary key. One of them, AK_SalesTaxRate_rowguid, ensures that the rowguid is unique; the other, AK_SalesTaxRate_StateProvinceID_TaxType, ensures that there is only one tax rate for any combination of tax type and location. We can try introducing a duplicate:
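In outline (assuming the standard AdventureWorks2016 schema), the test goes something like this: disable both unique indexes, then insert an exact copy of an existing row:

```sql
-- Disable the two unique indexes so that the duplicate can get in
ALTER INDEX AK_SalesTaxRate_rowguid
  ON Sales.SalesTaxRate DISABLE;
ALTER INDEX AK_SalesTaxRate_StateProvinceID_TaxType
  ON Sales.SalesTaxRate DISABLE;

-- Insert a copy of an existing row. The surrogate key, SalesTaxRateID,
-- is an IDENTITY column, so we leave it out of the column list.
INSERT INTO Sales.SalesTaxRate
  (StateProvinceID, TaxType, TaxRate, Name, rowguid, ModifiedDate)
SELECT TOP 1 StateProvinceID, TaxType, TaxRate, Name, rowguid, ModifiedDate
  FROM Sales.SalesTaxRate;
```
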

This works because we’ve deliberately disabled the checks before we introduced a duplicate. We can test this by enabling the indexes:
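Re-enabling a disabled index means rebuilding it, and the rebuild fails if the data now breaches the index's uniqueness, along these lines:

```sql
-- This rebuild should now fail with a duplicate-key error
-- (Msg 1505; the exact wording of the message will vary)
ALTER INDEX AK_SalesTaxRate_rowguid
  ON Sales.SalesTaxRate REBUILD;
```
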

So, we’ve now confirmed that this data will trigger an error in a build if we enable the unique constraints and indexes. We can now run the code to check all the tables in the database. Before we do this, however, here’s the code you can run after your tests to return AdventureWorks to its former state:
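A sketch of that tidy-up, under the same assumptions: delete every copy but one of any row with a duplicated rowguid, then rebuild both indexes to re-enable them:

```sql
-- Remove every copy but one of any duplicated row...
WITH Dupes AS
  (SELECT Row_Number() OVER
            (PARTITION BY rowguid ORDER BY SalesTaxRateID) AS TheOrder
     FROM Sales.SalesTaxRate)
DELETE FROM Dupes WHERE TheOrder > 1;

-- ...and rebuild the indexes, which re-enables them
ALTER INDEX AK_SalesTaxRate_rowguid
  ON Sales.SalesTaxRate REBUILD;
ALTER INDEX AK_SalesTaxRate_StateProvinceID_TaxType
  ON Sales.SalesTaxRate REBUILD;
```
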

Running TestAllUniqueIndexes.sql produces a JSON report that tells us how to fix the problem.

This JSON report has two parts, the duplicate list and the errors. We can breathe a sigh of relief that there is no error list, which would mean that the database had changed enough that either columns or tables had changed their names. There are two duplications in the list, because two unique indexes were compromised by the row we inserted. The report tells you which values were duplicated, from the perspective of each unique constraint being checked.

This information is intended to give the developer enough detail to provide a script that corrects the data. That script will need to be tested on a database with realistic generated data, so you can run tests to ensure that only the duplicate rows are affected by your corrections.

Automating the process

Because, by using JSON, we’ve kicked the problem of inputs and outputs from procedures into touch, we can simplify the process. I’ve designed a process that runs as many setup and tear-down SQL files as you need, and which will do the following three essential processes:

  1. Check the new database after loading the data, but before the constraints and indexes are enabled. This runs a check that will tell you what you need to fix and put these into a report, for every database you specify on every server you specify. I realise that things start with just one database on one server but tend to escalate.
  2. Report on the constraints of a database. Check constraints, unique constraints and foreign key constraints are all reported on separately, in a file. The most important message is whether everything was OK, but all potential breaches of the proposed constraints are reported, along with any errors in the process.
  3. Use the report on the constraints to run checks on the data in the specified database.

Logically speaking, processes 2 and 3 are components of process 1. However, you are not always lucky enough, as a developer, to be in charge of loading and checking the data all in one process. Having them as two different processes gives you more flexibility to fit in with the deployment regime.

What we’ve done is to design stored procedures for testing, or gathering data for, the various constraints. Because they all take JSON input and produce JSON output, we can radically simplify the PowerShell script we use to do the checks. We must keep to a common convention for alerting the user to the presence of rows that would trigger constraint errors if constraints were enabled, and we need a common way of reporting when we can’t even run the test.

Because there are a lot of files and code, I’ve created a GitHub site with all the code in it here.

Summary

There are two points in any build, test or deployment where you can get into difficulties with your data because you have duplicates, bad data or data that has lost its relational integrity: firstly, when you do a build, temporarily disable constraints and bulk-load (BCP) the data; and secondly, when you synchronize with a version of the database that does more checking of the data. If you have existing bad data, you need a way of fixing it. To do that, you need to know about the data that would fail the checks your constraints would make if they were enabled.

We need slightly different ways of testing check constraints (bad-data checks), unique constraints (duplicate checks) and foreign key constraints (relational-integrity checks). We can store the list of constraints from the source database as a JSON file, and we can then take this list as a source and store the results of our tests in a JSON report file.

To run all these tests in a flexible way that fits in with a wide range of deployment methods, I’ve devised a general-purpose, data-driven way of running these tests and reports in PowerShell.

In my final article, I’ll describe how to check foreign key constraints.