Constraining and checking JSON Data in SQL Server Tables

So you have a database with JSON in it. Can you validate it? I don’t mean just to ensure that it is valid JSON, but ensure that the JSON contains values that are legitimate. Are NI values, postcodes or bank codes valid? Can the dates or GUIDs be successfully parsed? Are those integers really integers? Are any of those dates of birth possible for a person who could be alive today? Are those part numbers valid?

The short answer is that if the JSON document represents a list or a table, then it is easy; but if it is a complex hierarchy, you are better off doing the checking in PowerShell, using JSON Schema. In this article, we’ll show how it is possible to use a mixture of ordinary constraints and table variables to achieve clean data.

Document columns

In relational databases, each column holds an ‘atomic’ value, meaning that it isn’t a collection of data items such as a list of values. A geographic location may be considered to be a collection of coordinate values, but from the data perspective it is just a location: nobody is ordinarily interested in the latitude or longitude in isolation, so from the perspective of the database users the geographic location can be treated as ‘atomic’. You check it as such.

Putting a data document into a column is a slippery slope. By ‘data document’, I’m referring to a structured string from which more than one data element can be stored or extracted. A list is a simple example; an XML document or fragment is another, and so is JSON. Whereas you can enforce constraints on what can be stored in an ordinary relational column, or validate an XML document with an XML schema, you can’t do the same with a JSON document or a list. If circumstances force you to store JSON data in a table and it is not ‘atomic’, it is your responsibility to check that the data within it is clean. In other words, if you have to store such a thing, then you have to know what sort of data is in there, and must prevent other data from being stored in there. You must, in fact, ‘constrain’ the data.

Why the need to constrain values?

Why? Because the data in a document column is very easily presented in a relational format in views. There is no ‘Chinese wall’ between JSON data and your well-checked and disciplined relational data. It is there, and it might be wrong. If you fail to enforce the rules that underlie any datatype, then you can get errors in all sorts of places, including, heaven help us, financial reporting. To take a silly example, if someone slips a new weekday into your innocuous list of days of the week, your daily revenues will be wrong. An inadvertent negative value in revenue figures can cause executives to run up and down corridors shouting. It is not only financial data that has to be checked. If your organisation gets a request to remove someone’s data, you can remove their record and the rows in the associated tables; but how do you know that their data hasn’t leaked into other tables? Maybe you have a data document that is saved whenever an employee interacts with that customer? What if you decide to mask your data for development work, and it turns out that a customer can be identified from a data document containing ‘associated details’ in an address table?

The managers of any organisation are legally obliged to know where personal data is stored, and that it is held responsibly and securely. They must allow customers, members, clients or whatever to inspect, amend and have it deleted. The organisation relies on you to ensure that this corresponds to reality.

Constraints in databases aren’t a luxury. Your job, and the health of the organisation that employs you, depend on ensuring that your data is as clean as automation can manage.

Validating a data-document column

Leaving JSON to one side for a moment, let’s take the simplest sort of data document: a list.

Let’s imagine that you have a list of postcodes. You want to insert it into a table to represent the delivery route of a carrier’s van. (We leave out all the other columns because they’re unlikely to be relevant.)

You can check a postcode such as 'CB4 0WZ' like this:
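
(A sketch: the six LIKE patterns below cover the standard UK postcode shapes, though not special cases such as ‘GIR 0AA’.)

-- check a single postcode against the six standard UK formats:
-- A9 9AA, A99 9AA, AA9 9AA, AA99 9AA, A9A 9AA and AA9A 9AA
SELECT CASE
  WHEN 'CB4 0WZ' LIKE '[A-Z][0-9] [0-9][A-Z][A-Z]'
    OR 'CB4 0WZ' LIKE '[A-Z][0-9][0-9] [0-9][A-Z][A-Z]'
    OR 'CB4 0WZ' LIKE '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]'
    OR 'CB4 0WZ' LIKE '[A-Z][A-Z][0-9][0-9] [0-9][A-Z][A-Z]'
    OR 'CB4 0WZ' LIKE '[A-Z][0-9][A-Z] [0-9][A-Z][A-Z]'
    OR 'CB4 0WZ' LIKE '[A-Z][A-Z][0-9][A-Z] [0-9][A-Z][A-Z]'
  THEN 'valid postcode' ELSE 'invalid postcode' END AS verdict;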

You can then create a function to test a whole list of these postcodes. Here is the code: I’m using the new string_split() function, but you can use one of the older splitting techniques instead.
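
(A sketch: the function name, dbo.IsValidPostcodeList, and the comma delimiter are my assumptions.)

CREATE OR ALTER FUNCTION dbo.IsValidPostcodeList (@TheList NVARCHAR(MAX))
RETURNS BIT
AS
BEGIN
  -- the list is valid only if no trimmed member fails all six patterns
  RETURN CASE WHEN EXISTS
    (
    SELECT 1
      FROM string_split(@TheList, ',')
      WHERE NOT (LTrim(RTrim(value)) LIKE '[A-Z][0-9] [0-9][A-Z][A-Z]'
        OR LTrim(RTrim(value)) LIKE '[A-Z][0-9][0-9] [0-9][A-Z][A-Z]'
        OR LTrim(RTrim(value)) LIKE '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]'
        OR LTrim(RTrim(value)) LIKE '[A-Z][A-Z][0-9][0-9] [0-9][A-Z][A-Z]'
        OR LTrim(RTrim(value)) LIKE '[A-Z][0-9][A-Z] [0-9][A-Z][A-Z]'
        OR LTrim(RTrim(value)) LIKE '[A-Z][A-Z][0-9][A-Z] [0-9][A-Z][A-Z]')
    )
  THEN 0 ELSE 1 END;
END;
GO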

With this function, we can now add a constraint to our table that checks the postcodes. If you insert a driver’s route that contains an invalid postcode, you get an error. Let’s try it out:
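
(The table name and the routes are invented for the demonstration.)

CREATE TABLE dbo.DeliveryRoute
  (
  DeliveryRoute_id INT IDENTITY PRIMARY KEY,
  -- the comma-separated list of postcodes that makes up the route
  PostcodeList NVARCHAR(MAX) NOT NULL
    CONSTRAINT ValidPostcodeList
      CHECK (dbo.IsValidPostcodeList(PostcodeList) = 1)
  );
GO
-- this succeeds: every member of the list is a well-formed postcode
INSERT INTO dbo.DeliveryRoute (PostcodeList)
  VALUES ('CB4 0WZ, CB21 5DQ, CB3 9ET');
-- this errors out: 'CB4 OWZ' has a letter O where a digit should be
INSERT INTO dbo.DeliveryRoute (PostcodeList)
  VALUES ('CB4 OWZ, CB21 5DQ');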

Try these out, altering the postcodes slightly, and you’ll see when one or more becomes invalid.

Now, with this function, we can check the list to make sure that every member of it is a valid postcode.

What if your developers are storing such a thing in a JSON string rather than a list? This is probably an improvement, because a JSON array can store null values in it and the parsing is quicker.

We can check it just as easily. First, we create the function to do it:
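
(A sketch, assuming the postcodes arrive as a JSON array of strings; OPENJSON needs database compatibility level 130 or higher.)

CREATE OR ALTER FUNCTION dbo.IsValidPostcodeJsonList (@TheJson NVARCHAR(MAX))
RETURNS BIT
AS
BEGIN
  -- it must at least be a valid JSON document
  IF IsJson(@TheJson) = 0 RETURN 0;
  -- the list is invalid if any non-null element fails all six patterns
  RETURN CASE WHEN EXISTS
    (
    SELECT 1
      FROM OpenJson(@TheJson)
      WHERE value IS NOT NULL -- nulls are allowed in the array
        AND NOT (value LIKE '[A-Z][0-9] [0-9][A-Z][A-Z]'
          OR value LIKE '[A-Z][0-9][0-9] [0-9][A-Z][A-Z]'
          OR value LIKE '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]'
          OR value LIKE '[A-Z][A-Z][0-9][0-9] [0-9][A-Z][A-Z]'
          OR value LIKE '[A-Z][0-9][A-Z] [0-9][A-Z][A-Z]'
          OR value LIKE '[A-Z][A-Z][0-9][A-Z] [0-9][A-Z][A-Z]')
    )
  THEN 0 ELSE 1 END;
END;
GO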

Now that we have the function, we can use it in an almost identical table:
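
(Again, the names are invented.)

CREATE TABLE dbo.DeliveryRouteJson
  (
  DeliveryRouteJson_id INT IDENTITY PRIMARY KEY,
  PostcodeList NVARCHAR(MAX) NOT NULL
    CONSTRAINT ValidPostcodeJsonList
      CHECK (dbo.IsValidPostcodeJsonList(PostcodeList) = 1)
  );
GO
-- a null in the array is fine; a malformed postcode would not be
INSERT INTO dbo.DeliveryRouteJson (PostcodeList)
  VALUES (N'["CB4 0WZ", null, "CB3 9ET"]');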

All we are doing is iterating through the values, checking each one in turn.

We’re still coping as we move on to examples that are more complex than a simple list. The JSON could, of course, be anything; but if it represents tabular data, then it is easy.

Imagine we have to validate the JSON before we insert it. Imagine, too, that you have JSON consisting of a name, a gender, a birthdate, a GUID and a modification date.
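
Something like this, perhaps (the person, the GUID and the dates are invented for the demonstration):

[
  {
  "Name": "Jane Smedley", "Gender": "F", "DOB": "1989-02-14",
  "guid": "0E984725-C51C-4BF4-9960-E1C80E27ABA0",
  "Modified": "2018-11-23T14:32:00"
  }
]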

We can check this lot out easily enough, though the code might raise a few eyebrows. First we create a function. It starts by checking, just as the previous one did, that it has a valid JSON document. Then it makes sure that each object in the array has the correct number of key/value pairs, corresponding to the number of columns in the equivalent relational table. It then creates a table variable with the correct SQL Server datatype for each column and all the constraints, including NULL/NOT NULL, that you would apply if the JSON data were a relational table, and inserts the data into it. If the JSON can’t be coerced into the right datatype, or if one of the constraints fires, the function ‘errors out’ and aborts the insertion, so the error appears at the point at which the JSON was being inserted into the real table.
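
(A sketch, assuming the five keys shown above and, for this demonstration only, that Gender is recorded as ‘M’ or ‘F’.)

CREATE OR ALTER FUNCTION dbo.IsValidPersonJson (@TheJson NVARCHAR(MAX))
RETURNS BIT
AS
BEGIN
  -- it must be a valid JSON document
  IF IsJson(@TheJson) = 0 RETURN 0;

  -- every object in the array must have exactly five key/value pairs,
  -- so that nothing extra can be smuggled in alongside the five columns
  IF EXISTS
    (
    SELECT 1
      FROM OpenJson(@TheJson) AS TheArray
        CROSS APPLY
          (SELECT Count(*) AS KeyCount FROM OpenJson(TheArray.value)) AS f
      WHERE f.KeyCount <> 5
    ) RETURN 0;

  -- the dress rehearsal: a table variable with the datatypes and the
  -- constraints we would use if this JSON were a relational table. A
  -- failed coercion or a fired constraint raises an error that aborts
  -- the INSERT statement that invoked this function.
  DECLARE @DressRehearsal TABLE
    (
    Name VARCHAR(40) NOT NULL,
    Gender CHAR(1) NOT NULL CHECK (Gender IN ('M', 'F')),
    DOB DATE NOT NULL CHECK (DOB > '19000101'), -- a plausible lower bound
    guid UNIQUEIDENTIFIER NOT NULL,
    Modified DATETIME2 NOT NULL
    );
  INSERT INTO @DressRehearsal (Name, Gender, DOB, guid, Modified)
    SELECT Name, Gender, DOB, guid, Modified
      FROM OpenJson(@TheJson)
      WITH
        (
        Name VARCHAR(40) '$.Name',
        Gender CHAR(1) '$.Gender',
        DOB DATE '$.DOB',
        guid UNIQUEIDENTIFIER '$.guid',
        Modified DATETIME2 '$.Modified'
        );
  RETURN 1;
END;
GO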

In effect, you’ve done a dress rehearsal of inserting that data into a relational table, with all the constraints that are available to you.

With this function, we can now insert JSON data into the table without cringing, because it will all be checked. It is now up to us to get those constraint definitions right! First we create the test table, leaving out everything except what is relevant to the demonstration.
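
(The table and constraint names are, again, my own.)

CREATE TABLE dbo.PersonJson
  (
  PersonJson_id INT IDENTITY PRIMARY KEY,
  TheJsonData NVARCHAR(MAX) NOT NULL
    CONSTRAINT ValidPersonJson
      CHECK (dbo.IsValidPersonJson(TheJsonData) = 1)
  );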

Now we insert a row:
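
(The values are the illustrative ones from earlier.)

INSERT INTO dbo.PersonJson (TheJsonData)
  VALUES (N'[{"Name":"Jane Smedley","Gender":"F","DOB":"1989-02-14",
  "guid":"0E984725-C51C-4BF4-9960-E1C80E27ABA0",
  "Modified":"2018-11-23T14:32:00"}]');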

OK. That went well. Now let’s try a record that has something wrong with it:
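
(An impossible gender this time; Ben Miller’s other details are invented.)

INSERT INTO dbo.PersonJson (TheJsonData)
  VALUES (N'[{"Name":"Ben Miller","Gender":"B","DOB":"1972-06-30",
  "guid":"6F9619FF-8B86-D011-B42D-00C04FC964FF",
  "Modified":"2019-01-15T09:00:00"}]');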

Oooh! An error! It has spotted that Ben Miller has a gender of ‘B’: the check constraint on the Gender column of the table variable has fired, and its auto-generated name appears, truncated to something like ‘__Gende__’, in the error message.

So let’s try again, this time getting the gender right but the GUID (uniqueidentifier) wrong:
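
(The value 'not-a-guid' is deliberately unparseable, so the coercion to UNIQUEIDENTIFIER fails and the insertion is aborted.)

INSERT INTO dbo.PersonJson (TheJsonData)
  VALUES (N'[{"Name":"Ben Miller","Gender":"M","DOB":"1972-06-30",
  "guid":"not-a-guid",
  "Modified":"2019-01-15T09:00:00"}]');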

Right. Now we insert a larger JSON document into the table:
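
(Several invented people, all of them valid this time.)

INSERT INTO dbo.PersonJson (TheJsonData)
  VALUES (N'[
  {"Name":"Ben Miller","Gender":"M","DOB":"1972-06-30",
   "guid":"6F9619FF-8B86-D011-B42D-00C04FC964FF",
   "Modified":"2019-01-15T09:00:00"},
  {"Name":"Jane Smedley","Gender":"F","DOB":"1989-02-14",
   "guid":"0E984725-C51C-4BF4-9960-E1C80E27ABA0",
   "Modified":"2018-11-23T14:32:00"},
  {"Name":"Dick Turnip","Gender":"M","DOB":"1961-10-05",
   "guid":"1C9E2A44-77D0-4C2F-9A31-5B6F08D9A7E3",
   "Modified":"2019-03-08T10:15:00"}
  ]');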

Good. All went well.

Without that test in the function to make sure there are just five key/value pairs in each document, someone could slip extra data into the database: perhaps data that should never go in there, such as customer personal data. We have prevented this from happening. Watch what happens if we try:
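
(The extra 'CreditCard' key, and its value, are of course invented.)

INSERT INTO dbo.PersonJson (TheJsonData)
  VALUES (N'[{"Name":"Jane Smedley","Gender":"F","DOB":"1989-02-14",
  "guid":"0E984725-C51C-4BF4-9960-E1C80E27ABA0",
  "Modified":"2018-11-23T14:32:00",
  "CreditCard":"4929 0000 1111 2222"}]');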

…which gives the result …
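
(With the sketch function above, the error is along these lines; the exact names depend on your database.)

Msg 547, Level 16, State 0, Line 1
The INSERT statement conflicted with the CHECK constraint "ValidPersonJson".
The conflict occurred in database "<your database>", table "dbo.PersonJson", column 'TheJsonData'.
The statement has been terminated.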

Conclusion

As long as the JSON document being stored in a table is tabular in nature rather than hierarchical, we can deal with it in SQL Server. An array-in-array JSON document needs just light editing of the OPENJSON function call to access array elements rather than specifying the key. I’ve explained how to do this in another article.

There are a lot of advantages in using the weapons you know, and have to hand, when dealing with bad data. Constraints and coercion are good simple ways of ensuring that the data is correct.

If the JSON is hierarchical, then we are generally forced to deal with it by checking it against its JSON Schema. I do this via PowerShell, so it can’t be done at the point of insertion. It also requires the developers to be organized enough to provide you with an up-to-date JSON Schema. I’ve explained in another article how one can open up a hierarchical JSON document and investigate the values. This method can be used if you need to keep the checks ‘in-house’, but it is slow to debug, and it will need to be maintained if the JSON Schema changes as development proceeds.