SQL For Cosmos DB – Handling Complex JSON Structures

The JSON documents that are part of Cosmos DB document collections can be complex with arrays and nesting. In this article, Adam Aspin shows you how to query them with SQL.

The series so far:

  1. Introduction to SQL for Cosmos DB
  2. SQL For Cosmos DB – Tips and Tricks
  3. SQL For Cosmos DB – Handling Complex JSON Structures

JSON allows for nested nodes, arrays and arrays of objects, and Cosmos DB SQL can handle all of these when reshaping the output data. Consequently, you will use a second collection when learning how to query more complex JSON documents

This means creating a new collection called complexcars in the Cosmos DB emulator—or even in Cosmos DB if you prefer. Once this collection has been created you need to load into it the documents named complex1.json, complex2.json and complex3.json. Be sure to review how to add the files from this article if you need help.

Each document in this collection looks something like this:

In this document, Customer and Address are subnodes of distinct objects inside the document. Salesdetails is an array of objects and CustomerComments is an array of multiple items.

Choose Attributes from Subnodes

Returning the contents of a node in a JSON collection is mercifully simple. All you have to do is to specify the node, like this:

The output is the selected node-exactly as it appears in the source document:

In practice, this approach can be a useful way of returning multiple elements.

Of course, you can make the output even more fine-grained and return selected attributes from a specific node:

The result in this case is as simple as it is predictable:

And it goes without saying that you can mix and match JSON attributes from varying levels in the hierarchy of nodes by running queries like this one:

In this case the result is:

 

While on this subject it is worth noting that:

  • You can mix from different levels.
  • You can drill down to any node merely by specifying the exact path down through the hierarchy of nodes.
  • You can return the contents of an entire node as well as selected attributes from other nodes using queries like this:

This query gives the following output:

You may remember from the previous article that you can use the ROOT keyword to indicate the collection. Well this is also possible when querying subnodes, like this:

Choose Elements from an Array of Objects

If a JSON document contains arrays of objects (as is the case for salesdetails in the sample documents) then you might need to extend the SQL slightly depending on how you want to display the data. Essentially you have a couple of possibilities:

  • Return the entire arrays of objects, that is, everything inside the array
  • Specify the item in the object that you wish to output
  • Flatten the output and return selected items from each object

Let’s look at each of these approaches in turn using the Salesdetails object in the sample document.

Returning the complete contents of the object is an extension of the technique that you saw previously when returning the complete contents of a node:

Executing this query produces the following result:

The output is truncated in this example, as all six line items from the source documents are returned by the query. However, once you see a couple of them the principle. Hopefully, is clear.

However, the structure can be simplified and returned as a less complex and deep array if you use the IN keyword, like this:

Here the result is subtly different:

Once again, only a subset of the output data is displayed here.

If you don’t want the entire contents of the object you can tweak the SELECT statement to isolate only the required attributes from the array.

Here, as you can see, you are returning only a subset of the items in each array:

What is interesting to note here is that you are using one alias (c) to refer to the collection and another alias (l) to refer to the object itself.

Moreover, you can count the number of objects in an object with an extension of the code you saw above. Here I am using the VALUE keyword to return the value without a JSON attribute name.

The result is simply:

Naturally, you can mix attributes from inside the object with attributes from elsewhere in the document:

In this case the result is:

Specifying the path to the arrays of objects is enough to return the entire contents of the array of objects.

Specifying the individual item inside an array of objects means tweaking the SQL and indicating the (zero-based) item that you want to see in the output, like this:

As you can see below, on this occasion you are only returning one object from the array:

Conversely, flattening the output to return all the items in an array involves using the JOIN keyword, and joining the document to itself-or more precisely to the array itself. If it helps, you can consider this as nearly equivalent to a table join in T-SQL only the second table is an arrays of objects inside the JSON document itself.

Here the output structure is decidedly different:

What is important here is to alias the arrays of objects as the focus of the JOIN keyword and use the IN keyword to identify the path to the arrays of objects in the document. If your document contains multiple arrays of objects that you wish to flatten, then you simply add further JOIN clauses.

Put another way, FROM defines the collection, and JOIN refers to the “inner document” (the array of objects) contained in the outer JSON document.

Handling Arrays

JSON also uses arrays to store items inside a document. The ComplexCars document contains an array named CustomerComments. You can return the contents of an array much like you output the contents of an object using the IN keyword.

This query returns the entire contents of the array-from all the documents:

Searching Inside an Array

Complex JSON documents can contain arrays of elements, and it is always possible that you may need to search inside an array for a specific item. Cosmos DB SQL lets you use the ARRAY_CONTAINS() function to handle this particular challenge:

This time only elements from a document where the array contains the specified text is returned:

Handling Schema on Read

As you may have already discovered (or doubtless soon will) one of the difficulties in a schema-free approach to storing data is that attributes not only do not always appear in JSON documents, but that the same attribute can have different names across the documents in a collection.

It follows that you will need to handle alternative attribute names in JSON documents. The classic way to prevent the lack of a schema causing erroneous output is to use the coalesce (double question mark) operator-like this:

The output is as simple as the query:

The coalesce operator is simple: if the first attribute is nonexistent, then the second one is used. As you can imagine, this operator can save you considerable grief through minimizing erroneous output. However, a few comments may help even further:

If you do not add an alias – $1, $2 etc. is used

You can extend coalesce operator to handle multiple attribute names by writing code like this:

You should look for empty braces in the output that indicate an unhandled attribute.

The challenge in these cases is, of course, discovering the duplicate attributes. This is explained a little further down in this article.

Dealing with Schema on Read in WHERE clause

Multiple attribute names for the same attribute does not only cause issues in the SELECT clause. You may need to filter on an attribute that has multiple synonyms in the document structures. Fortunately, this is not difficult to deal with-as the following SQL shows. Run this code against the simplecars2 collection.

Running this query produces the following JSON:

Avoid Missing Elements and Ensure a Complete Structure

Another variation on a theme of schema fluidity is the occasional need to ensure that the output JSON has a rigid and predictable format. Here again the coalesce operator can be used to guarantee that an attribute will appear in the result set even if it is missing from the source document. Note that in this query, I am using the simplecars2 collection that you saw in the previous article.

This query will return seven records, but here are two examples:

Checking Document Structure and Data Types

Cosmos DB SQL comes with a handful of functions that can assist in minimizing the risk of error that is implicit in the free-form structure of JSON.

Replace strings with 0

Unlike T-SQL, Cosmos DB SQL will not attempt to convert numeric text values to numbers. The direct consequence of this is that any attribute where the value is defined as a string cannot be used in a calculation as this will prevent the calculation from working (and the calculated attribute will not appear in the output). If you look at the single JSON document in the complexcars collection you will see that the first value for the LineItemDiscount attribute is “50” (in double quotes). This will prevent the attribute from being used in a calculation and will prevent any value from being returned.

One solution is to use the IS_NUMBER() function (with a little ternary logic) to replace numeric strings with zeroes, like this:

This query works beautifully, and gives the following result:

Clearly a conversion function would be ideal. Unfortunately, this is not directly available in Cosmos DB SQL. However, there is an alternative which is to write your own type casting function in JavaScript and use this instead. You can, of course, extend this technique to list all values that are stored as strings and therefore could mean that erroneous results are returned. The code for this is extremely simple:

Detect Inappropriate Data Types

Equally, there could be occasions when you need to detect a string data type. This is a simple call to the IS_STRING() function.

The query output tells you that at least one invoice has a string where you would expect a numeric value:

The remaining type detection functions that you may find useful are:

IS_BOOL

IS_ARRAY

IS_OBJECT

IS_PRIMITIVE (these can be string, boolean, numeric or null)

Shape Output JSON

Earlier in this article you learned how to “flatten” JSON. Inversely, there could be times when you need to create a different JSON structure for the output. Suppose, for example, you wish to take the flattened documents from the simplecars2 collection and display them in a nested format.

The following short piece of SQL shows how this can be done. Please note that I am deliberately not attempting to output all the attributes from the documents.

Here is the result:

Another way to obtain exactly the same result is (and I know that it looks a little weird) is:

Subqueries

As you saw in the previous article, Cosmos DB SQL allows you to use subqueries. One useful application of subqueries is to restructure the JSON in the output of a query. In the following case the subquery “flattens” the JSON structure in the complexcars document format-and the outer query then constructs a totally different document format.

The format this time is completely different, as you have two objects in the output. All in all, this is rather like a standard SQL JOIN.

Conclusion

This concludes your whirlwind tour of Cosmos DB SQL. The good news is that you have already seen a large number of the currently available functions, as well as gained a reasonable overview of the core approaches that you may need when querying JSON documents in Cosmos DB. In fact, there are now very few of the Cosmos DB SQL API functions that you have not seen in these two articles.

This very simplicity is, however, a two-edged sword. It is hard to deny that an experienced SQL programmer will feel frustrated at the absence of a range of functions that have been a fundamental part of T-SQL for years-if not decades. Moreover, it has to be said that Cosmos DB SQL – at least in its current incarnation – is not an analytical tool.

However, I prefer to concentrate on the positives, and to point out that once you have mastered Cosmos DB SQL you can use it to export flattened JSON to an SQL Server using a variety of data ingestion techniques and carry out analytics on a reduced data set in the relational or even dimensional engines that you are currently using. Consequently, I encourage you to think of Cosmos DB and its SQL API as an essential extension to the SQL Server universe, and the SQL it offers -however limited-as essentially an easy way in to the worlds of JSON, document databases, NoSQL and big data.