Introduction to SQL for Cosmos DB

This article by Adam Aspin reviews the Azure Cosmos DB SQL API from the perspective of the relational database developer. More specifically it will show you how to leverage your Structured Query Language skills to exploit the core possibilities of Cosmos DB as a NoSQL document database.

One of the interpretations of the term NoSQL is “Not Only SQL.” Whether this is a factual statement, or an aloof remark is open to debate. However, no one can deny that more and more non-relational data sources are being used in a range of environments. This often leads to both SQL and NoSQL technologies being used side by side despite the differences in both the structures of the data itself and the engines that store the information.

The inevitable challenge for traditional database developers and DBAs is the investment in time and effort required to master a new set of technologies and any new languages required to handle NoSQL data. Fortunately for SQL developers (and indeed for anyone with a decent grounding in SQL) the developers of Cosmos DB have spared a thought for the millions of SQL users who need a path into the brave new world of NoSQL. They have achieved this by providing an SQL API to access JSON-formatted documents stored in Cosmos DB.

This approach brings one instant and overarching advantage for SQL Server (or other relational database) programmers: you are instantly at home with the core language used to extract data.

However, JSON documents are far removed from relational structures and NoSQL document stores are very different beasts compared to relational databases. Consequently, the SQL used to query JSON documents is different in many ways to the conventional SQL that you use with SQL Server. Moreover, Cosmos DB SQL is severely limited compared to T-SQL Yet despite its restrictions, the Cosmos DB SQL API provides an easy path to understanding and exploiting document databases. Acquiring a working knowledge of how to query Cosmos DB could open up new horizons that enable you to integrate data from the relational and NoSQL worlds.

The Basics

Cosmos DB is a multi-model NoSql database. Currently it can handle three types of non-relational data:

  • Document databases
  • Graph databases
  • Key-value databases

Only one of these data models can be queried using SQL in Cosmos DB. This is the document database. Indeed, this is probably a good place to add that Cosmos DB SQL only concerns querying document databases. There is no DDL (data definition language) involved.

A document database is a non-relational database designed to store documents that have no fixed structure. The leading document databases use JavaScript Object Notation (JSON) as the format for structuring the data. The good news with a lightweight and easily modifiable approach like this, is that you are not bound by a rigid schema. The bad news is that you are not bound by any schema at all. This fluidity has many sterling advantages but comes at a cost when querying the JSON data. The two main challenges that you are likely to face as a SQL developer are:

  • Schema on read – Instead of the database schema being part of the database structure, any required structure is defined when the query is written.
  • Nested structures – JSON documents are objects that can contain the complete data describing a self-contained unit. They can be extremely complex and represent a structure that would require a series of tables in a SQL Server database.

Fortunately, Cosmos DB SQL has been adapted to help you overcome the challenges inherent in using a relational database query language to query non-relational data stored in the nested structures of JSON-as you will see in the subsequent article.

When Would you Need Cosmos DB SQL?

Before leaping in to the minutiae of a new or at least a slightly different-dialect of a programming language, you may be forgiven for wondering why you should need to make the effort to master it in the first place. A few potential reasons are:

  • Cosmos DB can store JSON documents at a scale that is truly impressive, and that makes it into an excellent repository when dealing with terabytes of JSON files. SQL Server’s capabilities as a JSON document store are completely overshadowed by Cosmos DB.
  • Cosmos DB can become an ideal complement to SQL Server as a JSON storage service. This is because SQL Server provides analytical capacities that are missing from Cosmos DB. In practice, you can extract finely filtered data from a vast store of documents in Cosmos DB and then load this (infinitely smaller dataset) into SQL Server tables as rowsets for in-depth analysis. This can avoid having to implement a totally different set of technologies to deliver analytics over JSON document stores.
  • You can use PolyBase in SQL Server 2019 to connect to Cosmos DB collections. In this case you will probably need some basic Cosmos DB SQL to flatten the JSON data so that it can be used as an external table.
  • Even if you rely only on Cosmos DB to analyze JSON data you will likely need this Cosmos DB flavor of SQL when writing JavaScript-based stored procedures and user-defined functions.
  • If you are connecting to Cosmos DB using ODBC you are likely to need to flatten the JSON data. This approach will inevitably require the use of Cosmos DB SQL.
  • You can use Cosmos DB SQL SQL API (and SQL over JSON document collections) as the data source in Azure Data Factory activities.

There are, doubtless many other reasons, but I hope that some of these will encourage you to start looking at Cosmos DB.

The Required Toolkit

You need one of two things to practice SQL in Cosmos DB:

  • A Cosmos DB account in Azure
  • The Cosmos DB Emulator which is beautifully explained in this Simple-Talk article.

For the purposes of this article, I will presume that you are using the Cosmos DB Emulator-although either the emulator or Cosmos DB in Azure will work equally well when learning to query JSON documents using SQL.

Sample Data

As this article is a gradual introduction to querying Cosmos DB documents, you will be using simple JSON documents stored in a single collection in Cosmos DB.

The JSON structure that you can use to practice the basics of SQL queries using Cosmos DB is completely flattened, and looks like this:

Using the Sample Data

The sample documents are in the attached zip (CosmosDBQueries.zip) file which you should install onto your C: drive in the C:\CosmosDB directory. The seven documents that make up the collection are in the simplecars subdirectory.

The first thing to do is to create a collection. I suggest using /CountryName as the partition key in this example. As creating databases and partitions is explained in the article referenced above, I will not go through it again it again here. Then you need to load the documents. As there are only a handful of them, you can just click the New Document button and paste the contents of each file into a separate document, and then save it.

There is also the Upload option for loading multiple files which you can use instead. However, this does not seem to work at the moment in the Cosmos DB Emulator.

Of course, in the real world, there are solutions that you will need to learn if you are loading hundreds of thousands of documents and terabytes of data. However, these techniques are out of scope for this simple introduction.

Basic Terminology

For the sake of clarity, there are a few basic definitions that should help you bridge the gap between SQL and the world of JSON documents:

  • A collection can be considered to be a database
  • A document equates broadly to a recordset (although the nested structure makes it more similar to XML)
  • An attribute is a field or column

These points of comparison are not destined to be taken too literally and are simply provided as an initial stepping stone to help with your initial understanding if you have never seen document databases before.

Basic SQL Queries

Begin with the simplest possible example. With the Cosmos DB Emulator running, enter and execute the following query:

You will see all the documents in the current collection returned as the output, each one similar to the sample document shown above.

As simple as this query may be, as it is the first query that you have created with Cosmos DB SQL a few comments are in order. Firstly, the data is returned as JSON, even if you are using SQL to query the documents. In effect, JSON is the data format. A second point is that you cannot specify the collection to query in the SQL. Consequently, you need to ensure that you open a query window from the required collection. However, you can use anything as the definition of the data source. The query could read:

When running simple queries, the source JSON is returned as it exists in the collection. You will learn how to shape the output JSON later in this article and in the second article in this series.

A final point to note is that, technically, you can also return a complete JSON document with this SQL:

Now to be a little more selective, try this short query:

This is the output you should see the following (here, obviously, is a truncated view of the output):

Executing this query only returns two of the available JSON attributes. What is worth noting here is that you must use the collection name or alias when referencing attributes. It is generally easier to use short aliases for the collection, like this:

The AS keyword when aliasing collections is optional. Attribute names are case-sensitive, although misspelling them will not stop the query executing it will prevent the attribute from being returned in the output.

Of course, you can add aliases to attributes:

Doing this will deliver the following JSON:

However, when aliasing attributes you need to remember that:

  • The AS keyword is optional
  • Aliases are case-sensitive.
  • Aliases must respect JavaScript naming conventions – you cannot start an alias with a number or other non-alphabetical character except the dollar sign or underscore

You cannot add an alias to * so you cannot write

I realize that the JSON specification calls attributes “members” and the data for each member an element, yet the use of the word attribute to describe the name of a piece of data is so prevalent I prefer to use it.

Cosmos DB SQL will return strings, numbers and dates exactly as they are stored in the JSON document – as you can see if you try this query:

The result will be:

Of course, Cosmos DB SQL can perform basic arithmetic:

As you can see below:

As a variation on a theme, you can always write SELECT clauses like this:

Indeed, you can mix and match the ‘alias dot attribute’ notation and the ‘alias square bracket and double quote’ notation (in which case there are no dots used to link the alias and the attribute name) inside the same query.

Simple WHERE Clauses

Time to move on to data selection. After all, Cosmos DB can store terabytes of JSON documents and allows you to scale queries to use more or less processing power so that you can balance the requirements of query time and cost (in the financial sense) to suit each individual requirement.

A Basic Text Filter

Suppose that you want to isolate all invoices where a Bentley was sold. This should not tax your SQL knowledge, and hopefully will reassure you that Cosmos DB SQL cleaves to basic SQL tenets:

You should see a result like the following:

Yes, you are using double quotes to enclose a string. This may come as a surprise to SQL Server developers (although possibly less so to users of other databases). To be fair, you can also use single quotes if you feel that you are otherwise betraying your SQL heritage.

Numeric Filters

Filtering on numbers in JSON attributes is probably what you would expect it to be. You could write a WHERE clause like this one to specify a precise figure:

Or like this one to define a numeric range:

Here, the result will be:

And now is the disconcerting part. You can filter on a field and include it in the SELECT clause at the same time-like this (although this will indicate the filter validity in the results):

This query will give the following result:

To conclude the subject of elementary WHERE clauses, you can also reassure yourself that:

  • The AND, OR, NOT operators function pretty much as they do in T-SQL
  • =, !=, >=, <=, <> operators allow you to specify threshold values in queries
  • You can use parentheses to create more complex query logic

Alphabetical Ranges

If you ever need to return documents based on an alphabetical range in an attribute, then one way of obtaining the desired result is to use this kind of SQL:

Date and Time Filters

Cosmos DB SQL does allow for some more advanced filtering methods in the WHERE clause. Although the possibilities are, it has to be said, considerably more limited than those offered by T-SQL, you can, nonetheless, include (and exclude) irrelevant JSON documents by filtering on date and time elements.

JSON does not have a date type, as such. Instead it uses a string. Consequently, you risk encountering dates in any of a myriad of formats. The upshot is that you will be delving into the specific string format to filter on dates and times.

The format in the sample data is in something close to the ISO 8601 format, which looks like this: YYYY-MM-DDTHH:MM:SS. This means that searching for a specific date and time means representing the datetime as an ISO string, in this way:

If you are looking to set an upper or lower boundary for a date or time you can use the standard comparison operators: >=, <=, <> or !=.

Using Date Ranges

It follows that defining a date range requires nothing more than a simple AND operator in the WHERE clause.

This returns data from three documents:

Or, possibly more elegantly, you can use the standard BETWEENAND construction:

This returns data from the same three documents:

Applying Time Filters

Filtering by time essentially means extracting the hour, minute and possibly second or millisecond elements from the date string. So, to find all cars sold between 8:45AM and 12AM you would write:

This should return data from two documents in the collection:

The lack of a schema in the JSON documents means that the date could be specified in a completely different string format, or even as a number (say in the YYYYMMDD format where the data shown above would be 20150131).

Clearly the requisite queries would have to be adapted to the date format used, in which case a theoretical query on a totally different collection (where the dates were stored differently) could read:

Yet again the ‘free-form’ nature of JSON means that you will have to display both ingenuity and mental agility in the queries that you write.

Traps for the Unwary

Cosmos DB SQL requires you to wrap non-standard attribute identifiers in double quotes (and square brackets). As I mentioned above, no dot notation is used for the collection alias in this case. If you want to output one of the collection’s built-in attributes, you need to write SQL like this:

It is also worth noting that you cannot use unauthorized elements such as leading numbers in aliases even when you enclose the alias in double quotes.

When filtering on texts that contain special characters you need to know that certain characters are escaped in Cosmos DB SQL. Consequently, this is what you need to enter to include these characters in a WHERE clause:

  • Single quote: \’
  • Double Quote: \”
  • Backslash: \\

There are, of course, many others, but I will let you delve into the documentation to find them.

Now that the basics have been covered (and hopefully you are reassured that your SQL skills can be applied to document databases) it is time to look at some aspects of Cosmos DB SQL in greater depth.

Sorting Output

Cosmos DB SQL also contains the ORDER BY clause, so you can write queries like this one:

It probably comes as no surprise that you will see a result set something like the following:

However, at this juncture a series of limitations appear on the horizon. For, although you can add the ASC / DESC keywords to an ORDER BY clause (as you can in T-SQL) the following restrictions apply here:

  • The ASCENDING and DESCENDING keywords are not supported
  • Sorting on multiple attributes is not possible
  • You cannot sort on an alias for an attribute – you must reference the attribute preceded by the collection alias and a dot as you do in the SELECT clause
  • You cannot order by anything other than a field name
  • You cannot sort by a computed value

Summarizing Results

Here are several examples for common aggregates:

Finding the Number of Documents in a Collection

The query to return the number of documents in a collection is a slight extension of standard SQL:

Here you are adding the VALUE keyword to return a number only and not a JSON structure. Consequently, the result looks like this:

You could also write this query as:

Very Basic Aggregations

If you want to aggregate data from a collection, you could try a query like this one:

The output is extremely simple:

However, some fairly profound restrictions appear here (at least with the current version of Cosmos DB).

  • There can only be one aggregation per query
  • No alias can be added to the aggregation if you are using the VALUE keyword
  • You will get an error if the query is a cross-partition query, and you do not use the VALUE keyword. Cosmos DB partitions (as you might expect) are a core storage and query acceleration design approach. However, creating partitions is a subject in its own right that I cannot go into here

Intriguingly, you can get around the forced use of the VALUE keyword when querying across partitions with the use of a subquery like this:

In this case, the output looks something like this:

You are probably wondering where the $1 attribute names come from. This is, quite simply, what happens when you do not use an alias for a subquery or a function in Cosmos DB SQL. Think of it as being similar to non-aliased columns in T-SQL appearing as (no column name) in the query output.

Using a subquery also allows you to use multiple aggregate values in a calculation. In the following example you can see the difference between the maximum and average sale prices:

As you can see, you do not necessarily have to alias the subquery:

Of course, you can filter the data set that the calculation is applied to:

This gives the corresponding output:

As you can see here, applying the VALUE keyword returns just that-a value-and consequently there is no attribute name in the output.

Shaping Output

Cosmos DB SQL allows you to shape the output of an SQL query in several ways – pretty much as ‘ordinary’ SQL does. Some of the techniques that you can apply are described in this section.

Returning the Top N Documents

Cosmos DB SQL imitates T-SQL when it comes to returning only a specified number of documents from a collection. This one does not even need me to show you the result of the query.

Concatenate Attributes

The Cosmos DB SQL CONCAT() function is simple yet powerful. Take a look at the following example:

This function allows you to join multiple fields and/or static values to produce a single output attribute-like this:

Replacing Values in the Output

Should you need to replace specific values in the resulting JSON, you can use the REPLACE() function. And, yes you can use multiple nested replace operations if you need to.

Here is the corresponding query output (from the final objects returned by the query):

String Functions – Simple Classics

Cosmos DB SQL has most of the basic string functions that you could require. Some of these are used in the following query:

Here is the result for a single document:

You need to be aware that:

  • SUBSTRING() is 0 based.
  • You can also use the LTRIM() and RTRIM() functions if you need to.

A Final Few String Functions

Cosmos DB SQL only has a couple of dozen string functions, and we have seen many of the most useful ones already. Should you need them, you can always add the following to your armoury:

REPLICATE()

LENGTH()

INDEX_OF()

Mathematical Functions

Cosmos DB SQL comes complete with a core set of mathematical functions.

Take a quick look at the result:

These are considerably more limited than their T-SQL equivalents, and only carry out basic rounding.

There are a few others such as ABS(), SQRT(), SQUARE() plus a small set of trigonometric functions should you need them.

Logic (Imitation CASE Statements) – Ternary Functions

ANSI SQL has the CASE statement. Cosmos DB SQL instead uses a ternary logic approach. That is, it uses the ? and : constants to break up the code into test ? true : false elements.

To see this in action, take a look at the following Cosmos DB SQL snippet:

This code tests for any of France, Germany or Spain in the CountryName attribute and returns EU if any of these elements are found-and Other if they are not found, as you can see, below:

Fortunately Cosmos DB SQL lets you nest ternary logic (as you may have done using the IF() function in Excel or the T-SQL IIF() functions).

Conclusion

Now that you have a grasp of the basics of Cosmos DB SQL, I hope that you will be tempted to consider using this powerful and scalable document database as a potentially game-changing extension to your relational database environment-and this particular approach to NoSQL as a valuable part of your skillset.

I am not denying that the SQL that is implemented on top of Cosmos DB is severely limited. Although you can often find workarounds to obviate the worst of its restrictions, you are essentially stuck with simple SQL queries that use SELECT, FROM. WHERE and ORDER BY. You will not find any GROUP BY or HAVING clause for instance, and queries are bound to a single collection. Its analytical capabilities are curtailed by the absence of windowing functions.

On the other hand, the sheer simplicity of the query language, as well as the ease with which you can adapt your current skills and learn to query JSON documents without having to learn a new framework and syntax are powerful points in its favour.

However, we have not finished with the guided tour of Cosmos DB SQL yet. A second article will explain some of the more advanced features of the language that can be used to cope with the more complex document structures that JSON allows. This article will also provide some hints and tips on working around some of the limitations inherent in querying JSON documents with SQL.