SQL Graph Objects in SQL Server 2017: the Good and the Bad

Graph databases are useful for certain types of database tasks that involve representing and traversing complex relationships between entities. These can be difficult to do in relational databases and even trickier to report on. Until now, we have had the choice of doing it awkwardly in SQL Server or having an ancillary database to tackle this type of task. SQL Server 2017 will be bringing graph capabilities to the product but will these features prove to be good enough to allow us to dispense with specialised Graph databases? Dennes Torres decided to find out.

SQL Server 2017 will be bringing us support for some of the functionality of graph databases. Graph databases are unlike relational databases in that they represent data as nodes, edges and properties within a ‘graph’, which is an abstract data type representing relationships or connections as a set of vertices nodes, points and edges, like a tangled fish-net. They allow us to represent and traverse relationships between entities in easier ways than is possible with the traditional relational database.

Graph objects are built to represent complex relationships. A hierarchy is a special case of a graph and is useful to record such things as relationship between forum posts and their replies, likes in forum posts, and friendship between people. Hierarchies have a root node (forum post and replies, for example), but many graphs don’t have a root node (people friendship, for example).

In this article, we’ll be building a forum example using the new graph model. I will also compare the complexity of the queries between the graph and a relational model and calling your attention for what’s still missing in this first version. Yes, it’s a Beta technology in its early stages.

Demonstration Environment

I’m using SQL Server 2017 CTP 2.1 which you can download here: https://www.microsoft.com/en-us/sql-server/sql-server-2017

I’m also using SSMS v. 17.0 which you can download here: https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms

Creating the models

This is the Entity-relationship model I will use for comparison:

If you would like to follow along and try out some comparisons, you can create this entity-relational model with the script below, or you can go directly to the graph model. However, you need to create the new database, ‘GraphExample’, using SSMS.

The Graph Model

The planning of a graph model is quite different than the relational model. Tables in a graph model can be Edges or Nodes, we need to decide which tables will be edges and which tables will be nodes.

This image illustrates our graph model:

The decision between nodes and edges in our model is simple: All the entities in a logical model will be nodes, while all the relations in the logical model will be edges. In our model, we have ‘Posts’ and ‘Members’ as entities, ‘Reply To’, ‘Like’ and ‘Written By’ as relations.

Architectural Notes

Nodes and Edges are nothing more than tables with some special fields. There’s no restriction that forbids us to create regular relationships between these tables, turning the model into a mix of relational and graph model.

For example, the relationship ‘Written By’ between ‘Posts’ and ‘Members’ is a simple one-to-many relationship. Instead of using an edge table, we could express it with a regular relation between these tables, creating a mixed model.

When we create a node entity, the entity receives a calculated field named ‘$node_id’. We could use this field as a primary key, SQL Server allows calculated fields to be the primary key of a table: However this field is a JSON field, which is not a good option for primary key for many reasons. For this reason, our nodes have to maintain two keys: The business key, an integer, and the ‘$node_id’ key, an auto-generated JSON key that also contains an integer.

This will be the script for our node entities:

Architectural Notes

After creating the objects, you can then examine the objects in Object Explorer. You may notice a new folder called ‘Graph’ inside the ‘Tables’ folder. All graph objects will be inside this folder.

You may also notice the name of the auto-generated fields. Although we can reference these fields with their short-name, for example, $node_id, the real name of the field includes a GUID. This short name is a pseudo-column and we can use it in our queries.

It is simple to insert the values into the node tables: We just ignore the pseudo-column ($node_id) and write simple INSERT statements. These will be our INSERTs:

You will see the result of SELECTing the records from the ForumPosts table in the image below. You may notice the field $node_id, a JSON field including the type of the entity and an integer id that is automatically generated.

Creating the Edge tables

The edge tables are very simple to create. Edge tables can have properties that would be regular fields in the table, however that’s not part of our example. This will be the script to create our edge tables:

Each edge table has three pseudo-columns that we need to deal with:

  • $edge_id: The id of the edge record
  • $from_id: One of the nodes in the edge record
  • $to_id: The other node in the edge record

Noticing this definition, you may be asking: “Oh, ‘One of the nodes’ is quite vague, isn’t it?”. Yes, sure! That’s an important point: We need to define, in a logical way, what the $to_id and $from_id fields for each edge will mean. You may also be noticing that the name of the edge tables already defines the $to_id and $from_id sides. This is a logical choice of the sides and a good use of the names to make things easier for us.

These will be our logical definitions:

Written_By:

$from_id will be the post

$to_id will be the member

Likes:

$from_id will be who likes

$to_id will be who/what is liked

Reply_To:

$from_id will be the reply to the main post

$to_id  will be the main post

There is no technical restriction about these choices, but we need to keep them when inserting new records, never mixing the meaning of each side of the relation.

Architectural Notes

Besides the three pseudo-columns, all edge tables have some more fields, all of them are hidden fields.

We can see the hidden definition in the field properties and these hidden fields will not appear in query results.

Inserting Edge records

The insert statement for the edge tables needs to fill both sides of the edge, $From_id and $To_id. These fields need to be filled with the $node_id value of the record of each side.

For example, to relate a member with a forum post, the ‘Written_By’ record will have the $node_id value of the post in the $From_id field and the $node_id value of the member in the $To_id field.

These will be the inserts:

Architectural Notes

It’s complex to build INSERTs like these, isn’t it? We will want to use an object framework with support for graph objects in the database. Today entity framework hasn’t got this feature, but we can expect to see something new about this in a while, either as a new feature for entity framework or as a new object model to map graph objects.

Let’s insert the replies records:

(SELECT $node_id FROM dbo.ForumPosts WHERE PostID = 3))

Finally, a let’s insert a lot of likes:

The ‘Likes’ edge illustrates the capabilities of the edge feature rather well. We just inserted several relations between members and posts, but we now decide that in the application a member can also like another member. No problem, we can use the same edge to relate one member to another. In the relational model we need two tables for this, in the graph model, only a single edge.

Let’s insert more likes, now between members of the forum:

The Good: Building queries over the Graph Model

T-SQL brings some new syntax elements into use to query graph tables. We have a special statement to relate nodes and edges in a SELECT.

Let’s do a walkthrough to build a query that retrieves all the posts and their replies:

  1. We will retrieve two posts in each record, the post and the reply, so we will need two references to the ‘ForumPosts’ table in the FROM clause of the select. Let’s do it with some meaningful aliases:

    Although we can, of course, use any alias we choose, it is best to use meaningful aliases when working with graph objects.

  2. We need the relation between the posts, the relation is the table ‘Reply_to’.

  3. In the WHERE clause, we need to relate all the tables. We can do this with the new MATCH clause:

    The syntax is interesting: A single “-” (dash) means a relation with the $From_id field of the edge, and a dash and greather-than (->) means a relation with the $To_id field of the edge

  4. Knowing which alias has the reply and which alias has the replied post, we can build the field list of the query.

    The same query in the relational model would be like this:

    These queries are very similar, although it’s possible to argue that the MATCH syntax is easier to understand

  5. We can already execute this query. The result will be like the image below:

  6. Let’s include the name of the member who wrote the replied post. We need to include the ‘ForumMembers’ node and the ‘Written_By’ edge in the FROM clause. The new FROM clause will be like this:

  7. Include the relation inside the MATCH clause:

  8. Include the member name of the replied post in the field list of the SELECT. The final query will be this:

    The same query against the relational model:

    The image below shows the result:

  9. We still need to include the member name of the reply post. Again, we need to include the ‘ForumMembers’ and the ‘Written_By’ tables in the FROM clause:

  10. The next step is correcting the MATCH clause. ‘ReplyMember’ needs to be related to ‘ReplyPost’, however, how to do this relation without breaking the others? Let’s do in a different direction:

    Notice the <- symbol. It’s in opposite direction than ->, however it’s the same meaning: a relation between $to_id of the edge table with the node table.

  11. Finally, let’s include the name of the member who wrote the reply in the field list and execute the query:

    The result is the following:

    The same query against the relational model:

    At this point you may be noticing that according to the increasing of the number of relations the WHERE clause in the relational model will be much harder to read than the MATCH clause in the graph model.

    Let’s see some more interesting and useful queries against the graph model.

Count of Replies of each Post

In this one we have the count of replies in each post, however only in a single level, not in a tree of replies.

List of Root Posts

We can get all the root posts without the MATCH syntax:

The MATCH syntax only allows us to relate three or more entities (two nodes and one relation). When we want to relate only two of them, we can do a regular join or sub-query, as the above.

Including a Level field in the result

It can be useful to include a ‘Level’ field to expose the structure as a tree. We have a syntax for this in T-SQL: The recursive CTE (Common Table Expression).

However, there is a catch: We can’t use the MATCH syntax over a derived table, in this case the CTE. We can, if needed, use the MATCH syntax in the CTE, but we can’t reference the CTE in it. This is a limitation.

We can, however, use the recursive CTE syntax without be MATCH syntax, using regular relations. The result will be this:

Retrieve all replies of a single post

Using the recursive CTE syntax, we can retrieve all the replies of a single post in a tree style. If we try to retrieve all replies of post 1 using a regular syntax we won’t retrieve the post 3 and we should, because post 3 is a reply to post 2, which is a reply to post 1.

We can only retrieve post 3 when querying the replies of post 1 using the recursive CTE syntax.

This will be the query to retrieve all replies of a single post:

We can also do it in reverse, in order to retrieve all parent posts of a single post in a tree structure. For this we need to do the relations inside the CTE to retrieve the posts and at the outside do some OUTER JOINs to retrieve the parent of each post. It needs to be an OUTER JOIN because the main post has no parent. It also needs to be in the outside of the CTE because the recursive part of the CTE doesn’t support OUTER JOINs.

This brings another point to our attention: There isn’t an OUTER in the MATCH syntax.

Recovering the posts of a single user

Recovering the information of a single user, instead of a post, is easier because there isn’t a tree in this situation:

You may notice the only difference between the queries above is the field list and the use of DISTINCT. The DISTINCT is needed because Peter can reply the same post more than once.

Retrieving the Likes from the model

The queries to retrieve the ‘Likes’ are interesting: The ‘Likes’ edge has relations between members and between members and posts. Each of these selects brings only one of these kinds of relation and ignores the other.

It is also easy to aggregate the information to get the total of likes of each post or each member.

Members who liked and replied to the same post

We can build even some more interesting queries. For example, who are the members who liked and replied to a post?

This is the query:

You may notice the use of the node ‘Member’ two times in the same MATCH expression. This creates a kind of a filter: The member who liked the ‘LikedPost’ needs to be the same member who wrote the ‘ReplyPost’.

This one, over the relational model, would be like this:

It seems for me more difficult to build and read over the relational model.

Members who replied to multiple posts

Members who replied multiple times to the same post:

The only differences between the two above are the field list and aggregation.

Missing features for queries

You probably don’t like the requirement to use recursive CTE to retrieve a tree of nodes. That’s because a feature called transitive closure, usually present in graph databases, isn’t available in this first version.

The ability to find any type of node connected to a single node, for example, all replies and likes of a post, is called polymorphism and isn’t available in this version. We can use UNION to solve this for small sets of node types.

Some graph-specific functions, such as the ability to find the ‘shortest path’ between two forum members, isn’t available in this version.

The Bad: Edge records validation

As well as all the missing features, with a lot of room for improvement, I found problems with the validation of the relations inserted in the model. You may have noticed that it’s too easy to create an edge table. That’s because we don’t need to identify the possible relations of that edge.

This creates some validation problems. For example, the ‘Reply_To’ edge: A forum message can be a reply to another single forum message, however, I shouldn’t be able to insert the same forum message as a reply of two others.

The model doesn’t validate this. We shouldn’t be able to execute the insert below because the message ‘3’ is already a reply to message ‘2’. However, the insert will work:

After trying the insert, delete the inserted record:

How can we solve this problem?

The graph calculated field doesn’t support constraints. The only solution for this problem is the creation of triggers over our edge tables. In this example, we will need a trigger for INSERT and UPDATE.

The following trigger will solve the problem:

Try the INSERT again and this time you will the error message:

Deleting Parent records with Children

This is another interesting problem. The relation isn’t enforced, so we can delete parent records even when they have children. You can try this:

You will notice that the result in the SELECT will have several NULL values because of the deleted record. We can’t even insert the record again, because the relation is done by the $node_id value, which is auto-generated.

Undo the change with a rollback:

We can’t solve this with a simple relationship, such as foreign key constraint, first because the graph pseudo-columns don’t support constraints and also because this would be an impossible solution for and edge such as ‘Likes’, which accepts multiple types of nodes.

Again, the solution is a trigger:

After creating the trigger, execute this block of code again and you will see the error message:

One Forum post shouldn’t like another forum post

The ‘Likes’ edge is much more flexible, however has also its problems. We shouldn’t be able to insert a like between two forum posts, but the model will accept it. The following insert will work:

After trying the insert above, delete the inserted record and any like between posts:

Again, the solution will be a trigger. However, this time we need to check the type of record inserted in the relation. We can use the function JSON_VALUE to extract this information. The JSON_VALUE function can extract information from any JSON field, including the calculated graph fields.

For example, the following SELECT will extract the type of each record inserted in the $from_id field part of the like:

We can use this function in a trigger for insert/update. The trigger will be this:

After creating the trigger, you can try the insert again, it will not be accepted:

Missing Validation Features

All these problems should be solved with some kind of validation syntax on the creation of edge nodes. We should be able to constrain the types of entities that are allowed in the edge node and whether we will allow more than one record with the same entity.

This syntax would avoid the need for triggers. The support for a range of constraints would make the system much more robust.

Conclusion

It’s a first version, SQL Graph in SQL Server 2017 is very promising, even though there are a lot of missing features at this stage. There is plenty to do before this becomes usable and robust, but there is enough so far to be hopeful that Microsoft can deliver a fully-functional graph database within SQL Server

  • At the moment, there is little or any information about how soon we can expect to see the tools that a development is likely to need. We need an object model with support for this, Entity Framework doesn’t support these kind of database objects and I’m not sure if the best option really is to join relational and graph models in the same object model.
  • We need a tool to design this model. Besides the needed support for validation in the syntax of edge nodes, a tool to design this, such as database diagram is for relational models, will be very useful.

After reading this article, probably would like to check:

  • The documentation and architecture of this model. This link will be useful
  • More examples. This link will be useful, it’s a bigger sample and the only link I found talking also about the performance of the graph model
  • More references and other links. This summary of several references will be useful link.