Experiments with NEO4J: Using a graph database as a SQL Server metadata hub

NEO4J, the graph database, can be used to provide answers that are very tricky for relational databases, including providing diagrams to show how SQL tables relate to each other, and the shortest chain of relationships between two tables, as David Poole demonstrates

Contents

NoSQL for the RDBMS fan

I first heard the term NoSQL back in April 2010 at SQLBits VI Westminster during a session by Simon Munro. His session was certainly thought provoking, and I found his arguments compelling. The gist of them was “do we really need the overhead of an RDBMS for all of our data needs“?

The slides and presentation are still available on the SQLBits web site.

Since 2010, NOSQL solutions have certainly had a lot of press and generated much excitement in the developer community but, it has to be said, considerably less in the DBA community.

One of the points that Simon made during his presentation was that the NOSQL acronym means “Not Only” SQL. NOSQL solutions are not a replacement or successor for RDMBS systems, nor were they ever intended to be. They are useful tools to be used for specific purposes. They are to be used as part of an organisation’s data management solution and not as a total replacement for the existing solution.

Another point to bear in mind is that, by definition, the term NOSQL describes every data storage and retrieval mechanism that is not a traditional SQL database so they represent a very broad church with many hundreds of variations. A good starting point is to read Kristof Kovac’s blog piece on some of the larger players in the market place.

Graph databases

Of all the NOSQL databases, graph databases are the ones that fascinate me most. The idea is that, instead of tables specific to each type of element, there are nodes; and instead of fields representing entities, you simply have attributes and labels.

In the RDBMS world, customers and employees would almost certainly be stored in separate tables. Any relationships between a customer and an employee would then require additional tables to describe the relationship complete with foreign key relationships.

The employee hierarchy that represents the organisational structure would also require additional tables and foreign key relationships.

In a graph database there are merely nodes (sometimes called edges) and relationships (sometimes called vertices). To marry two different nodes together, the appropriate relationship can be created as a self-describing entity in its own right. There is no need for additional objects to facilitate that relationship.

Getting started with graph databases

Microsoft Research has a project called Trinity, which is a cloud-based graph database for which there is a growing body of information, but it is not yet available for use..

Neo4J is a popular open-source graph database and we will use this to demonstrate the power of graphs.

Installing Neo4J on Windows is straightforward.

  • Download from http://www.neo4j.org/
  • Run the resulting EXE file
  • Run the program from the Windows start menu
  • Click the Start button on the dialogue box shown below
    2001-img5.jpg
  • Open a browser and go to http://localhost:7474 as shown in the dialogue box and you are ready to begin.

First look at the Cypher language

Neo4J has four ways of interacting with the database

  • Using the API in a JVM-friendly language such as Java, Scala or Clojure.
  • Using the REST web service interface
  • Using the Cypher language in the web user interface
  • Using the Cypher language in the Neo4J command shell.

The API is analogous to the SQL Server Management Objects offering a richer set of features than the Cypher query language.

Using Cypher within the web user interface is a good starting point for learning about graph databases and get used to the different way of thinking required to get the most out of them.

To some extent, Cypher is a pictorial language in that the queries look like the pattern of nodes and relationships that you are trying to retrieve.

  • Nodes are described with rounded brackets
  • Relationships are described with square brackets

A simple Cypher query look something like the following:-

The MATCH statement is describing the node and relationship pattern we want to retrieve.

The RETURN statement is analogous to the SQL SELECT statement. In the example above it is like saying

So this cypher query simply retrieves all the nodes that have a directed relationship of any description with a node with a name attribute of “SalesTerritory “.

Cypher also has a number of keywords that have a direct equivalence with SQL which makes it a curiously familiar language.

  • WHERE
  • ORDER BY
  • LIMIT
  • SUM, COUNT, STDEVP, MIN, MAX etc
  • LTRIM, UPPER, LOWER, REPLACE, LEFT, RIGHT, SUBSTRING
  • DISTINCT
  • CASE

The Neo4J web user interface

The web user-interface is clearly laid out as shown below

2001-Neo4J_CREATE_01-54bbc090-8ef6-4f41-

The left -hand pane is a short-cut mechanism for running pre-canned queries and accessing help content.

The large top-panel allows Cypher queries to be entered where they can be run or saved into the left hand panel. Note that files containing queries can be dragged into the bottom of the left hand navigation highlighted in red.

One useful tip is to include a comment at the top of the query and save it using the STAR symbol on the top right of the query pane. For example saving the following will show “Delete everything” as a saved query in the left hand pane.

The bottom half of the screen shows the results of queries.

Creating nodes and relationships

A word of warning before we start: The web user interface where we will be entering our Cypher queries is fussy about white space. Always leave a space between the query clause and the bracket that denotes the node.

So let us start with some entities which we are going to represent as nodes.

  • Red-Gate software – Company
  • Simple-Talk – website
  • SQLServerCentral – website
  • David Poole – amateur writer
  • Steve Jones – Professional writer and editor

The Cypher to create these five entitles would be as follows:-

  • MERGE (:Company{name:"Red-Gate software"})
  • MERGE (:Website{name:"Simple-Talk",url:"http://www.simple-talk.com"})
  • MERGE (:Website{name:"SQLServerCentral",url:"http://www.sqlservercentral.com"})
  • MERGE (:writer{name:"David Poole",category:"Amateur"})
  • MERGE (:writer:editor{name:"Steve Jones",category:"Professional"})

This can be copied and pasted into the top pane of the web user interface as shown earlier.

We can now start to create the relationships. To do this we use the MATCH statement to retrieve two nodes and then create a relationship between them

We are being quite explicit in our MATCH statement.

  • Retrieve a node with a label of “Company” and whose name attribute is “Red-Gate software”
  • Retrieve a node with a label of “Website” and whose name attribute is “Simple-Talk”

Given that the names are unique we could also have written our MATCH statement as …

Or, instead of selecting the attributes in our MATCH statement, we could have written the query as follows:-

This is fine if we want to create our relationships one-by-one: but what if we want to create our nodes and relationships en-masse? We can do this by executing the script below.

By giving our nodes identifiers (a, b, c, d & e ) when we create them, we can then use those identifiers further down our script to create the relationships

By running the following Cypher script, we will get a graphical representation of the items we have just created.

This simply retrieves every node and returns it.

The difference between the CREATE and the MERGE statement

It seems strange not to use a CREATE statement to create things. In fact we could have done so by replacing the MERGE statement with CREATE in our script and got exactly the same result.

The key differences are as follows:-

  • MERGE will create a node or relationship only if there is no identical node or relationship.
  • CREATE will simply create nodes and relationships irrespective of whether identical ones already exists.
  • CREATE can create multiple nodes and relationships all at once. For example we could create the relationship between Red-Gate software, Simple-Talk and SQLServerCentral in one command.

The Neo4J command shell

Neo4J does possess a command shell to give the equivalent of SQLCMD. Although this is not included in the community edition installer it is available for download.

To get this working with my existing installation of Neo4J I extracted the Neo4JShell.bat file from the downloaded ZIP file into my \Neo4j Community\bin\ folder.

After some experimentation I found that I had to edit Neo4JShell.bat

  • If the batch file complains that it cannot find a JAR file then take the following steps
  • Open the batch file in your preferred editor
  • Navitage to the line that appears as follows
  • Replace the entirety of %CLASSPATH_PREFIX and %CLASSPATH% with the path and JAR file name of neo4j-desktop-2.0.1.jar. In my particular case the line was edited to the following: –

Differences between the command shell and the Web UI

In both cases the batch terminator is a semi-colon character.

The Web UI can only execute a single batch even when it is running an imported file

The shell can run multiple batches provided each batch is separated with the semi-colon batch terminator.

The semi-colon batch terminator is mandatory in the Neo4J shell

Using Neo4J in conjunction with SQL Server

The main strength of a graph database is its ability to traverse the graph to find patterns. This includes the ability to find the shortest route between two points.

One of the challenges facing users of a data warehouse is to understand how the different tables within the warehouse relate to each other.

A graph database can solve this problem easily if we copy the table and foreign key metadata into a Cypher script.

Creating the nodes

For any given table we could set up nodes to record the following attributes:-

  • Schema name
  • Table name
  • Table description – As stored in the MS_DESCRIPTION property.

As we will want to form the relationships we will also want to attach an identity to the node and assign different labels to the objects.

Using the Adventureworks database as the basis for creating nodes a typical node creation statement would appear as follows:-

We can use a SQL view to generate the Cypher script we need from two system tables

  • Sys.tables
  • Sys.extended_properties

The key points for the query are as follows:-

  • We use a common table expression to separate out data retrieval from data presentation
  • Not all tables will have an extended property but when they do, we are only interested in the MS_DESCRIPTION property for the table and not any field descriptions represented by minor_id 1+
  • The Cypher node identifier is a concatenation of “n” + the SQL Server object id because we need to reference that identifier to create the node relationships.

I have written the script to use the CREATE statement rather than the MERGE statement simply because any changes to the MS_DESCRIPTION property will cause MERGE to create a new node thereby limiting its effectiveness. Repopulating the graph is best done by deleting everything and starting again.

Creating the relationships

The Cypher to create the relationships will appear as follows:-

The query to generate this can be generated straight from the system view sys.foreign_keys . Again we can encapsulate the query in a view.

To get our full Cypher script we could simply wrap both views up in a master view separated by the UNION ALL statement.

Querying the graph

To get a view of the entire graph simply run the “Get Some Data” command from the left hand pane of the Neo4J user interface. This simply issues the following command

This will produce a graphical representation of Adventureworks similar to the one below.

2001-NEO4J_Adventureworks_ALL-f6192d64-e

This is certainly picturesque but the real power of the graph becomes apparent when I run the query below

In this query we are specifying two tables and asking the graph to find the shortest chain of relationships between the two up to a maximum of 15 links. The pictorial representation is shown below.

2001-NEO4J_ShortestRelationship-7d560af4

By clicking on the “Customer” node a dialogue box will appear as shown below.

2001-NEO4J_NodeProperties-a2b00121-3937-

Here we can clearly see the properties we extracted from the Adventureworks schema.

Automating the import

The existence of the Neo4JShell.bat makes automation a straight-forward process.

The first stage is to create a SQL view that will return the Cypher that will remove all nodes and relationships from the graph

Note the batch terminating semi-colon at the end of the generated Cypher command.

I create a further view to generate all Cypher commands required to enable the Neo4J shell to function

As mentioned earlier a batch terminator is mandatory in the Neo4J shell hence the final SELECT in the above view.

I created a directory called c:\data\ and used BCP to pipe the output from this view to a file.

We are now ready to put together the constituent parts we need to build an automated metadata graph.

Within my c:\data\ directory I created a Windows command file as follows:-

This successfully extracted the desired Cypher commands to a file and then executed those Cypher commands in the Neo4J shell.

Data warehouse concerns

The technique that we’ve described so far relies on the SQL database containing foreign key constraints.

In SQL Server, foreign keys can be disabled during the load, and the ETL process will then handle the data referential integrity enforcement. The syntax to do so is as follows.

I have seen data warehouse implementations where the foreign key relationships were left out altogether and therefore the techniques described in this article would not work.

I would take the view that foreign keys provide useful metadata, even if they are disabled and have no functional affect in the warehouse.

In database platforms such as HP Vertica and Amazon Red-Shift, foreign key constraints are purely metadata items. Note that HP Vertica does have a function that allows the retrospective identification of constraint violations.

Avoiding graph anti-patterns

It is possible for the relationships themselves to have properties so in our Adventureworks example it would be tempting to add details of the foreign key relationships to our Neo4J relationships.

This would be an anti-pattern

Properties on Neo4J relationships are intended to model the weight or importance of a relationship.

If we wanted to represent foreign key relationships in our Neo4J model and have a richer set of informational attributes then it would rather suggest that we should represent foreign keys as nodes in their own right.

Working with other database platforms

The technique described here will work with any SQL database that allows its table and relationship metadata to be queried.

MySQL Example queries

MySQL requires a slightly different query as it does not have the concept of an object id property or schemas within a database. In this case we will use the table name as an identifier.

Note that in MySQL, every property including the table description is held in viewINFORMATION_SCHEMA.TABLES.

HP Vertica Example queries

HP Vertica is a platform where there is a single database instance containing one or more schemas.

It also has a rich set of system tables in a schema calledv_catalog . These are similar to the SQL Server sys views.

Concluding thoughts on Neo4J

I have been impressed by the ease of use in setting up Neo4J and getting it working.

I was very struck by the fact that, with a very simple use-case, I soon got tangible benefits. This gave me a sense of the considerable potential of graph databases.

The Neo4J online documentation and tutorials are of a good standard and sufficiently clear to encourage experimentation. The inclusion of the “movies” database in the distribution gives a sufficiently varied dataset to try out most of the functionality on offer.

The web user interface, with its pictorial representation of the Cypher query-results, is a compelling way of demonstrating the effectiveness of a graph databases. Data can be a dry subject, so to have a user-interface that adds zest can only be a good thing.

I will certainly be carrying out further experiments with the technology.