Product articles Flyway Database Deployments
Finding External References in Database…

Finding External References in Database Deployments

This article outlines the syntax of the three and four dot references that usually denote external database references, demonstrates how to find them either from a live database or by using a text search of a script, and explains some of the complicating factors that can lead to 'false positives'.

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.



One of the more frustrating aspects of building or deploying a database is an unexpected external reference. The deployment or build will break with an error at the point that it tries to create or alter an object that has an unfulfilled external reference, because the RDBMS will check all references. The process of mopping up after a failed deployment can be tedious, so it is best to ensure, before creating an object, that all the references that it makes can be resolved.

Similarly, you might need a build script, or a Flyway ‘baseline migration’ script, for a database that has external dependencies on databases to which your development machine or build server doesn’t have access. If the referenced objects aren’t there at the point of build, or when a migration is applied, you will get errors.

To avoid this, we need to get a list of the names and definitions of the externally referenced objects. Armed with this list we can, for example, create ‘stubs’ for each of externally referenced objects, so that we can avoid build failures caused by circular, or mutual, dependencies between objects in different databases, or synonyms and stubs when the references are cross-server.

I’ll show how to find these dependencies the ‘easy way’, getting them from a live database using the RDBMS system views, or the harder way when you need to do a text search for them in a set of migration files.

Three and four dot references

There are internal and external references that are made in the SQL of routines such as views, functions and procedures. External ones refer to other databases, on the same server or on another, linked, server. It isn’t always obvious where objects references lurk, but the references that actually prevent the database being installed, updated or built are external references that are coded as a dotted reference.

In most relational databases, a ‘dotted’ reference specifies the location of a named object. Two-part dotted referenced are commonly used to specify the schema within a database, because object names only need to be unique within a schema.

Most RDMBS also support three-part or four-part dotted references. A three-part reference generally refers to another database on the same server, and a four-part object reference is used to refer to a specific object in a specific database schema on a specific linked database server. The four parts, separated by dots (.), are as follows:

  • Server name: The name of the database server that hosts the database.
  • Database name: The name of the database that contains the schema.
  • Schema name: The name of the schema that contains the object.
  • Object name: The name of the specific database object that is being referenced (e.g., table, view, stored procedure).

For example, a four-part object reference to a table named Customers in a schema named Sales, on a database named MyDatabase, on a server named MyServer would be:

Although this is useful when working with several databases or servers, it can also make your SQL code less portable, with what is often known as ‘dependency hell’, because the references will be specific to other database environments that would have to change in synch. Even to build a static non-working system, the referenced objects would need to be there in development and test, as well as production.

The syntax for a four-part object reference will vary slightly between different RDBMS, but most support them, including:

  • Microsoft SQL Server
  • Oracle Database
  • PostgreSQL
  • MySQL (as of version 8.0.13)
  • IBM Db2
  • SAP HANA

Charting the external dependencies of a database

Besides the need to avoid breaking a build, or deployment, there are other good reasons for documenting external references very carefully beforehand. Where a group of databases need to be installed on the same server, the individual databases need to be built or updated in a particular order, known as the dependency order. The logic is that all databases with no dependencies are built first, then the databases that are dependent only on those ones already built, and so on.

Some systems are ‘designed’ with mutual dependencies, which will involve using mocks, stubs, synonyms or aliases to deploy them. If you know all the external dependencies of the database, you don’t need to install the whole set. You can then prepare to install a development or test database using ‘mocks’, ‘stubs’, synonyms and ‘retro links’ to represent all the other databases. A ‘retro link’ is an ODBC link that names a referenced server with an alias that is the same name as the remote server, but actually refers back to the same server, so you can create ‘foreign’ databases on the same development server.

The problems caused by ‘missing’ external references can be avoided by various means if you know about them before you attempt the build or migration. Most RDBMs will provide metadata views that you can query to find them, such as the sys.sql_expression_dependencies DMV in SQL Server. Failing that, they can’t hide from a search of the SQL code for the 3- or 4-part dotted references, even if hidden in an obscure place such as a synonym. We can search either a build script or in the source of the views and functions and various other objects on the live server. If the source of the problem is SQL that originates in the application, or dynamic SQL created in a SQL routine, it is more difficult to find, but less important to locate. This is because you’ll get an error when the database is used, but it won’t ‘break the build’, because it can only be applied to the database from the application, after it is built.

Finding external dependencies using metadata views or profilers

Ideally, you will have an existing database at the right version, in which case you can ‘reverse-engineer’ a map of the external dependencies using system views or RDBMS-supplied tools.

Here are some possible methods to check for external objects being referenced:

  • Use the relevant query tool to interrogate the database schema – there will generally be a system view that tells you what routines (views, procedures, functions etc.) are making calls to other databases on the same or other servers. Many RSDBMSs have special tools, such as an Object Explorer pane, to display dependencies as a hierarchy of object references that include the four-part references.
  • Analyze the SQL statements executed on the database using a database profiling tool, a trace tool, or a system view – this is the only way to track three-part or four-part references from SQL that actually originates in applications and so isn’t present in the database (not a good practice), or from dynamic code executed as a string ‘on the fly’ – the so-called ‘uncontained entity’.

Overall, the best approach will depend on the specific database and the tools and resources available to you. It pays to keep an inventory, preferably in the form of a diagram, of these external references, both the ones that are referencing the database you are developing, and the ones that your database is using to reference other databases.

I’ll show how I do it in SQL Server.

Finding ‘soft’ external references

In SQL Server, I mainly use the system view sys.sql_expression_dependencies for this:

I demo the use of this view for finding dependencies in The Uses of Dependency Information in Database Development

Finding references to “uncontained entities”

In SQL Server, there are other entities that have the potential to use objects that are based outside of the database, and which would need to be in the live server for it to run error-free:

  • A routine containing dynamic SQL or deferred name resolution (known in SQL Server as ‘Unknown containment behavior’)
  • a DBCC command
  • a system stored procedure
  • a system scalar function
  • a system table valued function
  • a system built-in function

To flush these out, SQL Server provides a dynamic management view sys.dm_db_uncontained_entities which can list out these critters for further investigation:

Finding external dependencies using a regex search

If you only have the scripts for the version that you need to build, or your database system doesn’t have the system views that can list them out for you in the live database, you can search for the three and four part object references using wildcards or regexes.

To catch all external references that originate within the database, you search the SQL build scripts or SQL migration scripts, or the object source in the live database. To catch external references in the applications that use the database, you can search the source code of any applications that interact with the database for any occurrences of four-part object references.

This can be done using a code editor or IDE that supports code search functionality. The problem with doing this is that you’ll get a lot of false positives, even if you are confident with Regexes. As with many conventions, there are exceptions to the way that dot references are used, which make it more complicated. Your search will, without care, likely turn up a lot of references that look like external ones but aren’t.

For example, 3-part dotted references can be used redundantly to specify a column in a specific table and schema (e.g.,SELECT dbo.authors.address FROM dbo.authors). This is generally done by accident because, in SQL, column references are always used in the context of a set of tables. Also, many relational database systems allow use of the dot notation to reference other data objects such as XML or JSON documents. For example, the JSON PATH command allows you to put dot references in column names.

To a text search, all these appear to be external references, so the results always need to be checked carefully for these and other sources of ‘false positives’, such as dotted references in strings, comments, print statements or the procedural code. You may also miss certain external references, due to quirks in how the dotted notation is supported. SQL Server, for example, allows an empty reference, such as MyDatabase..MyTable, which means MyTable, in whatever is the default schema of MyDatabase. The reasons for allowing this are lost in the mists of Microsoft.

There is no simple way of  sifting through all this unless you know all the participating servers and databases. To make matters worse, SQL Object references can be quite tricky to find reliably just via a regex, because of the rules they live by. The task is made simpler by using a string tokenizer, but even that isn’t straightforward. Each part of the tokenizer can be a bracketed MS-Access-style name that contains almost anything, a SQL quoted identifier containing word characters, or it can be a proper legal SQL identifier. When it gets to four-part references, that becomes a complicated Regex.

The PowerShell tokenizer cmdlet for SQL searches

A lot of what you read seems to imply that you can simply search for external references in SQL code with a text editor, but that’s only true if you use a tokenizer so as to avoid them in comments or strings, and it will be a lot easier if you also know the names of all the external servers and databases up-front.

I use a PowerShell-based tokenizer cmdlet called Tokenize-SQLString.ps1 that I wrote to help with SQL Searches. This cmdlet provides these references for you. Depending on the way that the SQL was written, these may or may not be external references, so you need to check them ‘by eye’ unless you know the names of all the external servers and databases.

As an illustration, here is a sanity unit-test for the cmdlet, which illustrates how hard a regex search would be. You’ll see various snares for the unwary tokenizer and the variety of delimiters in SQL Server identifiers.

And the correct result?

Value                                      Line Type                   
-----                                      ---- ----                   
MyServer.MyDatabase.MySchema.MyTable          3 4-Part Dotted Reference
MyDatabase.MySchema.MyTable                   5 3-Part Dotted Reference
MyDatabase..MyTable                           6 3-Part Dotted Reference
MySchema.MyTable                              7 2-Part Dotted Reference
[My Server].MyDatabase.[My Schema].MyTable    8 4-Part Dotted Reference
"MyDatabase".MySchema.MyTable                 9 3-Part Dotted Reference
MyDatabase..[MyTable]                        10 3-Part Dotted Reference
MySchema."MyTable"                           11 2-Part Dotted Reference

Here is an example of a pipeline that uses the cmdlet. This takes ambiguous file references so you can search a run of migration files. It can be passed a list. The tokenizer doesn’t work fast so don’t give it huge SQL files unless you need an excuse for a coffee break!

As well as eliminating the possibility of finding spurious references in comments or strings, this also avoids having to conjure up regex strings. As a bonus, it gives you the line and column number, which saves time. You can do many other types of searches with this tokenizer, but that is another story.

Searching a directory of Flyway migration scripts for external references.

Just as an example, we’ll add a cross-server reference to a server call Test migration to one of the migration scripts in our a Flyway Pubs project:

OK. We don’t want to find the cross-server reference in that string in the PRINT statement and we don’t want anything in the comment. Now we need to search through the whole directory of migration scripts to see what we find:

Searching a directory of migration scripts for external references.

Actually, we’ll miss out the scripts that just import data, since it is a waste of rather a long amount of time to search those.

And out comes the result:

Value                        Line Type                    File                              
-----                        ---- ----                    ----                              
test.pubsdev.dbo.authors       21 4-Part Dotted Reference U1.1.7__Add_Indexes.sql           
test.pubsdev.dbo.titles        22 4-Part Dotted Reference U1.1.7__Add_Indexes.sql           
test.pubsdev.dbo.titleauthor   22 4-Part Dotted Reference U1.1.7__Add_Indexes.sql           
People.Abode.End_date         388 3-Part Dotted Reference V1.1.10__AddAddressesPhonesEtc.sql

As well as our test 4-part references (and it didn’t pick up the ones in strings or comments), it picked up a 3-part reference to a column that doesn’t indicate an external reference but is legitimate.

Conclusion

There is no database methodology that makes documentation unnecessary, and the problem of missing external references illustrates the point. Ideally, the person or team deploying or building a database will know the external dependencies and will determine the correct order for a build.

It is certainly possible to reverse-engineer the correct order of build by system views, and so on, but the best practice is to document it up-front when designing the database and put into place the guidelines for a successful build or deployment of the system as a whole. This is why I like to automate as much of the process as possible.

References

Most of what I’ve written on this topic till now is SQL Server-specific, but illustrates some of the problems you’ll encounter with dependencies and the sort of techniques you’ll need to deal with them:

 

Tools in this post

Flyway

DevOps for the Database

Find out more