14 August 2023

3141 views

4 0

Working around schema drift in SQL Server

At Stack Overflow, our environment has multiple implementations of a largely – but not 100% – identical schema. (By “schema,” I mostly mean the set of tables within a database.) I sometimes need to retrieve or update data across a large number of sites. Let’s say, pulling or removing information from the Users table, and related tables, across all of our databases.

This gets complicated.

Because:

Each of our Q & A sites has its own database, and each database has a table called Users (and some metadata is split out into a separate UserMetadata table).
A few sites are special, like Area 51, where the Users table looks a little different (it is wider, because the UserMetadata split isn’t present here for historical reasons).
A few others look like Q & A databases, like Chat. They also contain a table called Users, but here the table is used differently and the column structure is unique.
Stack Overflow for Teams is more complex. A team’s tables are shared with others within a single database, with the boundary being a schema instead of a database. There is no table called dbo.Users, but a database may contain multiple Users tables, such as Team0001.Users, Team0002.Users, and so on. I’ll deal with this special case in a future post.

IntelliSense highlights a few of these differences straight away:

IntelliSense reveals schema drift

Schema has “drifted” in these and other cases, for example when features are only applicable to one site, team, or product. Features like Collectives, Saves, or Staging Ground can lead to new columns or even entire tables that are only present in some databases.

This is okay, and there are many ways to deal with drift – even drift by design. But that’s not the point of this post. Because even if I could convince everyone all of our databases should have identical schema, I can’t snap my fingers and make it happen overnight. In the meantime, I have to deal with these differences.

A fake example.

Imagine I want to collect a list of users with more than one answer and who mention groundhogs in the “About Me” section of their profile. In any Q & A database, I can issue the following query:

SELECT u.Id, u.AnswerCount

FROM Users AS u

INNER JOIN UserMetadata AS um

ON u.Id = um.UserId

WHERE um.AboutMe LIKE N'%groundhogs%'

AND u.AnswerCount > 1;

That works fine across “normal” site databases.

But there are exceptions.

In the Area 51 database, that query fails. As mentioned above, there is no UserMetadata table, and AnswerCount is not aggregated anywhere. I have to write the query differently:

SELECT u.Id, AnswerCount = COUNT(p.Id)

FROM Users AS u

INNER JOIN Posts AS p

ON u.Id = p.OwnerUserId

AND p.PostTypeId = 2

WHERE u.AboutMe LIKE N'%groundhogs%'

GROUP BY u.Id

HAVING COUNT(p.Id > 1);

In the Chat database, this query makes no sense, because there are no answers. So I want an easy way for this database to be skipped entirely.

The problem.

I don’t want to hard-code lists of databases that happen to support one version of the schema or another right now. This reminds me of browser detection techniques that rely on the user agent string the browser presents, rather than testing the required functionality. Picture maintaining a list of all possible user agent strings and keeping it up to date. This is simply unmanageable, and I don’t want to do it for database names, either. I want it to be dynamic, so I don’t have to update some list somewhere every time a new database gets added, existing schema drift gets corrected, or new schema drift appears. And sometimes new schema drift will appear, like with new features, as mentioned before. We know about this drift; it goes through PRs and reviews, and is what I call controlled drift. The technique I use relies on knowing the schemas are well controlled, and being aware of changes coming out. So if someone goes rogue and creates a new table in Area51 called dbo.MoreUserMetadata, and starts stuffing information about groundhogs in a random column there without anyone’s knowledge, that data will go unnoticed. That is one caveat to the solution I’ll propose.

A piece of the puzzle.

I have to repeat this process for a growing list of queries. So, in our central DBA database, I have an Actions table that I use to validate the metadata in each database before attempting to run a given query there.

USE DBA;

CREATE TABLE dbo.Actions

(

ActionID int,

QueryText nvarchar(max),

CheckTable1 nvarchar(128),

CheckColumn1 nvarchar(128),

CheckTable2 nvarchar(128),

CheckColumn2 nvarchar(128)

);

I put these queries into the table (minus the statement terminator, since these will later be placed inside a common table expression):

INSERT dbo.Actions

(

ActionID, QueryText,

CheckTable1, CheckColumn1,

CheckTable2, CheckColumn2

)

VALUES

(1, N'SELECT u.Id, u.AnswerCount

FROM Users AS u

INNER JOIN UserMetadata AS um

ON u.Id = um.UserId

WHERE um.AboutMe LIKE N''%groundhogs%''

AND u.AnswerCount > 1',

N'Users', N'AnswerCount',

N'UserMetadata', N'AboutMe'),

(2, N'SELECT u.Id, AnswerCount = COUNT(p.Id)

FROM Users AS u

INNER JOIN Posts AS p

ON u.Id = p.OwnerUserId

AND p.PostTypeId = 2

WHERE u.AboutMe LIKE N''%groundhogs%''

GROUP BY u.Id

HAVING COUNT(p.Id) > 1',

N'Users', N'AboutMe',

NULL, NULL);

As I loop through each database, I build dynamic SQL that checks for the proper metadata before replacing tokens and executing the query. To demonstrate, first, let’s set up a few databases that meet our various criteria:

CREATE DATABASE StackOverflow;

USE StackOverflow; -- only query 1 can run here

CREATE TABLE dbo.Users(Id int, AnswerCount int);

CREATE TABLE dbo.UserMetadata(UserId int, AboutMe nvarchar(max));

INSERT dbo.Users VALUES(1,5),(2,0);

INSERT dbo.UserMetadata VALUES(1,'I love groundhogs!');

CREATE DATABASE Chat; -- neither query can run here

USE Chat;

CREATE TABLE dbo.Users(UserId int);

INSERT dbo.Users VALUES(1),(2);

CREATE DATABASE Area51; -- only query 2 can run here

USE Area51;

CREATE TABLE dbo.Users(Id int, AboutMe nvarchar(max));

CREATE TABLE dbo.Posts(Id int, OwnerUserId int, PostTypeId int);

INSERT dbo.Users VALUES(5,'I hate groundhogs!');

INSERT dbo.Posts VALUES(1,5,2),(2,5,2);

I’ll use a cursor to step through each action within each database, and check if it has supporting metadata. I keep it flexible so that I can check one or two tables, and zero or one columns in each table. Why zero? For some queries, I know the columns exist in all databases, so I don’t need to check.

DECLARE @debug bit = 0; /* set to 1 to see output instead of execution */

DECLARE @go bit,

@actions cursor,

@actionId int,

@db sysname,

@context nvarchar(255),

@checkSQL nvarchar(max),

@queryText nvarchar(max),

@t1 nvarchar(128), @c1 nvarchar(128),

@t2 nvarchar(128), @c2 nvarchar(128),

@tCheck nvarchar(255) = N'EXISTS (SELECT 1 FROM sys.tables AS t

WHERE name = ',

@cCheck nvarchar(256) = N'EXISTS (SELECT 1 FROM sys.columns AS c

WHERE c.object_id = t.object_id

AND c.name = ',

@params nvarchar(max) = N'@t1 nvarchar(128), @c1 nvarchar(128),

@t2 nvarchar(128), @c2 nvarchar(128),

@go bit OUTPUT';

/* temp table to hold the results from each database */

DROP TABLE IF EXISTS #Results;

CREATE TABLE #Results(ActionID int, Source nvarchar(255), UserId int, AnswerCount int);

/* cursor to loop through each action for each database */

SET @actions = cursor FOR

SELECT act.ActionID, db.name, act.QueryText,

act.CheckTable1, act.CheckColumn1,

act.CheckTable2, act.CheckColumn2

FROM dbo.Actions AS act

CROSS JOIN sys.databases AS db

WHERE db.state = 0 AND db.database_id > 4

ORDER BY db.name, act.ActionID;

OPEN @actions;

FETCH @actions INTO @actionId, @db, @queryText, @t1, @c1, @t2, @c2;

WHILE @@FETCH_STATUS <> -1

BEGIN

/* set the right database context, and construct the metadata

check depending on whether we need to validate one or two

tables and/or columns.

The tables/columns are passed into the dynamic SQL based

on what comes back from the Actions table. */

SELECT @context = QUOTENAME(@db) + N'.sys.sp_executesql',

@checkSQL = N'IF ' + @tCheck + N'@t1'

+ CASE WHEN @c1 IS NOT NULL THEN N' AND ' + @cCheck + N'@c1)'

ELSE N'' END + N')'

+ CASE WHEN @t2 IS NOT NULL THEN N' AND ' + @tCheck + N'@t2'

+ CASE WHEN @c2 IS NOT NULL THEN N' AND ' + @cCheck + N'@c2)'

ELSE N'' END + N')'

ELSE N'' END

+ N' SET @go = 1; ELSE SET @go = 0;';

IF @debug = 1

BEGIN

PRINT CONCAT_WS(char(13), @db, 'Action:', @ActionID, 'CheckSQL:', @checkSQL);

PRINT CONCAT('@t1/@c1: ',@t1, '.' + @c1, char(13), '@t2/@c2: ',@t2, '.' + @c2);

END

/* check if the metadata exists in the target database: */

EXEC @context @checkSQL, @params, @t1, @c1, @t2, @c2, @go OUTPUT;

IF @debug = 1

BEGIN

PRINT CONCAT('Should action #', @actionId, ' run here (', @db, ')? ', @go);

END

/* if it does, we're good to go! */

IF @go = 1

BEGIN

/* wrap the query in a CTE so we can inject action/DB: */

SET @queryText = N'WITH cte AS (' + @QueryText + ')

SELECT @ActionID, DB_NAME(), * FROM cte;';

IF @debug = 1

BEGIN

PRINT CONCAT('Would have run:', char(13), @queryText);

END

ELSE

BEGIN

INSERT #Results(ActionID, Source, UserID, AnswerCount)

EXEC @context @queryText, N'@ActionID int', @actionId;

END

FETCH @actions INTO @actionId, @db, @queryText, @t1, @c1, @t2, @c2;

END

IF @debug = 0

BEGIN

SELECT ActionID, Source, UserID, AnswerCount FROM #Results;

END

Results.

Even though I have very different schema across (many!) different databases, this technique allows me to pull a consolidated result without knowing which ones store the same data in slightly different ways:

Results in spite of schema drift

If you are trying to get a handle on what the code does, the parameter @debug provides some insight. If you execute the script with @debug = 1, you’ll see this output instead of the results. This shows what code is produced and executed by sp_executesql:

Area51

Action:

CheckSQL:

IF EXISTS (SELECT 1 FROM sys.tables AS t

WHERE name = @t1 AND EXISTS (SELECT 1 FROM sys.columns AS c

WHERE c.object_id = t.object_id

AND c.name = @c1)) AND EXISTS (SELECT 1 FROM sys.tables AS t

WHERE name = @t2 AND EXISTS (SELECT 1 FROM sys.columns AS c

WHERE c.object_id = t.object_id

AND c.name = @c2)) SET @go = 1; ELSE SET @go = 0;

@t1/@c1: Users.AnswerCount

@t2/@c2: UserMetadata.AboutMe

Should action #1 run here (Area51)? 0

Area51

Action:

CheckSQL:

IF EXISTS (SELECT 1 FROM sys.tables AS t

WHERE name = @t1 AND EXISTS (SELECT 1 FROM sys.columns AS c

WHERE c.object_id = t.object_id

AND c.name = @c1)) SET @go = 1; ELSE SET @go = 0;

@t1/@c1: Users.AboutMe

@t2/@c2:

Should action #2 run here (Area51)? 1

Would have run:

WITH cte AS (SELECT u.Id, AnswerCount = COUNT(p.Id)

FROM Users AS u

INNER JOIN Posts AS p

ON u.Id = p.OwnerUserId

AND p.PostTypeId = 2

WHERE u.AboutMe LIKE N'%groundhogs%'

GROUP BY u.Id

HAVING COUNT(p.Id) > 1)

SELECT @ActionID, DB_NAME(), * FROM cte;

... repeat for every database ...

I’ll follow up soon with details on some of the more special cases.

Working around schema drift in SQL Server

This gets complicated.

A fake example.

But there are exceptions.

The problem.

A piece of the puzzle.

Results.

Subscribe for more articles

Rate this article

Aaron Bertrand

Related articles

Strategies for queries against bit columns

Snake draft sorting in SQL Server, part 2

Tags