Database Lifecycle Management (DLM) recommends that you integrate database changes as frequently as possible because it makes change easier and identifies mistakes faster. By integrating changes to database code and components as often as possible, and testing the results, problems become visible while they are still easily managed. Continuous Integration (CI) remains the ideal, though it is not always achievable.
CI doesn’t get rid of bugs, but it does make them dramatically easier to find and remove if the team can be sure that the cause is one of the changes that were made since the last successful integration. If a developer’s code has introduced an error, or has caused another developer’s previously-working piece of code to fail, then the team knows immediately. No further integrations occur until the issue is fixed.
Many of the benefits of Database CI stem from the automated tests that prove that the database meet the requirements defined by the tests, at all times. The unit tests are important: they ensure that the individual units of code in isolation always function, and the integration tests make sure that they work together to implement processes. We’ll cover the specifics of what database tests we need to run in the next article. In this article, we will discuss how to adopt and implement database CI as a practice, its importance in Database Lifecycle Management, and how to overcome common challenges when making the database a full partner in your existing application CI processes.
Why is integration so important?
All applications are made up of components. Even an application hand-crafted from procedural code will depend on third-party components, frameworks, libraries, operating-system interfaces, data sources and databases. At the other extreme, the custom application could be a mashup of standard off-the-shelf applications.
All components are liable to upgrades and changes, especially during active development. Each time any component part of an application changes to a new version, we must perform a complete build all parts of the application that rely on that component, followed by integration, with its integration tests. With a rapidly changing application, integration must be continuous so that we can prove, continuously, that nothing has been broken as a result of any change.
Any change to the application code requires integration, as does any change to a third party component. Where does the database fit in? From the application-perspective, the database is just another component, and a typical corporate application can be dependent on a handful of databases. A database change merely triggers the same work as any other component. The application integration tests must prove that all database access methods continue to perform correctly, according to the database interface specification.
From the database perspective, it is, in effect, an application, with its own hierarchy of components on which it depends. Just as for an application, any change to the database code or components should prompt a fresh build, followed by integration and its tests.
The purpose of Database CI, then, is exactly the same as for application CI. The development team establish a working version of the database very early in the development cycle, and then continue to verify regularly that it remains in a working state as they expand and refactor the schema and database code objects. Developers integrate new and changed code into a shared version control repository several times a day. Development proceeds in small steps. Developers first write the tests that, if passed, will prove that a small new piece of functionality works. They then implement the code to make the tests pass. When the tests pass, they commit the code to “trunk” in the shared VCS, and their “commit tests” are added to the broader suite of tests for the application. Each commit, or check-in, is then verified by an automated database build or migration, and subsequent testing, allowing teams to detect problems early.
Why database CI?
CI within database development remains relatively rare, when compared to application development. Inevitably, “late integration” of database changes causes headaches and delay towards the end of a project. There are often unpleasant surprises that leads to manual scripting, and a lot of tweaking of those scripts, to get the database working as expected. Database CI minimizes the element of surprise.
Database testing is different in nature, and often far more complex, than application testing. The integration testing of a database has to go well beyond answering the question of whether code ‘passes’ or ‘fails’. It must go much more deeply into whether the system performs well under load, and if it scales properly. It should check that access rights are done correctly. Integration test should, in short, ensure that it is providing a service as expected.
However, ad-hoc integration of database changes doesn’t help. In consequence, a database release can cause a lot of subsequent problems in the QA and Production environments. The problems can stem from unreliable stored procedures, functions and triggers to schema changes. Even an apparently simple database change that requires no complicated data migrations can cause problems if not integrated early and tested thoroughly. For example, a change such as adding a new column can cause unexpected failures if certain applications don’t qualify column names, or expect only a certain number of columns or columns in a certain order. Even more potentially damaging to the business are problems affecting data integrity, which stem largely from untested or missing data quality and referential integrity constraints.
Instead, database changes must be integrated and tested regularly, and should be done continuously, if possible. Although shared-development databases are, in a sense, self-integrating, you still ought to integrate them because we are still obliged to test the overall behavior of the database system, and prove that a working database can be built from the component list, which includes not only the database code and schema changes but also SSIS jobs, CLR libraries, agent tasks, alerts, messaging and so on.
Database CI leads naturally to earlier and more frequent releases, which means that all the required scripts and files required to migrate a database from one version to the next will be thoroughly tested in development, and then again when deploying the release to QA and Staging. As a result, deployments to production will be rehearsed and far more reliable. Also, the more rapidly that new functionality and improvements can be delivered, the easier it is to improve collaboration with the DBA and systems team. If we use the same tools to both repeatedly build the latest database version in the CI environment and to deploy the release to QA, Staging and ultimately Production, then we can have more confidence that they will work as intended.
Of course, the subsequent deployment pipeline requires more than this to work. The testing in these more production-like environments is rather different in nature. It will encompass acceptance testing, and verify that security, capacity, availability and performance characteristics of the production system are acceptable to the business or customer, and in line with the signed-off specification of the system. This in turn requires a reliable process for safe delivery of production data, with the necessary obfuscations, into those environments. We’ll discuss these issues in the Release and Deployment article.
Additional benefits of a well-drilled database CI process will emerge as the team gains experience with DLM practices, and introduces more automation and workflow techniques into their database delivery pipeline. For example, if the team have integrated the issue-tracking system into the VCS, as described in the Issue Tracking article, they can respond quickly to any critical issues. Through their CI processes, they will hopefully be able to quickly spin up an environment and establish the correct version of the database, as well as data in a state it existed soon before the error occurred. This often requires a “clean” version of production data (for example, minus any sensitive data elements).They can then reproduce the error, and write a failing test that will validate the fix and prevent regressions, meaning introducing a new problem while solving an existing one.
Application-specific databases versus enterprise databases
As a general principle, it is advantageous to develop applications and their databases together. It makes sense to integrate them as closely as possible to their point of creation, in order to help improve the quality of both code bases, and the interactions between the two. It is much better to settle issues about where any particular functionality is best placed, as early as possible. However, this integration between the application and the database CI processes should not interfere with the individual CI processes for either the application or the database. Instead, it must actively support both processes in order to enhance them. When the database serves only a specific application, and the two are being developed at the same time then you may be able to do the application CI and the database CI together in one integration process, treating it as if the database code were part of the application code.
Some enterprise databases support numerous applications. If you’re working with a database through which several applications co-ordinate, or an external database, then it is likely not possible that you’ll be able to develop your application and the database as a tight unit. Database CI will still bring benefits, but in this case, the database is treated by the application as a separate component. For integration testing of the application we may need to ‘mock’ the database interface based on an “interface contract” that is negotiated with, and supplied by, the database owners. This interface will comprise a set of routines (views, functions and procedures) you call from the application to access the required data. In such projects, there is inevitably special emphasis on the testing that takes place in the production-like environments, such as Staging, which will host versions of all of the applications that use that database.
The decoupling of application and database changes will be enabled by sensible version control mechanisms, as described in the Database Version Control article. Essentially, scripts must exist in version control to roll forward, or back, between any two database versions. Assuming the live database stores somewhere its version number, and the application stores the database version with which it is designed to work, then the CI or deployment tool will always be able to run the appropriate set of scripts to create the version of the database required by the application.
Prerequisites for database CI
Once the team have agreed to implement database CI, perhaps the two major prerequisites are database version control, and automated database builds. However, there are a couple of other practices and tools that will help ensure your CI processes proceed smoothly.
Maintain the database in version control
The first prerequisite for Database CI is that the source of “truth” for the database CI process and all subsequent database deployments must be the build and migration scripts in a version control system, such as Git or Subversion, as described in the Database Version Control article. In other words, the database CI process must always be triggered from the VCS. There must be no ad-hoc builds or database modifications that bypass this formal process.
There are several CI management software services available (covered briefly a little later). Every one I’ve worked with has hooks into various version control systems. With these hooks you can set up mechanisms for automating your database CI processes.
You also have to incorporate into your version control system a mechanism of labeling or versioning successful integrations. If the latest code failed testing, it’s not ready for further integrations until fixes are implemented and it passes the CI process. This is especially true when running overnight database CI testing, with a realistic data load. Never schedule this process to run on a database version that has associated failing tests. Conversely, if the overnight integration tests pass, the database should be assigned a significant new label or version. One approach is to advance the first significant number after the major version number only if the overnight database CI tests, typically run nightly, pass (2.20, 2.30 and so on).
The next day each developer can run the scripts to create this new version in their dedicated development database, or the team can upgrade the shared database, and proceed with development, with fast integration tests running on each commit, which will progress the least significant digit (2.2.1, 2.2.2, and so on).
Automated database builds
Before you attempt database CI, you should already to be able to automate a build of your database from the command line ( i.e. separately from your CI system), using a set of DDL scripts stored in the VCS. As explained in the Database Versioning: State-based versus Migrations article, a well-documented build script for every object makes it very easy for the team to see the exact state of an object at any given time, and to understand how changes to that object are likely to affect other objects. I recommend getting a complete build of an empty database set up as soon as you have your database moved into version control. There are a number of reasons for this:
- It provides you with information on the effectiveness of your database version control implementation
- You can immediately identify gaps, such as missing objects, in your version control management
- You get feedback on your ability to build your database out of source control
Once you’ve established independently that you can successfully build a database from scratch, from its component scripts, then triggering an automated database build through your CI process, in response to database modifications, is a great first step on the road to database CI.
Early and Frequent Commits
The idea of database CI is to have a fast and continuous set of validations of your database changes, so my general advice is integrate like you’re voting in Chicago: early and often (an old joke; you’re only supposed to vote once per election). In other words, rather than hold onto functionality for days or even weeks while you’re working on it, segment the work into discrete sets of changes, such as just one stored procedure, or just the changes necessary to modify a single table, and commit those changes into version control as often as possible. Ideally, you’ll have a “commit-based database CI” process set up (see later) that builds a new database, and runs basic tests, on each commit of a database change: Even if this isn’t achievable, you’ll still want to perform as many successful integrations of your database changes as possible. You’ll then have the benefit of integrating each small change as you work on it, rather than have a single massive set of code to integrate, which is likely to result in multiple failures.
One proviso, of course, is that you should only ever commit a database change to version control after you’ve validated that it’s working on your development machine, via unit testing. If you don’t know if the code is working, don’t commit it into version control. You’re likely to cause the database CI process to fail, but even worse, you might be supplying broken code to other developers.
Isolate the CI Server Environment
The CI Server should be running SQL Server of the same version as production, on a machine with no other service on it, and with restricted logins. Don’t run your database CI processes on a server that also runs additional services, or where people are attempting to get other work done. You CI processes will eat server resources, as they drop and recreate databases, load data, run tests, and so on. If you then add in the need to have this process for more than one development team, for more than one database or set of databases, all this points to the requirement for an isolated environment.
It will ensure that CI testing doesn’t interfere with the work of others, and also that failures are related either to the CI process itself, or problems in the code you’ve modified within version control. Build or migration failures within the CI process should not be caused because someone was mistakenly reading data from the CI database or attempting to run code manually against the CI database. These types of false negatives will not help you improve your CI process or the quality of your code.
Database CI Tools
Notionally, aside from a version control system, the only other components required to begin database CI is a CI Server and a test framework. However, in order to fully support database CI you’ll also need a workflow system or some way to guarantee an appropriate response to a broken build. You will also need a reporting system.
While it is possible to build your own CI server process, it’s a lot of work. Unless you have specialized requirements, it will likely be easier to take advantage of an established CI Server such as TeamCity, Jenkins, CruiseControl, Hudson, or Visual Studio Team Services.
Each of these CI servers offers specialist functionality that that may make one of them more attractive than another, within your environment. Plan for an evaluation period to test two or more of these servers in order to identify the one that works best for you.
At a minimum, your CI Server process should:
- Integrate with your version control environment – including the ability to deal with labels, branching and versions
- Allow granular control over the workflow that determines how the database is built – the order in which actions occur, and the ability to add additional functions such as calls to tests
- Maintain a record of builds – in order to know the success or failure of any given run
- Offer connectivity to email and other alerting mechanisms – so that when the process is fully automated, you can get reports on the successful or unsuccessful builds
Every time a developer commits a database change to the VCS, we will want to trigger a database CI process that performs a build and runs a small suite of tests to ensure the basic behavior of the database structure and code. For our nightly integration builds, we’ll run more extensive integration and acceptance tests with a realistic data load. These tests need to relay information back to the build server; probably via NUnit-style XML output, which any modern CI solution can understand.
Clearly, we need a test framework to run the automated database tests. The options include:
- TSQLt – an open-source T-SQL testing framework that operates within SQL Server
- Pester– a PowerShell testing process that can work with your SQL Server instance
We cover this topic in more detail in the Database Testing article.
Bug-Tracking, Workflow and messaging system
Failed builds and integrations must be written to issue tracking system with as much detail as possible regarding the potential cause. The issue tracking system should be closely integrated with a VCS, and an associated workflow system, such as that available through Team Foundation Server, to allow the team to manage and assign issues quickly and efficiently.
For example, we should never issue further commits on a broken build, for reasons we discussed earlier in this article, so a typical workflow might, in response to a failed build, put the VCS into a “locked” state that will prevent checking in any more work. The issue will be assigned to a developer, whose task is to check in a fix that resolves the issue as quickly as possible and returns the VCS to an “unlocked” state. The team will also need a developer messaging system such as Slack or Gitter, to receive instant notification of build issues.
We cover requirement of the issue tracking system (e.g. Jira, Bugzilla) in the Issue Tracking article.
How to get started with database CI
In order to encourage the practice of database CI, I recommend introducing the following two database CI processes:
- Commit-based , or “fast”, database CI – this process runs every time we commit a change to the database. I refer to this as fast database CI, as it is a fast build of an empty database. Subsequent tests load only sufficient test and reference data to do the validation. The intent is provide initial, fast feedback regarding the validity of your schema and code objects
- Overnight , or “full”, database CI – a scheduled database build, typically nightly, followed by data load, for complete end-to-end process tests, relational integrity checks, testing of migration that involve data movement, testing of ETL processes and so on
As you start running these automated processes, you’ll encounter various forms of failure, arising from bugs in the code, problems with data migrations, failed ETL processes, errors in the integration mechanisms that build or modify the database from the scripts in the version control system, and more. As you encounter, or anticipate, issues that could cause a failed integration, you’re going to want to set up tests that catch further occurrences, and progressively refine the validation of your integrations.
In common practice, application code is generally not unit-tested against the actual database but is better tested against a test harness that provides a ‘mock’ of the database. This ensures that it is the application code that is unit-tested rather than a potentially-erratic database. Unit tests are best done quickly without any external dependencies. However, this becomes a problem, as described earlier, if the teams neglect to fully integrate and test a working database until much later in the development cycle.
It is a better practice, in my experience, to run frequent continuous and nightly database integration from as early as possible in the project. For the nightly database CI process, the application may or may not be present. If the database and application are developed in step, then it makes sense that the application will be wired up. However, in cases where this is not possible, some of the integration tests can simulate the application. For a website, for example, we can generally run in sequence all the calls to progress a basket through the checkout process. Strive for a tight coupling between the application and the database on the CI integrations and tests, but don’t sacrifice the speed needed to ensure fast feedback.
Between the fast (commit-based) and the full (overnight) database CI processes, you’ll have a complete mechanism for testing your databases at the point of development and creation. Let’s explore the mechanisms necessary to make either of these processes work.
Fast database CI
The intent behind fast (commit-based) database CI is to enable lightweight, fast testing of the databases code objects and structures, every time someone commits a database change. It simply performs a complete build of an empty database. Each time we run the fast database CI process, it will drop any existing database, and then generate a script that will build a new database at the required version.
Without running any additional tests, a successful build is, in itself, an important “test” that our fast database CI process can repeatedly build any version of a database from the component scripts in the VCS. However, it’s essential to have the fast database CI process trigger a set of simple tests that will provide your first, immediate, feedback on the new database code and structures that you’ve created or modified.
Since these tests run on every commit, the emphasis is on speed. Some of the tests may load the minimum necessary reference or static data required to run the test. We can perform more extensive or complex tests, with realistic data loads, as part of the nightly integration.
What to test during fast database CI
The whole point of the fast CI process is to get immediate feedback in the case of failures and issues with the code. This means not using data at all as part of the tests, or having the test itself load only the minimal amount of ‘static’ or ‘enumeration’ data it needs to run. It also means keeping the tests simple and direct in order to facilitate the speed of each integration.
The fast database CI process will verify that the database can be built, check for code policies, run ‘assertion’ tests on functions and so on. For example, we can run such tests as:
- Unit tests on stored procedures and functions – essentially, assertion tests that access a single database object (“If I pass in these parameter value, this is what I should get back“)
- Basic constraint tests – check that necessary table constraints, such as
CHECKconstraints are working properly
- Naming standards – do all objects, columns names and so on follow the specified naming convention?
- Standard code validations – such as ensuring the use of
SET NOCOUNT ONin stored procedures
Of course, without any real data in place, we cannot run tests to validate things such as data refactoring, or testing of ETL processes. We also may not test at this stage any processes that require complex multi-table interactions. Treat tests within the fast database CI the way data used to be treated in old-fashioned client server environments. Only run the tests you need and only when you need them. Run enough tests so that the each fast database CI run is useful for the immediate feedback it’s meant to provide.
When to run the fast database CI process
The tests that comprise the fast CI process must return results fast enough that the team can have confidence in running them many times during the day. As the name implies, the ultimate intent is that they run on every commit of any database change to version control. You can also have a fast database CI mechanism that fires the tests based on:
- Merge of changes within version control
- Branch within source control
This means, of course, that you must automate capture of changes within your database, or a set of changes, from your version control system so that on each commit, it auto-generates a script that will build a new database at the correct revision. The exact mechanisms for this will depend on whether you adopt the state-based or migrations approach to database builds and modifications, as discussed in the Database Versioning: State-based versus Migrations article.
Full database CI
The full (overnight) database CI process uses a database with a realistic and representative data load. For many tests, we can simply use standard data sets, loaded from flat files. For some tests, you may wish to use a cleansed and obfuscated version of production data. However you load it though, I’m an advocate using test data that is as close a match as possible the volume, values and distribution found in production data. The closer the match, the more likely that your tests will help you validate your code and processes prior to deploying to production.
The final section of this article discusses challenges surrounding provision and maintenance of test data sets, depending on test requirements, database size, regulatory restrictions, and more.
What to test during full database CI
With the database created and loaded with data, the full database CI process will encompass a much more comprehensive set of functionality test and integration tests.
It will, for example, test the following:
- End-to-end application processes that need access to many related tables and objects associated with these tables
- Complex table refactoring that requires careful preservation of data, as it migrates between tables, or between the old and new incarnation of a table
- Security testing, such as validating that the code being written doesn’t rely on excessive permissions that won’t exist when the application is run in production
- Functional tests of complex or problematic code
- ETL Processes
In short, any tests that you can run in this longer, scheduled process that will ensure improved quality on the code that you deliver to the next stage of the data lifecycle management process as a whole.
When to run the full database CI process
All the required steps for your full database CI process have to be automated within your CI server software. The full integration process is triggered on a regular schedule. Full database CI testing will necessarily take longer, due to the data loading steps for example, which is why it’s typical to run it nightly. Some teams might schedule it to run a few times a day if it’s viable, and if changes to the code or metadata are being made.
Again, I stress the need for every step of the CI process to be automated. If your database CI process isn’t completely repeatable, with a reliable set of results, then the feedback that you are getting from the process through the failure or success of an integration is rendered invalid.
It’s entirely possibly for a manual build or migration to succeed once, and then immediately after fail if you run it again manually and make a mistake. If you don’t make a mistake all you’ve tested is your ability to repeat a set of tasks. Automation removes that uncertainty and ensures that our feedback loop through the CI process is accurately measuring our ability generate meaningful and accurate deployments.
Of course, no amount of automated testing is a substitute for subsequent, often-manual QA tests, which ensure the business is happy with the software you’re delivering. An automated test can only check for problems you can predict and coverage is seldom 100%.
DLM Advantages of Database CI
Database Lifecycle Management aims to help to get your database changes out the door alongside your application changes in a timely manner, and in a manner that ensures protection of the information stored in the database.
A database CI process provides a mechanism that ensures immediate feedback on the changes you’re making to your databases. It lets you track your database builds migrations and testing and ensures that you’re testing your deployment process early in the process of software development, which will help ensure more accurate and successful deployments to production.
Instrumentation and error reporting
As discussed in detail in the Database Builds and Database Migrations articles, the need for the build process to be thoroughly instrumented, providing the details the developer needs, to pinpoint the exact cause of the problem, possibly including the full stack trace.
Exactly the same requirements extend to the CI processes. You should publish the mechanisms of the CI process, when the integrations run, which tests run and when, how you deal with the results, and so on, and make sure they are understood by everyone involved in the process. When a build fails, developers must receive automated notifications via email, messaging, or alerts directly within their IDE, and the system should provide log files and as much detail as possible about the cause of failure.
Failures should be reported to the team in real time. More than one person on the team may be responsible for a given failure. The feedback loop has to be universal for all your team, not just the person who has currently checked in some code. The development team leader needs to be able to delegate tasks within a project, including tasks within the CI process, using automated workflows to assign broken builds to specific development team members.
You should also make public the graphs and reports that depict your build frequency, the rate of success and the rate of failure, as well as the number and severity of the bugs you’ve captured through your CI process.
All of this data will demonstrate to developers, operations, governance alike what processes fail regularly, and therefore where to direct improvement efforts.
Auditing and Governance
The team need to consolidate in a centralized place the record of the integration process, such as statistics from various tools or build outputs. This would allow supervision of the integration pipeline and various metric associated with it. There ought to be an artefact repository to store all build outputs for easy access.
An example of why it is important is when you need an audit report of when a particular change happened and why. You could get some of this from version control, but the important point is when the change was first successfully integrated. The CI process has to be able to automate traceability and governance rules such as code policies. This will also help you arrive at better measures of technical debt and get a hand on delivery date estimates.
The CI process must be flexible enough to require additional tests such as functional tests, integration or performance tests, as deemed necessary, before release of a database from development into the deployment pipeline. Of course, performance and scalability tests will occur after the code is released for deployment to QA, Staging and beyond, but if you can do some of these in integration than the chances of an easy passage through the delivery pipeline is enhanced.
It’s all about determining the correct tests necessary to ensure that the CI process is assisting in your overall delivery of changes to the system. The correct tests will help to ensure that auditing and governance processes are well covered, early in the process. As with all the other tests run through the CI process, it’s about providing the right feedback, fast.
Faster issue resolution
If issues are discovered in production, IT Operations should be able to fall back to more stable builds that are retained in their central management server (see Release and Deployment article). In the meantime, many of the practices implemented as part of database CI will prove critical to a team’s ability to respond quickly to the issue to be able to put tests in place that avoid them recurring.
If you have the luxury of being able to use obfuscated production data as part of full data CI testing, then the team will need to become proficient in recreating the production systems in a safe environment. In a DLM system, it is quite possible that the operations team will take on the preparation of ‘production-scale’ databases, since they have the security access and clearance to see the real data whereas the developers would never have this. This is another incentive to involve the operations team in the integration testing.
The ideal case may be to have a Production-parallel environment in which all the events are synchronized except for those that cross the system boundaries, which are mocked with fake endpoints. With this in place, the initial reaction to a Production error can be to take a production ‘snapshot’, synchronize the production-parallel environment, and set about reproducing the problem. The team needs to reproduce the database in the state just before the error, which might involve undoing a transaction set or preparing some external data.
This done, we can write an integration test that fails at the earliest point that our problem is visible, whether in application state or in persisted, database state, in order to prevent regressions and to validate the fix.
More consistent and reliable database deployments
By repeatedly practicing during development how to integrate database changes and create a working database, at the correct version, you will be in a far stronger position to start releasing to QA much earlier in the development cycle, and therefore receive earlier feedback from DBAs and operations staff on potential security, performance, and compliance issues and so on.
The goal is that the database ‘package’ to promote a database from one version to another, which our CI process produces, will be used for all environments, from development through QA and Staging up to production.
As discussed in more detail in the Building Databases article, one step toward this goal is to have a clear separation between the data and the database configuration properties, specifying the latter in separate configuration files, tested by your CI processes, and providing them as part of the database release package.
Of course, there are other possible difficulties associated with promoting to other environments the database changes tested via our database CI process, such as consideration of SQL agent jobs, use of technologies such as replication, change data capture, and so on, which will not be present in the development environment. We’ll discuss these topics in the Release and Deployment article.
Common challenges with Database Continuous Integration
Databases present a unique set of problems to a CI process. The persistent nature of data is one of the biggest problems. Unlike code, you can’t simply throw away the old database and replace it with the new one without taking into account that there is, or may be, data within that database, that you’re going to need to retain as part of your deployment because it’s necessary information for the business. You can’t simply toss it away. If you throw in issues around having to deal with potentially large sets of data, or the need to create or duplicate data for testing, you’re adding considerable time to the database CI process as a whole.
You also have to deal with the fact that you can’t create a branch, a second copy of a complete database, in place, but have to deal with this through other means. You’re also going to be faced with cross-database or cross-server dependencies. In short, the same problems that application code has when dealing with databases during testing.
All these problems add to the complexity of the CI process when dealing with databases. The following sections of the book offer suggestions and mechanisms for dealing with these issues.
Mechanisms for providing test data during database CI
For many purposes, you will use standard test data sets, so that the data is the same on every test. The full database CI process may start with the outputs of the fast database CI process, and then import a data set from files. For example, the steps to test a single end-to-end business process, which accesses the database, might look as follows:
- Run the fast CI process (tears down existing database, creates a new build, runs tests)
- Import standard data set, for example using a tool such as BCP, or equivalent
- Test the business process
However, it will ease the passage of a database through the various “release gates” on the way to production, if the team can perform rigorous integration and acceptance testing with realistic “production-like” data from as early as possible in the development cycle.
Therefore, if possible, you may opt to start the full database CI process with a restore of the appropriate version of the production database, with data that has been appropriately cleaned in order to ensure compliance with any laws or regulations. The production version would then need to be migrated to the version under test. In this case, your full database CI process will need to automate database restoration, migration, teardown, (cleaning up after a test-run) and obfuscation (cleaning existing data to protect production systems). We’d then test all of the required business processes using the restored database.
Of course, depending on the size of the database and regulations regarding use of the data, it simply may not be possible to use ‘production data’, even obfuscated, in the development environment. The following sections discuss some of the challenges of obtaining and maintaining ‘production-like’ data sets, for CI testing.
Obtaining a ‘copy’ of production
There are a number of mechanisms for maintaining a copy of a production environment for High Availability/Disaster Recovery (HA/DR) purposes, such as mirroring, replication, or using Availability Groups within SQL Server. However, all these processes limit the target system from making structural and data changes because you can only read from them, not modify the structure or the data within them, which means that they are not suitable for our purposes within the CI process.
This leaves some type of backup and restore, or a process of moving data from the production database to your test databases, such as using import. The best mechanism here is to use the restore process to set up your CI database. However, you have an additional mechanism that you must perform: before you expose the production data through the process, you need to perform data obfuscation and cleansing.
If you have to meet regulatory requirements to restrict access to production information, then you can’t simply take a copy of your production backup and restore it down to your CI environment. Because we’re going to expose the results of failures of the CI process to the entire team, that alone could expose sensitive data. You also have to deal with that fact that you might be testing functionality such as sending an email based on data inside your database. You don’t want to chance exposing your clients to test emails. Both these issues suggest that you must have a mechanism in place that cleans the data prior to using within your CI process. There are three basic ways you could address this:
- Clean the production data as a part of the CI process
- Clean the production data in a separate process, preceding the actual CI process itself
- Use functional data masking mechanisms such as the data masks built into SQL Server 2016
Perhaps the simplest way is to have a completely separate process that cleanses your production data and then backs up the clean database for use within your CI process. The data cleansing process will modify or delete data as needed to meet any regulatory requirements. You need to remove any risk that your CI databases could still have sensitive information within them and could expose that information to unauthorized people as part of a subsequent deployment process. All the data cleansing mechanisms and processes should probably be developed by individuals within the organization who appropriately have permission to view production data.
Finally you can use internal processes exposed through the SQL Server database engine, such as Dynamic Data Masking, as a way to ensure unauthorized people, which can include your CI process, only ever see appropriate data.
As discussed previously, if you involve the operations team in integration testing, they may assist in providing preparation of ‘production-scale’ databases for CI testing, regulatory compliance permitting, in a safe and reliable manner.
Nevertheless, the use of ‘production’ data, may not be possible in some environments. Even cleansed and obfuscated data can sometimes reveal damaging facts, if that data is accessed illegally. In such cases, the security implications for using such data in a development environment would probably make it unviable. The alternative is to import from flat files data sets that mimic as closely as possible the characteristics of the production data.
If you have a terabyte or more of data, you’re just not going to want to go through the process of backing that up and restoring it, let alone attempting to cleanse it, in order to maintain your CI process. This is actually one of the more difficult problems to solve when working with databases within CI. Your solutions are frankly somewhat limited. You can take production data and, as part of your cleansing process, delete a substantial amount of data in order to shrink your database down to a reasonable size. You can choose to use data load mechanisms to create and maintain a set of test data. You can look to third party tools to provide you with a mechanism to create smaller databases through compression or other mechanisms.
Of course, there are drawbacks to this. You’re not going to have a complete set of production data, so you might miss certain unique aspects of data or data relationships within production in your testing. You’re also not seeing any issues with data movement and maintenance that are simply caused by the size of the data you’re moving around. Unfortunately, these issues are not easily addressed and may just be somewhat problematic for your CI process. In short, CI just isn’t going to solve or prevent every possible problem you might experience in your production environment.
Taking advantage of a third party tool to create small or instant copies of databases can alleviate these issues, but it adds overall complexity to the entire process of arriving at clean data ready for CI testing. However, it can be the best bet for keeping the size of the database down while still allowing for a full set of production data to be used for testing.
Time required to perform integration testing
The scheduled database integration tests ought to perform a complete test of modifications to your database schema and code (ideally with the application). These tests will also need to ensure that any migration scripts work correctly and preserve existing data. This means you will have to take into account the amount of time it takes to deal with your data as part of your full CI process.
You’ll have to measure and account for the time that a restore of your database takes. If you’re automating data cleansing as part of the CI process, then this also will have to be added to the full CI processing time. You’ll also have to allow time for data migration and for additional testing time, including validation that your data movement was successful and you haven’t lost any information.
All this suggests you need to attempt to keep the amount of data under your CI process as small as you can. Refer back to the issue with large data sets above. These are your primary mechanisms for getting the time down to complete your full CI process.
The time needed to deal with data is one of the main reasons I advocate so strongly for having a two-step approach to CI testing. The fast CI process ensures that you can have a continuous stream of fast validation of basic functionality. You can then do the work necessary to set up the full CI process and deal with time and data issues within a wider time-frame, supplying the feedback necessary to validate your code changes, but not negatively impacting your ability to get work done.
However, you can certainly move any and all types of tests that might involve larger data sets away from the CI process. Just understand that you’re making the choice to move your tests further away from the development process, which can, in some cases, make fixing issues more difficult and time consuming. The whole point of setting up a CI process is to provide a set of mechanisms for testing early in the development process. However, there is nothing that proscribes moving these tests out. Each environment will have unique challenges. You’ll have to figure out how best to deal with yours following the guidelines laid out here.
Branches within Source Control
Branching code within source control is common. The Database Version Control article will discuss this process in a lot more detail. Suffice to say, at least three branches are fairly common
- What’s in production now
- Hot fixes getting ready for production
- New development
As the application code branches into these three, or more, different sets of code, you’ll have to branch the database as well. The only mechanism I have found for branching a database is to create more than one copy of that database. This leaves you with only a couple of choices. You can create multiple databases within an instance of your database server, using different names, or you can create multiple database server instances. Your environment and approach will determine which of these is best for you and your processes. Just remember this is going to exacerbate the issues around the size of your databases since, instead of one 100 GB database, you’re going to have three 100 GB databases to move around.
It’s not uncommon to have a query that retrieves information from a database created by a different application. One could argue that secondary sets of data should only be accessed through a secondary service, thereby eliminating the need for a cross-database dependency. Until those services are created though, you might have to deal with multiple databases as a part of your CI process.
The first approach I take is to immediately adopt a mechanism supplied by application developers, I create a stub database. This is a database that only has the necessary tables, views, procedures and data that I need test the database that is going through my CI process. If my testing only requires a few tables or certain data from those tables, there’s no need in copying around whole databases. Just create the necessary information, in a minimal form, and maintain that as a part of your CI processing. Chances are you’ll only ever need this for your full (overnight) CI. If you have to add it for the fast, commit-based CI process, make it an empty or near empty database.
Another approach is to use ODBC links when dealing with cross-server queries. You can then easily mock them using text or Excel-based databases, even using T-SQL to query them.
On the other hand, if you find that you’re going to need complete copies of secondary databases, then you have to treat them in the same manner that you’re treating your database at the core of your CI process. You’ll need to get them into source control, if they’re not already there, and build a full CI process around them, if one doesn’t exist. This means dealing with their data, data cleansing, branching, and all the rest. Don’t forget that this will also add to the time necessary to complete your CI process, so be sure to take that into account when setting this up.
These are not the easiest types of problems to solve. It’s just best to take into account that you will need to solve them, depending on the requirements of your system.
Database Lifecycle Management aims to help to get your database changes delivered quickly, alongside your application changes, and in a manner that ensures protection of the information stored in the production version of the database.
Database Continuous Integration uses tooling and scripting to provide immediate feedback on the changes you’re making to your databases, and allow these changes to be more visible and testable. You can give the database all of the benefits of you get from application CI. It will mean that you can validate all database changes quickly, using a frequent, automated, repeatable process. It means you can reduce human error, as well as improve the database code and schema quality. It will reduce time between database releases, and above all ensure that the quality of those releases is high, with a dramatic reduction in the number of database-related issues reported in QA, Staging and Production. Perhaps most significantly, it will help the team adopt a culture of continuous improvement for the database, not just the application.
No one is suggesting that creating a Continuous Integration process for databases is simple or easy. However, it is absolutely doable. While databases present some unique challenges within a CI environment, none of these challenges are insurmountable. Just remember that you will have to deal with your production data in a responsible manner. Focus on the need to get feedback from the CI process as a way of improving the quality of the database code you’re delivering.