Managing Test Data as a Database CI Component – Part 1

Constructing a test environment for your databases can be a difficult task at the best of times. Once you've actually acquired the hardware needed and architected the environment, you still have to arrange and securely transport the data. And with the rising demand for fast feedback and continuously integrated processes, having all of this automated and operating at speed is a challenge all of its own.

In this two-part article we will review some methods for scrubbing data so that your users are protected, moving data so that there is as little impact to production as possible, and getting data so that your developers have what they need to test accurately and thoroughly. We’re also going to look at how you can start to automate this process and get fast, reliable feedback on your database development, which will set you up to take advantage of continuous integration for your databases and ultimately give you fine-grained visibility into how your database designs are performing.  In part 2, we’ll be going through some practical demos to show you how you can start automatically keeping your testing environments up to date, and shipping more robust databases as a result.

Why Do We Need (Regularly-Refreshed) Test Data?

Over the years I’ve been an advocate of quality over quantity, and by that I mean that I value the quality of software releases over the speed of software releases. All too often, each “hurried” release introduces new bugs, which requires another quick release to fix those bugs…which introduces other bugs…which…I think you get the picture.

This is not necessarily to say that “speed” inherently equates to “poor quality”; however, “rushed” deployments often result in…poor quality. In the current day and age we are rarely afforded the ability to focus solely on quality, as management always seems to want things delivered faster and faster. Under this pressure, two things become critical:

  • Being able to consistently and reliably test any changes to your database to make sure they work in production
  • Being able to test incremental changes to your database, quickly and automatically, so that no development time is wasted, and you have confidence in, and visibility into, the effects of each change.

In short, if you don’t have one already (and you really should), it becomes critical to design a staging environment that resembles your production environment as closely as possible, and then to start factoring your database into your organisation’s continuous integration (CI) process to help you get fast, regular feedback on database changes. Ideally your data and testing environment will assist you in deploying changes smoothly from staging to production, detecting and preventing bugs from going into production, and even troubleshooting existing issues.

Thus far I’ve only talked about the importance of fast, regular and robust tests in the abstract. Of course, when you’re testing a database, the data is an integral part of how the system functions, so let’s paint a picture of what this might look like in the real world…

The Danger of Stagnant Test Data

As the DBA of an agile software development team, you finally get a staging server – it’s not quite as powerful as your production server, but at least you’ve finally got “a” server to use in testing. You’ve configured SQL Server, patched it to match production (or higher if you are staging an upgrade) and, now that you feel pretty good on the system side, it’s time to get down with the dev team and start hacking away. You restore production databases to staging, and take that long-needed vacation to Hawaii (because this is what DBAs do, right?).

Three months and three releases later, a horrible pattern starts to manifest. Each release introduces new bugs, and previously working functions start to behave differently. Everything runs fine in staging, but it seems like once changes get deployed to production, things just don’t run the way they should. What’s going on?

The first possibility is that your test data has gone stale, and developers are lacking up-to-date information that could have made a difference in their testing. For developers to appropriately address bugs and provide just-in-time fixes, they are going to need an up-to-date copy of the data. Because their data goes stale, most of your development team has quite possibly just gone back to testing in production, and your stress levels rise as database drift becomes an everyday problem to deal with. Does this sound familiar?

Another possibility might be that your dev team have only been testing with data that has made its way into the system through “normal” channels. What happens if random data is introduced? Your users wouldn’t be the first to occasionally just mash the keyboard. Will the system be able to handle it in all cases? If the only testing happens against existing data, bugs will slip into any part of the system that can’t handle unknown data patterns.

These two problems alone are enough to demonstrate the need for fresh test data. Your Development team is missing out on potential agility boosts by having a working test environment. Your QA team is missing out on the ability to perform meaningful regression tests. Your DBA is missing out on time they could be spending surfing the web optimizing your production stack.

Production Data and Generated Data

A common question when trying to populate your staging environment is whether to use production data or generated data, and the answer is the standard cop out – it depends on your needs!

Production Data for Current Problems

Production data is almost always necessary to efficiently troubleshoot a problem in real time (that is, a problem that real users are encountering right now.) If your users are experiencing bugs with the application, and you are using generated test data, you might find it rather difficult to zero in on the exact problem. Let’s imagine a scenario where you work for a medical billing company, and a client has discovered that some of their paid bills are not showing up in the portal as actually being paid. In this instance, you will need to look specifically at this actual client and their records in order to find the cause of the problem – artificial data isn’t going to shed much light on the issue.

Of course, if you are using production data then it is essential to refer to your company’s policy on data security (which should also factor in things like local legislation). For instance, you probably don’t want to fill your staging environments with live credit card numbers or social security numbers, unless your staging systems have met all security and compliance requirements.

Generated Data for New Problems

That said, generated data is almost always necessary when you want to effectively test your code against different or unpredictable inputs. Let’s dig deeper into the scenario above, and pretend that the error came from a malformed string that was entered at some point in the billing process. Perhaps someone in data entry cut and pasted from an Excel spreadsheet and some unknown characters went in. Had your test environment utilized randomly generated data during testing, then your CI process might have discovered the problem during unit testing.

Generated data is, in my opinion, extremely important for unit testing, pre-production testing, or load testing of new features. Clearly for cases when you add a new table, you would need to generate data, but what if you just want to pump a table full of more random values to see how things react when there are more rows, or unpredictable inputs? What if you need to put a new feature under a lot of stress to predict how it will behave as you scale the application?

These are all situations where generated data excels, and it is generally easier to manage than filtering and transforming sets of production data. I tend to favor randomized generated data so that we can see what happens when unknowns are introduced to the system, but most tools will at least keep the generated values in the same data type as the destination column, and there are tools out there that will even factor in aspects of business logic (e.g. “shipping date” is always later than “purchase date”).

As you can see, both kinds of data are critical in thoroughly testing your database and your application, but you need to understand that both have different pros and cons, and are appropriate for different kinds of testing. Your job is to understand what kinds of tests you need to run against your database, and what kind of considerations you have to factor in, and then pick the right kind of data and data-loading processes. We’ll tackle some of that in a moment.

How Much Data Do You Need?

Ideally you can get a complete copy of the production data to restore on your local dev machine or staging server. However, this clearly is not a scalable solution – once your databases start getting above 500GB, it’s likely that the length of the restore process and the available disk space will become limiting factors. So how do we know how big a sample size to take? I suppose that is the million dollar question (literally, in some environments, as your testing could save or cost millions).

After talking with a few of my colleagues and peers, I’ve found there is no definitive answer when it comes to sizing your staging data, and it really does just depend on your situation. Some SAN vendors offer writeable snapshots which would be a great way to perform some testing on larger environments, and in other cases companies will devote terabytes of storage and virtual machines in order to capture the amount they need.

My personal approach is to design systems to be as modular as possible, so that we can bring in different parts of the system for testing as needed, and thus control the amount of data required. For example, we currently have a database server that is using FileTable to hold a terabyte of images. Instead of duplicating the terabyte, we simply restore everything but the FileTable filegroup into the test or staging environment, and we are able to test everything except the existing images.
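
If you go down this route, a partial (piecemeal) restore lets you bring back everything except the filegroup holding the FileTable data. The sketch below is a minimal example of that idea; the database, filegroup and file names (BillingDB_Staging, BillingDB_Data and so on) are hypothetical, as is the backup path.

    -- Partial restore: bring the primary filegroup online and leave the
    -- (terabyte-sized) FileTable filegroup out of the staging copy.
    RESTORE DATABASE BillingDB_Staging
        FILEGROUP = 'PRIMARY'
        FROM DISK = N'\\backupshare\BillingDB_full.bak'
        WITH PARTIAL,
             MOVE 'BillingDB_Data' TO N'D:\Staging\BillingDB_Staging.mdf',
             MOVE 'BillingDB_Log'  TO N'D:\Staging\BillingDB_Staging.ldf',
             RECOVERY,
             STATS = 10;
    -- The unrestored filegroup stays offline; everything else is queryable for testing.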

Another angle to consider: if you are already testing your restores (and I really hope you are, for your sake), then it’s likely you have (or temporarily have) enough space to restore, test, rinse and repeat – right? Going back to the FileTable example, we are currently deploying a restore server capable of holding the entire database. In this situation it would make sense that, once a restore has been tested, development can access the restored copy and use it until the next automated restore. This way, we’re performing full tests across the application and database with a minimum of overhead.

In the end, it’s a decision I would advise on, but as it impacts everything from hardware infrastructure to network traffic and deployment, I’d ultimately leave it up to someone in management (i.e. the CIO or similar) to decide. Present them with all the possible solutions, then let them make the decision that is right for your company.

Getting The Data…

Once you have some idea regarding what kind of tests you’re going to run, and therefore what kind of data (and how much) you need, the next step is actually getting and preparing the data for use. There are a few approaches you could take: restore from production, replicate data from production, use SSIS to incrementally load, generate random data from scratch, or generate a consistent data set for each run. Let’s break each of those down a little, and see what their strengths and weaknesses are.

Restoring from Production

Restoring from production is probably the easiest solution to go with: You take a backup, you relocate it to your staging server, you restore…and done. Other than provisioning space, there isn’t much else to do. This method can be automated with homebrew TSQL, but my own personal choice here is to utilize Red Gate SQL Backup, and have the restores scheduled. Restoring from production also provides you with real user data, and allows you to test your restores at the same time.
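
If you would rather roll your own than schedule it through SQL Backup, a minimal homebrew T-SQL sketch might look like the following; the database name, logical file names and backup path are hypothetical, and in practice this would live in a scheduled SQL Agent job.

    USE master;

    -- Kick everyone out of the stale staging copy before overwriting it.
    ALTER DATABASE BillingDB_Staging SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

    -- Overwrite staging with the latest production backup.
    RESTORE DATABASE BillingDB_Staging
        FROM DISK = N'\\backupshare\BillingDB_full.bak'
        WITH REPLACE,
             MOVE 'BillingDB_Data' TO N'D:\Staging\BillingDB_Staging.mdf',
             MOVE 'BillingDB_Log'  TO N'D:\Staging\BillingDB_Staging.ldf',
             STATS = 10;

    ALTER DATABASE BillingDB_Staging SET MULTI_USER;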

The drawbacks with this kind of loading can be dependent on the way the application is written, and how regularly or rapidly you need to refresh your data and run your tests. I have seen environments where IP addresses or server names are hardcoded into configuration tables, which can cause a problem for testing environments that are not segmented from the production network. In this case you would need to develop some post-restore scripts that take care of this. Regulatory compliance can also cause issues – any personal identification data would need to be scrubbed, and you would need to make sure you’ve also covered things like email addresses so that you don’t chance spamming all your production users while testing new email functions! These are also factors which you could take care of with your post-restore script.
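
As a concrete illustration, a post-restore scrub might look something like the sketch below; the table and column names (AppConfig, Customers, PaymentMethods) are hypothetical, and you would tailor the masking rules to your own compliance policy.

    USE BillingDB_Staging;

    -- Point any hard-coded endpoints at their staging equivalents.
    UPDATE dbo.AppConfig
    SET    ConfigValue = N'staging-app01'
    WHERE  ConfigKey = N'AppServerName';

    -- Make sure no real customer can ever receive a test email.
    UPDATE dbo.Customers
    SET    Email = CONCAT(N'customer', CustomerID, N'@example.com');

    -- Replace live card numbers with an obviously fake value, keeping the column populated.
    UPDATE dbo.PaymentMethods
    SET    CardNumber = N'4111111111111111';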

Another drawback of restoring from production is that you will eventually need to do it again, and soon if you want your data to stay fresh. This can cause developer headaches since they will need to find a way (which usually means you will need to find a way) to save and restore any modified or new structures they have created. You’ll also have the challenge of actually getting the backups. Recently I was in a situation where I requested a 100GB backup and the IT team responsible for getting the data to me was unable to comply. The pipe between their data center and the home office was not sufficient to get me the file in a timely fashion. To make things even more difficult their security policies kept me from downloading such a file directly. In this case they ended up creating a virtual machine at the data center just so I could load and test the data.

Of course, there are a few other factors which you should consider, so let’s quickly round up some of the pros and cons of using restored backups from production for your test data.

Pros

  • Simple & technically easy
  • Provides real customer data
  • Can piggy-back on your existing DR processes
  • Easy to automate

Cons (or rather Things To Consider)

  • Coupled to how the application is written (e.g. hard-coded config details)
  • Needs to respect regulatory compliance (sensitive data, email addresses)
  • Need to avoid blowing away changes on each restore
  • Need to be able to move large files around (and quickly) OR have the flexibility to locate your testing / staging environment near to the data source
  • Difficult to test new features
  • Moving and restoring a backup is relatively slow (maintaining freshness and responsive test processes becomes a challenge)

Replication

Another possible route is to use transactional replication to replicate rows from production based on a filter, keeping your test environment trimmed with only the data you have deemed worthy to exist in your dev environment. This approach is relatively easy to configure, but it can place a load on the production server (most likely a nominal load, but load, nonetheless). A possible challenge to implementing this method is…introducing replication. It takes a special kind of DBA to know about and to want to deal with replication, and if yours is (or you are) already said kind of special, then by all means: have at it! If yours isn’t, then this can either be a great opportunity for them to learn, or it might be more efficient to look towards another method.

One thing you’ll want to keep in mind is that generating snapshots of production during production hours is usually a No-No™. You will cause all kinds of havoc, because the whole point of a snapshot is to provide a point-in-time copy, and so it will lock the tables. This solution can also be tricky to manage once you start introducing schema changes. In terms of automation, transactional replication is constant, so you will always be working with an up-to-date copy of the data.
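
To give a flavour of the filtering, the sketch below adds a single filtered article to an existing transactional publication; the publication, table and column names are hypothetical, and it assumes the distributor and publication have already been configured.

    USE BillingDB;

    -- Only rows matching the static row filter are replicated to the test subscriber.
    EXEC sp_addarticle
         @publication   = N'BillingDB_ToStaging',
         @article       = N'Invoices',
         @source_owner  = N'dbo',
         @source_object = N'Invoices',
         @filter_clause = N'ClientRegion = N''EU''';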

Pros

  • Simple to configure with replication already set up
  • Provides real customer data
  • Can piggy-back on your existing DR processes
  • Transactional replication is always running (therefore, automated)
  • Keeps test data constantly fresh (fast, regular data imports into stage / test environments)

Cons (or rather Things To Consider)

  • You need to set up and manage replication
  • Coupled to how the application is written (e.g. hard-coded config details)
  • Needs to respect regulatory compliance (sensitive data, email addresses)
  • Need to avoid blowing away changes during replication
  • Places a (potentially nominal) load on the production server
  • Difficult to test new features, or load-test systems with only small existing databases

SSIS & Incremental Load

Using SSIS as an incremental load method is, in my opinion, one of the more elegant ways to maintain test and staging data. In this scenario you creatively craft various SSIS packages to retrieve data from production based on business-defined criteria, and load it into your test environment. These packages could also include logic to trim the data, or even transform it (if you are testing new schema designs). The amount of work this takes is fairly minimal – the example we use in part two of this article was developed in less than a day by our intern. The solution itself is relatively easy to maintain, and relatively easy to modify. One of the major benefits of using SSIS is that you can take this minimal solution and make it as complex as you want, as SSIS is very flexible and feature-rich in terms of extract, transform and load.
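
The heart of such a package is usually a watermark-driven source query along the lines of the sketch below; the table, column and watermark names are hypothetical, and the package would update the watermark after each successful load.

    -- Pull only rows that have changed since the last incremental load.
    DECLARE @LastLoad datetime2 =
        (SELECT LastLoadedAt
         FROM   staging.LoadWatermark
         WHERE  TableName = N'Invoices');

    SELECT  InvoiceID,
            ClientID,
            InvoiceDate,
            Amount,
            ModifiedAt
    FROM    dbo.Invoices
    WHERE   ModifiedAt > @LastLoad;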

Pros

  • Can be simple depending on your requirements
  • Provides real customer data
  • A flexible and feature-rich method, with potential to evolve with your environments and needs

Cons (or rather Things To Consider)

  • Can be complex depending on your requirements
  • Needs to respect regulatory compliance (sensitive data, email addresses)
  • Possible load on the server during the extract

Generating Data

Generating random data from scratch can easily be done with third party tools such as Red Gate SQL Data Generator or EMS Data Generator. I can happily recommend both, and they both have command line utilities that can allow you to automate a solution. Generating a consistent data set is another solution that can work well for you, depending on how sophisticated you get with it and what your testing needs are.

When you’re trying to test for unpredictable behaviour, randomly generated data can be more useful than structured customer data. Likewise, as I mentioned earlier, if you’re trying to test a new feature, or to load-test a system for which only small amounts of data exist, the ability to create arbitrary amounts of artificial data (structured or random) can be very handy. That said, however useful random data can be, you need to be wary of inconsistencies between each sample of generated data. It’s possible that one generated sample might allow a set of tests to pass, while the next – generated from the same process – would cause them to fail.

Once you’ve decided on how to constrain and focus this approach, you could use a third-party tool to manage it for you, or you could just as easily devise a series of scripts to fill your database to the required sample size with all the known test cases. This is another approach that you could drive with SSIS.
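
To make that concrete, here is a minimal home-grown sketch that pumps a hypothetical dbo.Invoices_Test table full of randomized rows; the table and column names are assumptions, and a dedicated generator tool will cover far more cases (and respect business rules) than this does.

    -- Generate 10,000 rows of random-ish invoice data using a numbers CTE.
    WITH Numbers AS (
        SELECT TOP (10000)
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
        FROM   sys.all_objects AS a
               CROSS JOIN sys.all_objects AS b
    )
    INSERT INTO dbo.Invoices_Test (ClientID, InvoiceDate, Amount, Reference)
    SELECT ABS(CHECKSUM(NEWID())) % 500 + 1,                               -- random client id
           DATEADD(DAY, -(ABS(CHECKSUM(NEWID())) % 365), SYSDATETIME()),   -- random date in the last year
           CAST(ABS(CHECKSUM(NEWID())) % 100000 / 100.0 AS decimal(10,2)), -- random amount
           LEFT(CONVERT(nvarchar(36), NEWID()), 12)                        -- arbitrary reference string
    FROM   Numbers;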

Pros

  • Simple to set up, simple to modify
  • No regulatory concerns
  • Zero impact on your production server
  • A flexible method, with potential to evolve with your environments and needs
  • Fast – provides fresh test data on demand
  • Easy to integrate directly into rapidly iterating CI tests
  • Provides arbitrary data – ideal for testing new features, or load-testing lightly used systems
  • Easy to automate

Cons (or rather Things To Consider)

  • Uses artificial data
  • May require some lightweight configuration to tune the generated data to your needs
  • Difficult to model your existing system and customer behaviours to a high degree of accuracy
  • Inconsistencies between generated datasets mean one refreshed sample may pass tests that another would fail

Which Method To Choose

As we’ve seen, there are multiple ways to generate and manage your test data, and which one you choose is something that you need to decide based on what kinds of systems you’re testing, and what kinds of tests you’re actually running.

If you need to test new features or run load-tests on lightly-used systems, then generated data (or some cunning SSIS transforms) might be what you need. Similarly, generated data is going to be useful if your team need to be able to perform regular, fast test runs throughout the day, and if the test cases are clearly known (bearing in mind that you could also craft known data samples from customer data for this purpose).

On the other hand, if you want to be able to run tests that require real customer data, and you still need a way to get fresh test data into your testing or staging environments at speed, then you might want to take a look at either replication or incremental loading with SSIS. Those two approaches have different pros and cons, so you’ll have to balance those against what you can do and what you need.

Of course, if the speed and size of your data pipe aren’t huge issues, and you want to be able to test an accurate reflection of production, then a simple restore from your latest production backup should do the trick.

Realistically, you might want to use a mix of several of these methods to complement each other. For example, you might use generated data and SSIS scripts for your team’s fast unit, regression and lightweight integration tests, and a full restore of your production database for your comprehensive overnight test suite, when time is less of a concern, but test quality and realistic results are!

As you can imagine, given how easy all of these methods are to automate (more on that in the next article), they can all be integrated into your team’s continuous integration process to give test coverage over the database. For example, your continuous integration testing could involve routines that call a restore job or SSIS package to update or generate data, then the tests get fired off and you wait for the results while sipping on a nice big cup of Win.
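
As a rough sketch of that hook, the snippet below kicks off a hypothetical data-refresh Agent job and then runs the database test suite; the job name is an assumption, it presumes the tSQLt framework is installed in the staging database, and in a real pipeline you would poll msdb for job completion (or let the CI server sequence the steps) rather than running the two statements back to back.

    -- Refresh the staging data (restore, replication snapshot or SSIS load behind the job).
    EXEC msdb.dbo.sp_start_job @job_name = N'Refresh Staging Data';

    -- ...once the refresh job has completed...

    -- Run the whole tSQLt test suite against the freshly loaded data.
    EXEC BillingDB_Staging.tSQLt.RunAll;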

What Next?

By now you should have a pretty good idea of what your options are when it comes to generating data for your database testing, and hopefully you’ll be able to start making decisions about which approaches to try. I’ve shown you some of the benefits and caveats of various data transportation solutions, as well as when you might want to use them. We’ve discussed regulatory compliance considerations, as well as the impact each solution may have on production.

In part two of this article, we’ll examine automated restores using Red Gate SQL Backup, and an SSIS solution for getting a data sample, transforming sensitive data, generating random data, and keeping the data up to date. By the time we’re finished, you’ll be ready to start managing your database test data smoothly and automatically. We’ll have paved the way for you to effortlessly manage your testing and staging environments, freeing you to think about how to integrate with your team’s agile and continuous integration processes, and how to get that big cup of “Win”.
