Managing Test Data for Database Development

In coming up with a strategy for managing test data, you first need to understand the many test requirements that are likely to apply to the particular database you are developing. Next, you need to create all the different categories of test data required to meet those requirements and, finally, work out how to manage it all in a way that allows a test-driven approach to development.

When learning about relational databases, we all tend to use ‘toy’ databases such as Pubs, AdventureWorks, NorthWind, or ClassicModels. This is fine, but it is too easy to assume that one can then do real-world database development in the same way. You have your database full of data and just cut code that you then test. From a distance, it all seems so easy.

In fact, rapid and effective database development usually requires a much more active approach to data. You need to work out how to test your work as you go, and to test continuously. For that, you need appropriate data with the right characteristics, in suitable quantities. You also need to plan how to ensure that, when you make changes to the database, or even minor changes to its settings, all business processes continue to work correctly. In Agile terms, you need a test-first methodology, a fast feedback loop, and iterative development. You should never cut some SQL code and only then think to yourself, “I wonder how I’ll be able to test this?”

Imagine, for example, that I need to create a routine that does a ‘sentiment analysis’ of customer feedback. Before I start coding, I’ll need a good test harness and the right test data sets, meaning carefully curated data to suit the requirements of the tests. Could I use entirely generated data? Of course not, because I’m investigating human behavior. Would I use masked data from customers? No, because there is always a risk that some fact mentioned in a message could still identify the customer. In any case, if it is a new feature, there may be insufficient existing data.

I would need a plan before I start. I would use as much customer feedback as possible that is in the public domain – a bottomless pit of human sentiment. I’d get each message rated by a panel of experts. Success would be measured by the non-parametric correlation between the machine’s ratings and the panel’s ratings. Test data isn’t homogeneous, like baked beans or oxtail soup. It must conform to a recipe appropriate to the task, and creating it is a special skill, which probably explains why test-driven database development isn’t as prevalent as it should be.
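As a sketch of the kind of success measure just described, here is a small, self-contained implementation of Spearman’s rank correlation (one common non-parametric measure), applied to hypothetical machine and panel ratings. The ratings themselves are invented for illustration; in practice they would come from the curated test set:

```python
# Spearman's rank correlation between a machine's sentiment ratings and a
# human panel's ratings -- a non-parametric measure of how well they agree.

def rank(values):
    """Assign average ranks to values (ties share the mean of their ranks)."""
    indexed = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(indexed):
        j = i
        # find the end of any run of tied values
        while j + 1 < len(indexed) and values[indexed[j + 1]] == values[indexed[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[indexed[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

machine = [4, 2, 5, 1, 3]   # machine's sentiment scores per message (hypothetical)
panel   = [5, 1, 4, 2, 3]   # the panel's scores for the same messages
print(round(spearman(machine, panel), 2))  # → 0.8
```

A coefficient near 1 would indicate that the machine ranks messages much as the panel does; this is exactly the kind of criterion for success that should be agreed before coding starts.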

To develop reliable databases, we must test as we develop. To do that, we need a ‘canteen’ system of managing development data that provides developers with immediate access to the appropriate data with which to test. Without getting the data first, and a criterion for success, there can be little progress.

Principles and challenges common to all test-data management

Once, when I was doing a database-driven application for a vast retail bank, I was told that it takes twice as long to test an application as it does to do the development work. That’s true, but only if you don’t plan your test data requirements very carefully, and if testing is postponed until release. The task can be daunting, but with good management and curation of the right sort of data for each stage of the process, testing will not hold up development significantly.

To explain why so many different types of data are necessary for the various phases of creating a working database, we need to spell out why the curation and management of data for testing is such an important task.

Sourcing the data

It isn’t usually prudent or effective to use live data for routine testing, for two reasons. Firstly, a production-based dataset doesn’t allow us to check whether the database functionality reacts appropriately to ‘boundary condition’ data and ‘negative’ data (which can include both unanticipated-but-valid data and illegitimate data), because it is already ‘filtered’. It would be like testing your ability to sort sheep from goats when all you have is sheep.

Secondly, it can present data protection and security issues that aren’t always easy to solve. It’s a hard task to ensure that sensitive or confidential information is appropriately masked or anonymized to comply with privacy regulations. Masking can be successful, but it must avoid skewing the distribution and must prevent ‘inference attacks’. This requires time and effort. Sometimes, it is tricky to fix a bug that only happens on the live system, but it pays to use the standard procedure of doing this within the staging DMZ environment using, when required by law or industry standards, only staff who have had the necessary security screening.

There are occasions when there is no alternative to using anonymized live data for development work. Often, this will be a representative subset extracted from the entire data in the production environment. This reduces the amount of data you need to store and load while maintaining the integrity and relevance for testing purposes. To do this, data must be masked in various ways to anonymize it sufficiently to comply with data protection regulations and to safeguard data privacy. However, data that has been masked is, technically speaking, just pseudonymized unless the foreign table references are also juggled to prevent inference attacks. The conditions under which pseudonymized data is held are considerably stricter than for fully anonymized data.
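One common masking technique, sketched below under assumed names and values, is deterministic substitution: each real value always maps to the same substitute, so that joins and foreign-key references still line up across tables. As the paragraph above notes, the result is pseudonymized rather than anonymized, so the masking key must itself be protected:

```python
# A minimal sketch of deterministic masking (pseudonymization). The key, the
# substitute surname list, and the function name are all hypothetical.

import hashlib
import hmac

MASKING_KEY = b"keep-this-secret"   # hypothetical secret, to be guarded
SURNAMES = ["Smith", "Jones", "Patel", "Garcia", "Chen", "O'Brien"]

def mask_surname(real_surname: str) -> str:
    """Pick a replacement surname deterministically from the real value."""
    digest = hmac.new(MASKING_KEY, real_surname.encode(), hashlib.sha256).digest()
    return SURNAMES[digest[0] % len(SURNAMES)]

# The same input always masks to the same output, preserving joins:
print(mask_surname("Factor") == mask_surname("Factor"))  # → True
```

Because the substitution is repeatable, the same surname masks identically wherever it appears, which keeps referential integrity intact in the masked copy.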

Data distributions and means

The data that you use for testing must be as similar as possible to the data that is used, stored, or transformed by the database. It is relatively easy to measure the distribution of the live data in a production database and to mimic this distribution in the generated data. All the popular relational databases use statistics about the data distribution to optimise the query plan for each SQL query. While the implementation of this optimisation varies between relational databases, it is always there. For numeric data, it isn’t just the distribution that must be right, but also the mean or average value. A date field, for example, would cause problems if it was wildly out. Quantities in a shopping basket would need to be realistic.
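As a small illustration, the sketch below generates shopping-basket quantities whose distribution and mean mimic figures that profiling the live system might have produced. The mean and standard deviation here are assumed values, not measurements:

```python
# A sketch of generating numeric test data whose distribution and mean mimic
# production. The profiled figures (mean 3.2, s.d. 1.1) are hypothetical.

import random

random.seed(42)  # reproducible test data

PROD_MEAN, PROD_SD = 3.2, 1.1  # assumed to be measured from the live system

def generate_quantities(n):
    """Normally distributed quantities, clamped to a sane minimum of 1."""
    return [max(1, round(random.gauss(PROD_MEAN, PROD_SD))) for _ in range(n)]

sample = generate_quantities(10_000)
mean = sum(sample) / len(sample)
print(f"generated mean: {mean:.2f}")  # close to the profiled 3.2
```

The same approach extends to dates and money values: profile the live distribution first, then parameterise the generator from those measurements so the optimiser sees realistic statistics.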

If you don’t provide a realistic workload while testing, you are likely to miss necessary indexes or provide indexes that are worse than useless. ‘Tuning’ has to be done with a realistic workload, which is why it is so often only in the Staging part of deployment that performance problems surface. One might think that performance tuning could therefore just be postponed to Staging but, unfortunately, a procedure or function can easily contain code that expects a particular distribution of data for a process. If developed with incorrectly distributed test data, this can slow to a crawl when presented with real data.

Understanding the database constraints

A well-designed relational database will have constraints that define the range of valid entries for each column (or combination of columns). They prevent bad data from reaching the database tables. Obviously, the data that you generate must conform to these constraints, otherwise the data import will fail. This means that if you opt to use a data generation tool, it must be able to generate data that complies with these constraints. The constraints that cause trouble are FOREIGN KEY constraints and CHECK constraints, which can be defined at the table level and can involve several columns.
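A constraint-aware generator has to build each row from the rules, rather than checking afterwards. The sketch below, using a hypothetical products table with a CHECK constraint (0 < price <= 1000) and a FOREIGN KEY into a categories table, generates prices only within the permitted range and picks category ids only from keys that already exist in the parent table:

```python
# A sketch of constraint-aware data generation. The table, constraint, and
# key values are hypothetical.

import random

random.seed(1)

category_ids = [1, 2, 3, 5, 8]  # primary keys already in the parent table

def generate_product():
    return {
        "category_id": random.choice(category_ids),      # satisfies the FK
        "price": round(random.uniform(0.01, 1000.0), 2), # satisfies the CHECK
    }

products = [generate_product() for _ in range(100)]

# Every generated row passes the constraints, so the import won't fail:
assert all(p["category_id"] in category_ids for p in products)
assert all(0 < p["price"] <= 1000 for p in products)
print("all generated rows satisfy the constraints")
```

Generating parent rows first, and drawing child keys from them, is the general pattern for respecting FOREIGN KEY constraints in generated data.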

Special data types

Not all databases stick to the classic relational data types, especially when it comes to ‘document data’ such as XML or JSON. Geographic data types can also be tricky, and image data too can cause problems.

Row-level relationships

The intrinsic problem of data generation and masking is that the data within a row has its own relationships. People are born before they die. In a database of suppliers, a French manufacturer will likely have a French address. Row-level constraints deal with the problems of checking for inconsistent data within a single record, so on a well-constructed database, you may even be unable to load masked or generated data that isn’t consistent.
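The simplest way to keep related columns consistent is to generate them as a unit rather than independently. In this sketch, a (hypothetical) person record derives the date of death from the date of birth plus a lifespan, so the ‘born before they died’ rule can never be violated:

```python
# A sketch of keeping generated columns consistent within a row: the date of
# death is derived from the date of birth, never generated independently.

import datetime
import random

random.seed(7)

def generate_person():
    born = datetime.date(1900, 1, 1) + datetime.timedelta(days=random.randrange(30_000))
    lifespan = datetime.timedelta(days=random.randrange(365, 36_500))  # 1-100 years
    return {"born": born, "died": born + lifespan}

people = [generate_person() for _ in range(1_000)]
assert all(p["died"] > p["born"] for p in people)
print("every generated person was born before they died")
```

The same derive-the-dependent-column technique applies to the supplier example: pick the country first, then generate an address from that country’s address format.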

Data formats

You’ll be testing data inputs, perhaps simulating a connection from an application that uses a JDBC or ODBC connection. The data could be required by the test rig in one of a range of formats, such as JSON, CSV or XML, depending on the computer language or script being used as a test rig. Inputs like this will be very different from table data, because they represent the data as it existed in the application(s) before the validation and verification checks you perform as it is transformed into relational data and inserted into tables.

Post-processed data

Just like a quiz, when testing a process, you need to be able to verify that you got the ‘right answer’. This means being able to compare the data after it is transformed by the process with a dataset that represents what the post-processed data should look like, to ensure that they match.

For example, if you are creating invoices, dispatch notes and so on from a shopping cart, you need the correct versions, the ‘right answer’, to check against. These datasets of ‘post-processed’ data should be routinely checked for their accuracy.
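The comparison itself is simple once the curated ‘right answer’ exists. This sketch runs a hypothetical cart-to-invoice process and checks its output against an expected dataset that has been verified in advance:

```python
# A sketch of checking a process against its 'right answer': the rows the
# process produces are compared with a curated expected dataset. The cart,
# process, and expected invoice are all hypothetical.

cart = [("widget", 2, 4.99), ("gizmo", 1, 12.50)]  # (item, qty, unit price)

def make_invoice(cart):
    """The process under test: turn a cart into invoice lines plus a total."""
    lines = [(item, qty, round(qty * price, 2)) for item, qty, price in cart]
    total = round(sum(line[2] for line in lines), 2)
    return {"lines": lines, "total": total}

# The curated 'post-processed' dataset, verified by hand in advance:
expected = {"lines": [("widget", 2, 9.98), ("gizmo", 1, 12.50)], "total": 22.48}

assert make_invoice(cart) == expected
print("process output matches the curated expected dataset")
```

If the comparison fails after a database change, either the process has regressed or the business rules have legitimately changed, in which case the curated dataset must be re-verified and updated.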

Data storage

It may come as a surprise that this can become an issue. The problem is that there is no common, defined standard for storing datasets. CSV, the most prevalent medium used, has never been defined in an official industry standard. This means that it isn’t always usable between RDBMSs because of differences in the way that the comma and inverted commas are ‘escaped’ between RDBMSs. There is no common understanding of ‘escaping’ CSV between tools, applications, scripts and databases. Using the tab-delimited version of CSV can avoid this, if supported, unless you use the tab delimiter within data, of course.
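The escaping problem is easy to demonstrate. The sketch below writes the same row, containing an apostrophe, embedded quotes and a comma, as standard quoted CSV and as tab-delimited text, then round-trips both. Tools that disagree on the quoting rules will fail exactly this round-trip:

```python
# A sketch of the CSV escaping problem: the same row written with standard
# quoting and as tab-delimited text, then read back.

import csv
import io

row = ["O'Brien", 'He said "hello, world"', "12.50"]

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(row)
comma_version = buf.getvalue().strip()

buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerow(row)
tab_version = buf.getvalue().strip()

print(comma_version)  # the quoted field has its quotes doubled: ""hello, world""
print(tab_version)

# Reading each version back should round-trip exactly:
assert next(csv.reader([comma_version])) == row
assert next(csv.reader([tab_version], delimiter="\t")) == row
```

A round-trip test like this, run against every tool in the chain, is a cheap way to find out early whether your chosen storage format survives the journey between systems.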

Even after surmounting all this, the way that complex data such as images is rendered in text form can vary between RDBMSs. Some database systems, such as SQL Server, have clever, compressed proprietary file types that are fast and easy to load. This is fine up to the point where you need to load the data into an application in order to modify or update it, as required for data management, because they can’t be read into anything but a database using the same RDBMS. Heaven only knows how you would ever write test data in that format other than using an intermediary database.

What categories of test data do I need?

Even when working on a new database feature in isolation (i.e., in a branch), there is a wide range of requirements for test data, depending on what you aim to achieve. The test regime for integration and regression testing is very different from the requirements of unit testing.

For the former, you need standard data sets that allow you to check that any process produces the correct result. You can’t do this with data that changes between test sessions, because the post-processed dataset that represents the correct result can only be created by hard graft, expertise and checking. For example, the result generated by the database application must be checked either against a result generated by a human or against one from an entirely independent, verified process.

The same goes for performance testing. You can’t ensure that the new features you’re about to deploy won’t degrade performance, unless you measure on standard data under the same circumstances, just changing the database code.

Just about any type of test will need test data that includes unexpected or unlikely inputs. Database developers are constantly surprised by the ways that data can shame them unless they are careful to include ‘negative’ and ‘boundary’ test data. Specialist Test Engineers have rich collections of such data that can humble any developer.

The following are the main categories of tests data we’ll need to be able to generate and manage:

Positive test data: this is data that can be checked before and after a database process to ensure that the process has changed the data correctly. This is done to ensure that the system behaves correctly under normal conditions.

Negative test data: A characteristic of any type of test data is that it needs to include ‘negative data’ to test your assumptions about what the data is likely to include: both unexpected input values and invalid or ‘bad’ input. For example, a test dataset of personal names should always include a few ‘rogue’ (but legitimate) surnames such as Null or O’Brien. Negative test data is designed to identify potential vulnerabilities, boundary cases, and the resilience of the database. The tests will use this data to ensure that unexpected data is handled correctly, and that the database is protected from bad data by means of ‘sanity checks’ and database constraints.

Boundary test data: This type of test data uses values at the boundaries of the business rules for acceptable ranges or limits, and at the edges of the range of values that make sense and can be accommodated. The test will use this data to ensure that the database handles these boundary conditions correctly. Boundary testing will assess the database’s ability to validate data, check ranges and operate boundary constraints.

Process check data: This is the data that is used, mainly for integration tests, to check that a database process actually does what it is intended to do. Normally, it includes the data as it is before the process, and the data after the process is executed correctly.

Penetration test data: Security test data has little in common with other types. It will involve strings that could expose SQL Injection vulnerabilities, lists of overused passwords, and many data tricks used by hackers to take control of a database. It is used to assess the security measures implemented in the database and other software components or systems. As well as the more obvious SQL injection tests, it will check for robust credentials, and sensible access controls. The aim of this sort of testing is to identify vulnerabilities, potential breaches, and ensure compliance with security standards and regulations.

Realistic test data: This is positive data, perhaps seasoned with some negative and boundary data, but realistic in both appearance and distribution. In order to check a database’s indexing strategies, for example, the data must have the same characteristics as the data that the production system will handle. For this to be useful, it must be available in varying quantities for the various demands of performance testing, stress testing and integration testing.
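Two of the categories above, negative and boundary data, can be sketched together. The example below assumes a hypothetical business rule (order quantity between 1 and 99) and a deliberately naive surname-import routine, to show how boundary values probe the limits and how a legitimate rogue surname like ‘Null’ exposes a bug:

```python
# Boundary data: values at and just beyond a hypothetical rule's limits.
MIN_QTY, MAX_QTY = 1, 99  # hypothetical business rule

def is_valid_quantity(q):
    """The validation under test."""
    return isinstance(q, int) and MIN_QTY <= q <= MAX_QTY

boundary_cases = {0: False, 1: True, 99: True, 100: False}
for value, expected in boundary_cases.items():
    assert is_valid_quantity(value) == expected

# Negative data: legitimate but rogue surnames that shame naive code.
rogue_surnames = ["Null", "O'Brien", "de la Cruz", "McDonald-Smith"]

def clean_surname(s):
    """A deliberately naive import routine that wrongly treats 'Null' as missing."""
    return None if s.strip().lower() in ("", "null", "none") else s

lost = [s for s in rogue_surnames if clean_surname(s) is None]
print(lost)  # → ['Null']  -- a real customer called Null has disappeared
```

The boundary checks pass quietly, but the rogue surnames catch the naive routine: a genuine customer named Null would silently vanish on import, which is precisely what negative test data exists to reveal.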

What are the different types of tests?

I cover the various types of test and their data requirements in more detail in this article on using Flyway: Testing Databases: What’s Required?

Unit test: This focusses on the features under development. The unit test ensures that the feature behaves as expected according to the specification under all conditions, including bad data, or even penetration attempts. A unit test harness will usually be stored with the build code for the components being tested so that changes and bug-fixes can be done quickly.

Regression Test: Whenever changes created in a branch are merged into the parent database development branch, there is always a chance that it will introduce unintended side effects or regressions. To check for this, the existing functionality must be tested by running every process to ensure that the existing features of the database continue to work correctly. Existing tests are reused, with existing standard data. This aims to catch any defects or issues that may have been introduced as a result of the merge.

Integration Test: This type of test involves end-to-end checks of the database’s integration with other participating databases, applications, middleware, APIs, or external systems, to make sure that the database functions as expected in a broader system context.

Performance Test: Performance test-data is used to measure and evaluate the database’s performance characteristics, such as response time, throughput, and resource utilization. It includes test data that represents the expected usage patterns and workloads. Performance testing helps identify performance bottlenecks, scalability limitations, and optimization opportunities.

Stress Test: Stress test data is designed to evaluate the performance, scalability, and robustness of the database under high load or stressful conditions. It involves simulating large volumes of data, concurrent user activity, or extreme data scenarios to assess how the database handles such stress factors. Stress testing helps identify performance bottlenecks, resource limitations, and potential issues under heavy workloads.

So, what’s involved in Test Data Management?

The demands of continuous integration and release make it increasingly likely that database teams will aim to make testing part of the general development culture, rather than delegating it to a specialist role. This means that some important tasks must be done as part of database development, particularly the management of test data and the support of continuous testing.

Test data generation

This is the task of providing the test data in the first place, a process we’ve discussed at some length already. It is only part of the broader task of test data management.

Test data maintenance

Test data has to be managed because it must be reused on every release. It takes time and effort to produce, to check, and to slice into subsets when necessary. It will also need to be updated and corrected when errors are found, and changed to conform with changes in the current version of the database.

Test data validation

There are two types of validation that are required. The primary one is to create the data that is used to check whether a database process was successful in producing the result that is considered by all the ‘stakeholders’ to be correct. The other type of data validation checks that the data meets the test requirements of the development team. The most obvious requirement is ‘completeness’ in terms of whether it is able to test for all the errors that could conceivably happen.

Test data archiving

Each version of a database will have differences in the data that is inserted into it, because tables are amended, added and removed. If you need to work on a previous version of a database, for example to track down the history of a bug, you need the appropriate data as well as the source.

Test data security

An important task for test engineers nowadays is to ensure that data anonymization via masking is effective. This requires exploratory tests to ensure that it is impossible to identify any individuals within the database by means of ‘inference attacks’ where individuals have two unusual characteristics that together can identify them, and by doing so can lead to further identifications.

Summary

Test data management has become more complex because of tighter legislation concerning the use of personal, financial, or sensitive data for database development. However, it is also due to the better integration of specialist roles into the development process. This has led to greater and more sophisticated demands for a ‘canteen’ system for accessing the test data required for development and debugging. My article, Test Data Management for Database Migrations, explores in more detail how a canteen system for TDM might work in a migrations-based approach, using a tool like Flyway.

Database development has moved on from a time when database testing was done as a separate activity from database development. Although it’s a conceptually simple model, it led to a wilderness of failed releases. It also didn’t encourage the move towards developing the techniques necessary to support test-driven development of databases, which was important for rapid releasing.

Developers need test support right through the typical development cycle, especially for unit testing and integration testing. Both types of tests require almost the whole range of test data, and creating these test data sets requires the skills and cunning of the test specialist. By encouraging test specialists to facilitate testing throughout the development cycle, rather than working in isolation, development teams have been able to move to a system that makes it easier to create and use test harnesses, to check a release candidate in advance of a release, and to provide rapid patches for test failures.

Not only has the development cycle become more rapid, but unit test data now serves as supplementary documentation of the expected behavior of the database code. By having a comprehensive suite of such tests, it becomes easier for developers to understand the codebase, collaborate with other team members, and make changes confidently without fear of breaking existing functionality.

Further reading

  1. But the Database Worked in Development! Preventing Broken Constraints
  2. But the Database Worked in Development! Checking for Duplicates
  3. Getting normally distributed random numbers in TSQL
  4. A series of articles on statistical calculations in SQL:
    1. Statistics in SQL: Pearson’s Correlation
    2. Statistics in SQL: Kendall’s Tau Rank Correlation
    3. Statistics in SQL: Simple Linear Regressions
    4. Statistics in SQL: The Kruskal–Wallis Test
    5. Statistics in SQL: The Mann–Whitney U Test
    6. Statistics in SQL: Student’s T Test
  5. My current set of articles about test data management during Flyway development.