What is subsetting, what are the advantages, and how does it make test data management easier?
As data grows and databases become larger and more complicated, data subsetting provides a method of working with a smaller, lighter copy of a database to make development and testing faster and easier.
But what exactly is it, how and why are developers using it – or not using it, and what’s prompting conversations about it?
The big question first. The way that I would describe data subsetting is to talk about the database like it’s a carrot cake, because that’s an easy analogy to grasp. Now, most people don’t want to eat the whole cake; they just want a slice of the cake that has all the different layers in it, the icing on top, and maybe a few walnuts as well. In cake terms, it has the same look, taste and texture as the whole cake. In database terms, it has the same referential integrity.
I see subsetting as exactly that because as a tester or developer I don’t necessarily always want to test against the whole database which is big and unwieldy with multiple terabytes. Instead, I want a representative slice of the database for me to test against. It’s small, it’s light, it’s easy to handle and if I drop it I can get another slice.
And that’s the same for any database. If you’re working with any of the big four like SQL Server, PostgreSQL, MySQL or Oracle, subsetting will resolve the challenge developers typically face when working with large databases, or trying to set up specific scenarios to test against.
The advantages of subsetting
We did a survey recently and talked to over 500 people about the advantages of subsetting, why they use it and how they use it. The major outcome was that database copies are typically too large and cumbersome to work with. Using subsetting and that slice of a database, developers can speed up the testing process and be more efficient.
Another common driver that is linked but slightly different is where there are lots of different parts to an application and a developer just wants to test, say, the checkout page or the loading page. They could take a subset of the database that just has data behind it for the checkout page or the loading page and only test those things.
So it helps make things smaller but also helps developers narrow down and really test against specific scenarios that come up as well.
The challenges of subsetting
A lot of the people we spoke to have home-rolled subsetting solutions, where they’ve come up with different ways of filtering their databases to create a smaller version. When they need to refresh it, they have a manual method that’s particular to their own development and test environments and it’s a complicated process.
The problem is that there’s no external support when there are problems and the process fails, and quite often if an individual leaves, teams are left with a legacy product no one really knows how to use. It relies on local environments, tribal memory, and it’s rarely robust, reliable and repeatable.
The alternative is … not to use subsetting and continue to struggle with really large databases, with developers trying their best to make do and run tests against them. In this case, they quite often come up against issues and find it hard to be able to do anything productive with testing. They often simply have to accept it and say: ‘Aah, it takes four hours to run this test.’
How subsetting makes test data management easier
Test Data Management (TDM) is an area that’s attracting a lot of attention because organizations are now facing many of the same challenges with their data. The volume and complexity of data is increasing, and providing database copies for use in development and testing is time-consuming and inefficient. Alongside this, outdated and low-quality data can lead to failed releases, and there is limited control and management of Personal Identifiable Information (PII).
With TDM in place, developers can spin up databases faster because they’re using small copies that are narrower in scope. Each developer can also be given their own test environment to avoid clashes and prevent them overwriting each other’s code. They can then refresh database copies more often because they’re lighter and faster to create and, in best-case scenarios, they can self-serve those copies themselves.
Subsetting comes into the picture in a big way because it helps to ease the process yet further. It makes provisioning truly representative copies of databases easier. They’re also quicker to refresh on a rolling basis so that up-to-date copies are always available. And where masking of PII is required, it’s faster and simpler and can be included as part of an automated and auditable TDM process.
It’s about taking a big thing, making it less complex, and using best practices in DevOps to achieve that nice triangle of speed, cost, and quality.
Learn more about how subsetting and Test Data Management can transform your software delivery by signing up to this informative webinar hosted by David Gummer, Redgate CPO, and Alex Yates, Senior Solutions Specialist at Redgate.