Database subsetting. What it is, what it’s for, and why we’d love to hear from you.

We’re looking into database subsetting, and if you need it or are investigating it, we want to hear from you. Please share your thoughts in the comment section below.

What is database subsetting?

Database subsetting is creating a copy of a database that contains only a portion of the data, while still being referentially intact.

For example, suppose you have a table in a database containing millions of customers, but you only want to work with a database that contains thousands. Simple enough, just copy over a fraction of the rows, right?

But then suppose you also have an orders table that contains references to the customers. You’ll need to take only those orders that are related to customers in your sample. As your schema grows more complicated, so does the process of taking a valid subset of the data, and the need for automating the process becomes greater.

Why would I want a subset of my data?

We think there are many reasons organizations need subsets of data for development work, and right now we’re looking at three use-cases:

Testing, development and training environments – developers often want to work with real data, but providing a copy of your entire production database would be too costly in time and disk space, especially if you need to provide a copy for each developer. Subsetting allows you to work with data that contains all the necessary links between tables for your programs to function, but for a fraction of the cost.

Multiple organizations or departments – you may have a database that contains data for a number of different organizations or departments, and providing the whole database would expose data to individuals without the correct permissions. Instead, you can provide a subset containing just the data that is relevant to a particular organization or department.

Old data – you may want to discard a portion of your data based on some criteria. For example, you might want to clear out data older than a certain date, or data which you can no longer keep in order to comply with regulations like the GDPR.

What do you think?

Do you agree with us? Can you think of any other scenarios where you might need database subsetting? Are there other aspects of the use-cases we’ve outlined that should be considered?

We want to gain a deeper and broader understanding of the demand for data subsetting, so if your business has a need for any of the above or you would be interested in data subsetting for any other reason, please share your thoughts in the comment section below.

Share this post.

Share on FacebookShare on Google+Share on LinkedInTweet about this on Twitter
  • Pingback: Dew Drop - June 21, 2018 (#2750) - Morning Dew()

  • Markoolio

    One reason to work with a subset of data is data protection requirements. You may use outsourced developers who don’t have high enough security clearance to access all the data.

  • IanB

    We use it at my company for testing purposes. There is no replacement for having representative data in your test/dev environments as it will highlight data patterns that you did not anticipate. Also useful to put real data through the battery of automated tests to make sure they still pass with real world data.
    Our biggest problem was not chasing down foreign key restrictions but generating the subset in a reasonable amount of time as we have to reduce about 22TB of data down to about 4TB.

  • Bri-Guy

    Didn’t Redgate buy just such a tool (www.databee.com) a few years ago? I am interested in data subsetting for development and QA databases because running dev/QA activities on full terabyte-sized copies of production databases can take a very long time.

  • JP Pienaar

    Another scenario that I want to use sub-setting for is to move data from Dev to QA to Production environements.
    Say for example, I want to capture user permission on my Dev environment and test how it will work when logging in with different user. If all is good, I want to move those exact users and permission to QA, test it there, and then move it to Prod in a similar way.

  • Ben

    There are a couple of good open source database subsetting tools out there, most notably:

    Jailer (http://jailer.sourceforge.net/home.htm) – It’s been around for a long time, seems battle hardened, but there’s a confusing amount of configuration points
    Condenser (https://github.com/TonicAI/condenser) – Simple configuration and under active development. Unfortunately, it’s Postgres only. Documented algorithm here: https://www.tonic.ai/blog/condenser-a-database-subsetting-tool
    RDBMS-Subsetter (https://github.com/18F/rdbms-subsetter) – Solid tool, when it works. Unfortunately, it doesn’t seem to be supported any more (https://github.com/18F/rdbms-subsetter/issues/45)

  • Adamos Kyriakou

    I’ve been searching for a good subsetting tool mostly for unit/integration-testing purposes against DBs that are just too complex to set up by code (without spending ages creating fake data). I found Jailer to be far too complex tbh. A good tool that’d allow exports to a dump file or even a raw SQL script with support for the most common DBs would be tops.

  • Adamos Kyriakou

    I’ve been searching for a good subsetting tool mostly for unit/integration-testing purposes against DBs that are just too complex to set up by code (without spending ages creating fake data). I found Jailer to be far too complex tbh. A good tool that’d allow exports to a dump file or even a raw SQL script with support for the most common DBs would be tops.

  • Paul Esposito

    Will Red-Gate be doing any integration with the Databee tool they purchased. My client has multiple red-gate licenses and would much rather use a red-gate integrated solution for sub-setting (Production data to lower environments). There requirement would be to not only take a subset of the data from production to test/dev, but to be able to obscure PII as it is copied from production to the lower environments.

  • Paul Esposito

    Will Red-Gate be doing any integration with the Databee tool they purchased. My client has multiple red-gate licenses and would much rather use a red-gate integrated solution for sub-setting (Production data to lower environments). There requirement would be to not only take a subset of the data from production to test/dev, but to be able to obscure PII as it is copied from production to the lower environments.

Related posts

Also in Database development

What are the biggest challenges in monitoring SQL Server, and what should you expect from a third-party tool?

Redgate has just published the 2019 State of SQL server Monitoring Report, so I thought I’d read it to find out if the results it reveals chime with my own experience talking to large, distributed e...

Also in Blog

Learning from the Accelerate “Four Key Metrics”

There's been a lot of excitement about the book Accelerate, which summarizes research from the past several years of the State of DevOps Report from DORA (which Redgate sponsors).

Perhaps the most po...

Also in DevOps

5 reasons to read IDC’s Executive Brief on Compliant Database DevOps

Ahead of our attendance at IDC’s DevOps Conference in September 2019, we approached Jen Thomson, IDC’s Senior Research Director & Lead for Accelerated App Delivery Practice, to help us articul...