Getting Started with Cloning in Redgate Test Data Manager

This article takes a 'first look' at database cloning in Redgate Test Data Manager, explaining what it does and its advantages in team-based, test-driven database development. It will get you set up with the cloning CLI and then demos the basics of defining and creating images, and then creating and destroying database containers (clones) using PowerShell.

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

The biggest ‘time barrier’ to writing robust code, especially for a large or complex database system, is often the chore of preparing a live database, at the right version with the right data. A developer will need to do this repeatedly, to test code thoroughly before committing their work on a branch. Can’t you just restore a backup? Well, that’s what many of us currently do, and backups are handy if your RDBMS has a good backup and restore system. However, they still take time, and they need a methodical frame of mind to organize the storage. You’ll need a lot of disk space because you’ll be storing much the same information repeatedly. Although you can do many other tricks with clones, the developer will appreciate them for the way they make this chore easy, particularly if your database is properly versioned, using a tool such as Flyway.

This is the first in a series of articles where I’ll concentrate on the developer’s perspective, exploring how to automate the use of clones in Flyway development. This first article though is purely about getting up and running with rgclone, the database provisioning, or “cloning”, component of Redgate Test Data Manager.

I’ll assume the required server environment is already installed on the network or has been made available for you. I’ll show how to set up the rgclone CLI, locally, create empty data images for SQL Server, PostgreSQL, Oracle and MySQL and from them deliver clones. In subsequent articles, I’ll explain and demo how to start using these clones with Flyway.

How does Cloning work in Test Data Manager?

Here, I’ll describe only the basics of how cloning works – for more, see the docs. If you were previously familiar with SQL Clone, then the underlying concepts are similar, but the implementation is different. Whereas a clone in SQL Clone is SQL Server-only and behaves like a database within an existing server, a clone in Test Data Manager is cross-RDBMS, not just SQL Server, and behaves like a fully-functional database instance.

When you use the cloning (rgclone) component of Redgate Test Data Manager, it delivers a database instance running in a Docker container. It’s a live instance just like any other, complete with all its databases and data. However, rgclone uses data virtualization to remove the need to copy large database files into the container, every time we need to create or reset an instance.

It creates a complete, read-only image of the database files of a database instance (the data image). Each data container (clone) then uses data from that image. As developers make subsequent changes to each data container, altering either the data or design as required to produce the next version of the database, it is only these changes that require extra storage space, on disk (the changes are written to a diff file, for each container). This means that even if there are several very large databases on an instance, the rgclone CLI can spring up multiple copies of that instance, each as database container, very quickly, because it requires far fewer network resources and much less initial disk space.

The following image shows a simplified version of the cloning service architecture, for an installation on an Azure Kubernetes Cluster. See the Clone section of the documentation for more details.

Redgate Clone architecture

To make the right version and data easy to find, you just tag a combination of a version of the design (metadata) and a dataset of the database, and it’ll be there when you require it. It is also much quicker to fetch because it is a single command to create a data container and make it available.

When and why do we need it?

With rgclone we can create, remove or reset a database instance, at the required version, very quickly, even one containing multiple, very large and interconnected databases. This has some obvious advantages.

For teams that are running several lines of development, such as when developing new features in parallel, use of clones makes the branching and merging much easier. It is also far quicker because it can spin up the database(s) for each branch on demand, at the required version. Test-driven development with databases becomes manageable, without the chore of building the database to the correct version for every test, loading all the test data, and then tearing down the test harness after you’ve run all the tests.

Unlike a traditional database server, data containers are designed for test-driven development and are especially helpful in the test phase of a development project that needs a volatile database that can be spun up and torn down, repeatedly.

Of course, there are a few challenges to overcome. Current IDEs aren’t designed for working on ‘disposable’ databases, where the password is constantly changing, and we need to make sure that authentication doesn’t become a distraction, in an automated approach. We’ll tackle this subject in later articles.

Getting stuck in

All the data images and data containers, plus the file share for backup storage (the source for data images) are located on the cloning service (shown above). All you need to set up on your local machine is the rgclone CLI. You can then connect to the remote service to create data images and data containers.

The developer will generally use Bash or PowerShell to automate clone provisioning. To make it easier to do interactive work, the rgclone CLI supports Bash’s auto-completion and prompting, so if you are using Bash you needn’t look up the correct parameters or create a crib sheet to remember the commands. However, I’ll use PowerShell instead for this demo because I’m more interested in automating processes: I get bored easily when doing repetitive work.

Setting up the rgclone CLI

We assume that your friendly and accommodating admin team has set up the Cloning service. They’ll need to provide the URL of this service (the RGCLONE_API_ENDPOINT), and a means to authenticate, generally via OpenID Connect (OIDC), for interactive work, but perhaps using an access token for automated processes.

Here is a PowerShell script that will set up the rgclone CLI for you, hopefully. It will need elevated permissions to run if you need to set user-level or machine-level environment variables. It sets up, if necessary, a machine-wide environment variable that provides the RGCLONE_API_ENDPOINT. Then it detects whether you have rgclone.exe client. If not, it downloads it and installs it.

#Set the rgclone Endpoint as an environment variable if it doesn't exist
if ($env:RGCLONE_API_ENDPOINT -eq $null)
    {[System.Environment]::SetEnvironmentVariable('RGCLONE_API_ENDPOINT',
          'https://clone-external.red-gate.com',
          [System.EnvironmentVariableTarget]::Machine)
    }
# if you haven't got rgclone client installed on your workstation
if ((get-command 'rgclone' -ErrorAction Ignore) -eq $null)
    {
    $DestinationPath = "$env:ProgramFiles\Red Gate\RGClone"
    if (!(Test-Path "$DestinationPath\RGClone.exe" -PathType Leaf)) {
       
         Invoke-WebRequest  '
        -Uri "$env:RGCLONE_API_ENDPOINT/cloning-api/download/cli/windows-amd64"  '
        -OutFile "$env:Temp\windows-amd64.zip"
        #We'll set up the install directory if it doesn't yet exist
        if (!(Test-Path $DestinationPath -PathType Container)) {
            New-Item -ItemType Directory -Force -Path $DestinationPath
        }
        #It is zipped so unzip it.
        Expand-Archive -LiteralPath "$env:Temp\windows-amd64.zip" -DestinationPath $DestinationPath
        Del "$env:Temp\windows-amd64.zip"
        }
    #if you haven't got the path defined for RGClone
    if (!(($env:path -split ';'|where {$_ -eq $DestinationPath}) -eq $DestinationPath))
    { $env:Path += ";$DestinationPath" #in the persistent user-level Path
    # for Teamwork framework users -- add-path -LiteralPath $DestinationPath -scope 'user' 
    }
    }
# Set the location where we store the details, including logins. This must be in the user area
# and preferably encrypted. 
$WorkDirectory="$env:USERPROFILE\RGClone"

After that, all you need to do is authenticate via the CLI. If you’re using OIDC, then simply run rgclone auth without any flags. Unfortunately, browser-based OIDC authentication can be tedious. With the configuration I use, the system demands repeated authentication, even in a single session. Token-based authentication, where each user, whether human or automated scheduler, has its own token, is essential for automated processes.

Data images and data containers

The two ‘resources’ in rgclone are data images and data containers. A data image is a complete, point-in-time copy of the database files of a database instance, which is made from a YAML template file. This template defines the static source of the image, either empty, i.e., without any databases, or one or more backup files. It then specifies the necessary information to generate (a.k.a. spawn) a working database instance, running in a data container, such as the type and version of relational database system you want to use.

If you want an empty database set up in a particular way, as is likely with Flyway, then you can either use an empty database backup as the source for the image, in the YAML template file, or just create an empty image but provide a ‘prescript’ as a parameter, with the CREATE DATABASE code followed by DDL code that specifies all the settings for the database defaults.

Once you have a database container set up as you need it, then you can make a template from it, but we’ll leave that for a later article.

Defining data images

To create a data image, you need a ‘data image definition’, which is just a YAML file with a few parameters specified within it. The required parameters are:

  • name: Data image names are case sensitive (sigh!) and can contain letters, digits, and the special characters, hyphen (–), underscore (_) and dot (.). They must contain more than just a number and be shorter than 63 characters.
  • sourceType is either empty or backup. If you choose empty, you’ll need to create any databases that are required for the data image via a Prescript (See below). If you choose backup, you’ll also need to specify the backups parameter.
  • backups – a list of backups to restore into the image. Because the application uses Linux filesystem you need to use Linux format for backup file paths (i.e. use / as path delimiters)
  • engine is the RDBMS, currently one of mssql, postgresql, mysql or oracle.
  • version is the version of the chosen engine that you wish to install.

As well as these, there are other optional parameters

  • tags – must be lower case strings. These are handy for integrating with source control or with Flyway. They are useful for holding any changeable data.
  • Lifetime – The lifetime of the image. This defaults to ‘never’, meaning images are never killed. This parameter doesn’t refer to the data containers it produces, but the image itself.
  • Prescript – the SQL script to run before any backup is restored, such as to create the initial database environment.
  • Postscript – the SQL script to run after any backup is restored or after the prescript has been run

Use of prescripts or postscripts brings quite a bit of versatility to the process of creating a data image. During image creation,rgclone will:

  1. Spin up an empty database instance
  2. Run any defined prescript, perhaps creating one or more databases
  3. Restore any source backup files
  4. Run any defined Postscript, perhaps adding tables or data to restored databases, or modifying user permissions
  5. Save all the database files, modified or otherwise, into the data image.

As well as the data, the database files in the data image contain the metadata and so contain all the necessary information to generate, or spawn, databases in rgclone. You can’t use data images directly, though. You use them to create data containers.

Creating an empty data image

The following code writes our data image definition (template) to a YAML file in our work directory and then uses it to create a data image.

Here, we are creating a data image that will behave just like a conventional MSSQL database server on the network. Because we’ve used the empty option, we’ve just got an empty image. When we create a data container from this image, we’ll get a running instance of SQL Server 2019 with no user databases.

Creating a data image from a backup

After you’ve got over your initial excitement, you’ll discover that you can add to this simple spec to create an image that specifies a point-in-time copy of the database files of a database instance, created from a backup file. For example, here is the template to create a database image with a database in it, by specifying a backup as the sourcetype.

Creating data containers

We can use these images to create data containers, live clones of a data image. From a single data image, you can create as many data containers as you want very quickly. They will start out as being exact replicas with the same data. It is an idea to collect the output of the following command because this will tell you how to connect to the container. However, you can also get this information whenever you need it using the rgclone get data-container command.

rgclone create data-container --image dev-mssql -o yaml >  "$WorkDirectory\dev-ImageMsSql.yaml"
Type "$WorkDirectory\dev-ImageMsSql.yaml"

You can give the container a name, which is case-sensitive. It can contain letters, digits, and the special characters, hyphen (–), underscore (_) and dot (.). Names must include some letters.

If you specify a lifetime for a data container it will automatically be deleted after the time specified, in time duration format (number followed by a unit suffix. Valid units are “h”, “m”, “s”). However, if you don’t specify a value, your data container will use the lifetime specified in the rgclone provisioner. If your data container is aging fast, you can extend its lifetime like this:

rgclone update data-container <ContainerName_Or_ContainerID> --lifetime <LifetimeDurationString>

To rename an existing data container use the --name option:

rgclone update data-container <OldContainerName_Or_ContainerID> --name <NewContainerName>

Once we have all the data containers set up, we can use the output from them for connecting our GUIs to them. We can also create or overwrite the Flyway conf files that Flyway uses to make a connection. Fortunately, we can also do this by getting the information directly from rgclone, as we’ll demonstrate in another article

Things you can do with a data container

The most obvious next step once you have created a database server is to use it via an IDE. Fortunately, you can either retrieve your connection details from the output of the command when you created the resource, or by using:

rgclone get data-containers --output wide

If you specify json or yaml as the output format then you also get the information about how to connect, in a format that is easier for an application to consume. The connection details are in both ODBC and JDBC format. I’ve never had any problems in making a connection.

One connected, you can make changes to the databases on the container instance, save it as a new revision, reset it to a previous revision, load it with data from a previous revision or graduate it (which means ‘create a data image’ from it’), and once you’re done, delete it.

Clearing out your images and containers

When you start out with rgclone, it makes a lot of sense to try things out as part of the learning process. Unsurprisingly, the system is designed to avoid making it possible to accidentally bulk-remove all the containers and images. However, at this stage it is essential

<# tear-down #>
#We delete all our containers. We must do this before the images
$Containerlist=rgclone get data-containers -o json|Convertfrom-json|foreach{$_.id}
RGclone delete data-containers $Containerlist
#We delete all our Images
$Imagelist=rgclone get data-images -o json|Convertfrom-json|foreach{$_.id}
RGclone delete data-images $Imagelist

Conclusion

From my own use of cloning in Test Data Manager, I’ve discovered that it speeds up development work with Flyway and cuts out some of the distractions. I can see it as being immediately useful for automating much of database testing, especially when using Flyway with different relational databases systems, some of which don’t allow rollback. Also, for provisioning Flyway databases when running integration or regression tests in parallel.

In a team, every developer works with the same environment, avoiding the irritating problems that come from differences in local setups. Each data container is isolated, allowing developers to work on different projects or versions at the same time, without conflicts.

Data containers are very quick to create, reset or destroy. This allows the developer to spin up new lightweight instances for testing or development without long setup times or relying on someone else to do the provisioning.

Suddenly, it becomes easy to rollback, in development, or to flit between different database versions to determine when a bug started happening and what suspect changes were made in that database version.

There is, I think, a certain culture shock in changing development habits to take advantage of working with data containers, but the freedoms they provide are compelling in a team development. For me, most of all, it removes hassle and distractions, and allows me to focus on development work.

Where Next?

In my next article, called (at least in my head), Automating Cloning with Flyway: Oh! Da Clone, we’ll create four data images, all for the same database but on four different RDBMSs: SQL Server, PostgreSQL, Oracle and MySQL. From each image, we’ll create a data container (clone) as a working database instance. We’ll then use Flyway, in a single automated process, to connect to each one in turn and run a single script, in the appropriate dialect, to migrate each database to the required version.

Tools in this post

Redgate Clone

Provision virtualized clones of databases in seconds

Find out more

Redgate Test Data Manager

Reliable and secure test data provisioning

Find out more