How to back up and restore in PostgreSQL: Learning PostgreSQL with Grant

This is part of an ongoing series of posts by Grant as he takes his vast knowledge of SQL Server and applies it to learning PostgreSQL, sharing it all with you so that you can learn from his triumphs and mistakes. For more, you can go to the Learning PostgreSQL with Grant series home page.

One of the first things I worry about when I’m looking at a new system is the status of the backups. Nothing I’ve seen in PostgreSQL leads me to a different conclusion. However, I didn’t realize just how big a can of worms I would open by pursuing backups and restores.

Instead of talking immediately about backups, let’s talk about restores.

Restore strategy

There’s a saying that, as far as I know, originated with Kimberly Tripp: Your backups are only as good as your last restore. In short, you have to take backups, but what’s important is the ability to restore the data. You can have an infinite number of backups, but if you can’t restore even one of them, that’s nothing but wasted space. So, before you start thinking about backups, you should be thinking about how you want to restore your systems.

Restoring a database is an enormous topic. This article could get into all sorts of permutations about all the different methods for restoring a database. However, to keep the size of this article to something reasonable, and because I haven’t learned everything I need to just yet, I’m going to focus on one restore mechanism: restore from a backup. It’s also possible to restore to a point-in-time, but I’m saving the details of how to do that for a future article.

A simple restore from a backup is meant to get the database back to the state it was in right at the point that a backup was taken. This kind of restore operation is acceptable for development and test systems. It’s an easy way to quickly migrate a database to a new server. However, it is limited in that you can only restore to the point of the backup. If data or structural changes occurred after the backup that you’d like to recover, you just can’t do it.

This is where the point-in-time restore comes into play. Generally, most organizations frown on data loss. For example, if you take a backup once a day and a problem occurs about 12 hours into the day, you’re going to lose all that data. A point-in-time recovery means you can get the last backup plus all the changes to the database since that backup. Point-in-time recoveries are accomplished through the write-ahead log (WAL) maintained by PostgreSQL.

The WAL is automatically created and maintained by your servers. You can configure the size and number of files in the WAL. In theory, an infinite number of WAL files can be played back to arrive at a point in time. However, that brings up another aspect of a restore strategy: recovery time.

A recovery strategy must be built to define how you would like to restore and how long you want it to take. For development systems, a quick database backup is usually enough. Production systems typically need point-in-time restores, so you probably want a database backup plus managed WAL files.

With the requirements defined, you can then walk over and pick up the right backup tool.

Database backup

If simple database restores are all you’re interested in, then the best tool for the job is pg_dump. The core syntax is simple:
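
A minimal sketch, using postgrelearning (the example database name used later in this article) and redirecting the output to a file:

    pg_dump postgrelearning > postgrelearning.sql

pg_dump writes to stdout by default, so you either redirect the output, as here, or use the -f option to name a file directly.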

However, because I’m running my databases either in containers or out in the cloud, this simple syntax won’t work: the service is running under a different login than the one I’m using to connect remotely. Instead, I can modify the command slightly:
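
A sketch, assuming the service runs under the default postgres login and the server answers on localhost (substitute your own user and host):

    pg_dump -U postgres -h localhost postgrelearning > postgrelearning.sql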

This command will work because I’m supplying the user that runs the service.

When running PostgreSQL in Docker, you must first open a shell in the container, like this, where PostgreHome is the name of the container:
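
A sketch using docker exec; the -u postgres flag is an assumption here, so the shell runs as the service account and the plain pg_dump syntax works inside the container:

    docker exec -it -u postgres PostgreHome bash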

Then running the original syntax just works, and the backup file will be written to the directory you ran the command from, unless you supply a path.

When running the command from the external command line, supplying the user and pointing it at the container, it’s going to demand a password. Once supplied, though, again, it just works.

You can exercise more control by setting various options in pg_dump, and I’ll discuss several of these a little later. However, one option to consider immediately, especially when thinking about how you intend to restore the database, is the output format. By default, pg_dump produces a plain SQL file. It can also produce a compressed binary file by adding -Fc, which selects the custom format:
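
For example (the .dump extension is just a convention I’m assuming here):

    pg_dump -Fc -f postgrelearning.dump postgrelearning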

Using either of these two methods, you can back up your database. Now let’s talk about running a restore.

Database recovery

The output from the default pg_dump command is interesting. However, it gets very long, very quickly. I have a very simple database created from other scripts in this series so far. Here’s a small part of the output from pg_dump in the default format, which you can see by using cat:
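
The exact objects depend on your database; the radios table below is a hypothetical stand-in for whatever yours contains, but the shape of the file is representative:

    --
    -- PostgreSQL database dump
    --

    SET statement_timeout = 0;
    SET client_encoding = 'UTF8';
    SET standard_conforming_strings = on;

    CREATE TABLE public.radios (
        radio_id integer NOT NULL,
        radio_name character varying(100) NOT NULL
    );

    COPY public.radios (radio_id, radio_name) FROM stdin;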

It keeps going from there. In short, pg_dump scripts out the database, both structure and data, into a file. It’s an SQL file that you can then run from psql or any other tool, and you’ll get the objects and data created using the SQL from the pg_dump process. Since it includes the data, this file could be extremely large, depending on the database.

Recovery, then, is just a question of replaying the scripts; in other words, executing the SQL. You could open the file in the GUI of your choice, or you could take advantage of the psql command line utility like this:
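
Assuming the backup went to postgrelearning.sql, as in the earlier sketches:

    psql -d postgrelearning -f postgrelearning.sql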

Of course, you would have to ensure that you have an empty database; otherwise, this will attempt to rebuild objects and reinsert data that’s already there. You can drop the existing database, create a new one, and then run the scripts, or you can drop all the objects from the database and then run the scripts.

If you chose the custom file format, you won’t have a simple set of SQL scripts, so replaying them requires a new command: pg_restore.

The base command to use pg_restore is just as simple as pg_dump:
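
Against the custom-format file from the earlier sketch:

    pg_restore -d postgrelearning postgrelearning.dump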

The same rules apply. The database must be either a new database or one you’ve cleaned, unless you take advantage of the additional functionality I mentioned when introducing the custom file format. For example, you could restore to a database that still has all its objects inside. You can choose to drop the existing objects first by modifying the command as follows:
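
Reusing the assumed file name from earlier:

    pg_restore -c -d postgrelearning postgrelearning.dump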

The -c, or --clean, option will drop objects first; however, you could run into error messages if a new object was added after the backup was taken. You can also tell pg_restore to create the database first:
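
Here you connect to an existing database to issue the CREATE; I’m assuming the default postgres database as the landing point:

    pg_restore -C -d postgres postgrelearning.dump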

The -C, or --create, option will try to create a postgrelearning database. If one exists, you’ll get an error, so you can again modify the pg_restore command:
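
Combining the two options:

    pg_restore -c -C -d postgres postgrelearning.dump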

Running both --clean and --create will result in pg_restore dropping and recreating your database. Note that you will initially connect to a different database with the -d option and that the restore will use the database name from the dump file.

While running the SQL scripts of a basic pg_dump command is simple, it doesn’t give you much control. You’ll find quite a few more options if you use the custom file format and pg_restore. Using pg_restore you can (see the sketch after this list):

  • Restore just the data and not the schema
  • Restore only the schema and no data
  • Output the contents of the dump file as a list of objects
  • Define a specific list of objects to restore
  • Disable triggers
  • Use multiple concurrent jobs to make the restore run faster
  • Pick and choose individual objects to restore
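
A few of these as a sketch; the radios table is again a hypothetical stand-in for one of your own:

    # List the contents of the dump file
    pg_restore -l postgrelearning.dump

    # Restore only the data, using four concurrent jobs
    pg_restore --data-only -j 4 -d postgrelearning postgrelearning.dump

    # Restore a single table
    pg_restore -t radios -d postgrelearning postgrelearning.dump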

There are a number of other behaviors you can control in pg_restore that don’t apply to a getting started article. However, to summarize, you get a lot more control over exactly what gets restored using pg_restore and the custom dump format instead of the standard SQL.

pg_dump options

I already introduced one option available to pg_dump, the custom file format. Several additional options enable you to control what, and how, things get backed up.

The options are supplied on the command line, as previously shown. You can use abbreviations like -a, which means only dump the data for a given database, or you can use the full option, --data-only. Some options stand alone, and some require you to supply additional information.

For example, you can specify that the dump is ready to clean the database as part of the restore:
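
That’s the -c, or --clean, option again, this time on the dump side:

    pg_dump -c postgrelearning > postgrelearning.sql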

Now the generated script will include DROP commands when you run it.

You can use include or exclude patterns to move only some of the objects in a database. For example, to exclude schemas whose names start with the string dev, you could run this:
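
The -N, or --exclude-schema, option takes a pattern (quoted so the shell doesn’t expand the wildcard):

    pg_dump -N 'dev*' postgrelearning > postgrelearning.sql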

You also have options, sketched after this list, that make it possible to:

  • Back up just the schema, without data
  • Back up only the data, without a schema definition included
  • Define multiple concurrent jobs, so the backup occurs faster
  • Add in conditional commands, IF EXISTS, to clean up

And a whole lot more.
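
As a sketch, those map to options like these:

    # Schema only, no data
    pg_dump -s postgrelearning > schema.sql

    # Data only, no schema
    pg_dump -a postgrelearning > data.sql

    # Directory format with four parallel jobs (parallel dumps require -Fd)
    pg_dump -Fd -j 4 -f postgrelearning_dir postgrelearning

    # Emit DROP ... IF EXISTS in the cleanup commands (requires -c)
    pg_dump -c --if-exists postgrelearning > postgrelearning.sql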

How to back up all the databases

It’s fairly likely that you have more than one database. You can script pg_dump so that you run it over and over, specifying which database to back up and where the output should go. If you have a very customized backup process where each database is unique, this is probably the way to go. However, you also have the option to use pg_dumpall. Instead of one database at a time, it exports all the databases on a given server to a single file. The pg_dumpall command is basically just a wrapper around pg_dump. However, since all databases are being dumped, pg_dumpall will also gather the cluster-wide objects, such as roles and tablespaces, that aren’t stored in any single database.
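
A sketch, again assuming the postgres login:

    pg_dumpall -U postgres -f all_databases.sql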

The problem with pg_dumpall is that it only outputs the standard SQL format; you can’t get the compressed custom format. That also means you don’t get the functionality of pg_restore, and, as before, you can only restore to empty databases.

Other backup mechanisms

As with almost any major RDBMS, there are third-party backup solutions out there for PostgreSQL. Also, if you’re running AWS RDS, AWS Aurora, Azure Database for PostgreSQL, or GCP Cloud SQL for PostgreSQL, they all come with built-in backup and recovery mechanisms. You can also back up and restore from pgAdmin 4 and restore from Azure Data Studio. Plus, there are multiple scripted solutions available through the PostgreSQL community. I would strongly recommend learning how to use the native tools to back up and restore your databases first. Once you’ve built that skill set, moving to a different solution is fine.

Conclusion

The ability to restore a PostgreSQL database is critical. Getting a quick backup is quite easy with the pg_dump command, and it’s best suited for development. This command creates a script file by default that you can replay. There are several other options, including -Fc to create a backup in the custom format that can be used with the pg_restore command. The pg_restore command provides several options of its own, including the ability to drop and recreate the database. If you need to back up all the databases, use the pg_dumpall command; however, it’s not as flexible.

Production databases usually need to be able to recover to a point in time, encrypt the backup, and meet other requirements, so other options, such as WAL-based backups or a third-party tool, must be used. I’ve purposely saved the point-in-time restore discussion for another article because it is quite complex compared to pg_dump.

About the author

Grant Fritchey

Grant Fritchey is a Data Platform MVP with over 30 years' experience in IT, including time spent in support and development. He has worked with SQL Server since version 6.0, back in 1995. He has also developed in VB, VB.NET, C#, and Java. Grant has written books for Apress and Simple-Talk. Grant presents at conferences and user groups, large and small, all over the world. Grant volunteers for PASS and is on the Board of Directors as the Immediate Past President. He joined Redgate Software as a product advocate in January 2011.
