Using Common Table Expressions: Transforming and Analyzing Data in PostgreSQL, Part 2

This article is part of a three-part series on transforming and analyzing data in PostgreSQL.

In the first article in this series, I discussed how powerful PostgreSQL can be in ingesting and transforming data for analysis. Over the last few decades, this was traditionally done with a methodology called Extract-Transform-Load (ETL), which usually requires external tools. The goal of ETL is to do the transformation work outside of the database and only import the final form of data that is needed for further analysis and reporting.

However, as databases have improved and matured, they have gained the capabilities to do much of the raw data transformation inside the database. In doing so, we flip the process just slightly so that we Extract-Load-Transform (ELT), focusing on getting the raw data into the database and transforming it internally. In many circumstances this can dramatically speed up development iterations because we can use SQL rather than external tools.

While ELT won’t be able to replace every transformation workload, understanding how to do the work can help improve many data transformation and analysis workloads.

To demonstrate how SQL and PostgreSQL functions can be used to transform raw data directly in the database, in the first article I used sample data from the Advent of Code 2023, Day 7. By the end of the first article, I had demonstrated how to take the sample input and transform it into a usable table of data that could be queried and analyzed. If you haven’t read that article, it’s best to start there because you’ll be able to load the sample data, understand the puzzle we are trying to solve, and learn some of the unique PostgreSQL features that improve the process.

To get set up so that you can follow along, this simple script will create the ‘dec07’ table we need and insert a few rows of sample data. In the first article, I demonstrated two ways to do this that are more practical when dealing with raw input. This is just intended to get you started quickly.
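A minimal sketch of what such a setup script could look like. The column names `id` and `lines` are assumptions made for illustration, and the five rows are the sample hands from the public Day 7 puzzle description.

```sql
-- Assumed schema: one row per raw input line (column names are hypothetical).
CREATE TABLE dec07 (
    id    integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    lines text NOT NULL
);

-- The five sample hands, each followed by its bid.
INSERT INTO dec07 (lines)
VALUES ('32T3K 765'),
       ('T55J5 684'),
       ('KK677 28'),
       ('KTJJT 220'),
       ('QQQJA 483');
```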

By the end of the first article, we had this query.

Executing it returns the following:

Finally, throughout that first article, all the examples that relied on multiple transformations had to use a derived table because we hadn’t discussed Common Table Expressions (CTEs) and their usefulness as a tool in SQL to break up steps in the query and analysis process.

In fact, the next part of solving the Day 7 puzzle involves converting the face value of each card into a point value. Without using a CTE, I can think of at least three ways to take the table above and add a new column with the point value of each card in the hand.

  • Joining multiple derived tables
  • Joining to a static data table
  • Creating a function that takes the card face and returns a point value

Each of these methods is either more complicated to write than necessary or requires maintaining additional, separate code and data.

For example, if we tried to do this work with multiple derived tables, it might look something like this:
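A sketch of that shape, not the exact query from the first article. It assumes each raw line is a hand and a bid separated by a space, inlines the sample rows so it runs on its own, and uses `string_to_table` (PostgreSQL 14 and later) to split a hand into cards. The names `card_pos` and `points` are illustrative.

```sql
SELECT h.id, h.hand, h.bid, h.card, h.card_pos, v.points
FROM (
        -- Derived table 1: split each line into hand and bid, one row per card.
        SELECT d.id,
               split_part(d.lines, ' ', 1)      AS hand,
               split_part(d.lines, ' ', 2)::int AS bid,
               c.card,
               c.card_pos
        FROM (VALUES (1, '32T3K 765'), (2, 'T55J5 684'), (3, 'KK677 28'),
                     (4, 'KTJJT 220'), (5, 'QQQJA 483')) AS d (id, lines)
        CROSS JOIN LATERAL string_to_table(split_part(d.lines, ' ', 1), NULL)
             WITH ORDINALITY AS c (card, card_pos)
     ) AS h
JOIN (
        -- Derived table 2: card face to point value.
        VALUES ('2', 2), ('3', 3), ('4', 4), ('5', 5), ('6', 6), ('7', 7),
               ('8', 8), ('9', 9), ('T', 10), ('J', 11), ('Q', 12),
               ('K', 13), ('A', 14)
     ) AS v (card, points) ON v.card = h.card
ORDER BY h.id, h.card_pos;
```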

Executing this code returns the following truncated dataset:

This does perform the next transformation we need, but the query is already becoming difficult to read, and its components are nearly impossible to reuse.

Common Table Expressions to the Rescue

Let me start by acknowledging a trap that I often fall into. Any time that I learn a new feature or skill in SQL (or any technology), I can quickly start to overuse it. For instance, when I first learned about the array datatype in PostgreSQL, I started seeing every transformation and analysis problem as an opportunity to utilize arrays. As the saying goes, “When all you have is a hammer, everything looks like a nail.”

Sometimes, CTEs can become a hammer in search of something to do, even if some other tool would work better. While using them often improves the readability of a long, complicated query, they can degrade performance because of how each database platform implements them or because storing data in temporary tables, for instance, might have been a better solution.

That said, I find CTEs incredibly helpful when iterating over a data analysis problem without committing to a schema that’s bound to change through the process. And they help me see how each step will transform the data along the way.

CTEs: The Basics

The idea of a CTE is straightforward. Wrap the output of a query in a named object (essentially a temporary VIEW) that can be referenced as a table later in the query, either inside additional CTEs or the final statement. A query that uses CTEs always has at least one named object and a final statement (e.g., SELECT, INSERT, UPDATE, DELETE) that can reference it. Also, the CTE list always starts with the keyword WITH.

A very simple example might look like the following:
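A minimal sketch with made-up sensor rows; the column names `sensor_id` and `location` are assumptions chosen for illustration.

```sql
-- One CTE named sensors, followed by the final statement that reads from it.
WITH sensors AS (
    SELECT 1 AS sensor_id, 'garage' AS location
    UNION ALL
    SELECT 2, 'kitchen'
)
SELECT sensor_id, location
FROM sensors
ORDER BY sensor_id;
```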

This returns:

As you can see, the CTE is named sensors and then we can select from this named object later in the query. In this case, we reference it in a simple SELECT statement.

What’s more, you can have multiple CTEs that can be referenced anywhere further down the query. When you have multiple CTEs, they are separated by a comma, except for the last CTE before the actual query. For example, we could have two CTEs, one that returns static sensor data and a second that returns readings, and then join them in the final query.
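A sketch of the two-CTE pattern, again with made-up rows. The `readings` columns (`reading_time`, `temperature`) are hypothetical.

```sql
WITH sensors AS (
    SELECT 1 AS sensor_id, 'garage' AS location
    UNION ALL
    SELECT 2, 'kitchen'
),  -- a comma separates the CTEs; there is none after the last one
readings AS (
    SELECT 1 AS sensor_id, TIMESTAMP '2024-01-01 08:00' AS reading_time, 12.5 AS temperature
    UNION ALL
    SELECT 2, TIMESTAMP '2024-01-01 08:00', 20.1
    UNION ALL
    SELECT 1, TIMESTAMP '2024-01-01 09:00', 13.0
)
-- The final query joins both named objects as if they were tables.
SELECT s.location, r.reading_time, r.temperature
FROM readings AS r
JOIN sensors AS s ON s.sensor_id = r.sensor_id
ORDER BY r.reading_time, s.location;
```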

Executing this will return:

Remember, once declared, any CTE can be referenced as a table in any query that follows, either inside a following CTE or the final query.

With those basics explained, let’s move back to the query at the beginning of this article that took the sample Day 7 puzzle input and converted card faces into a numerical value. This time, we will wrap each query in a CTE, which has a name, and allows us to write a clean final SELECT statement.
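A sketch of the CTE version, under the same assumptions as before: hand and bid are separated by a space, and `string_to_table` (PostgreSQL 14 and later) splits the hand into cards. The first CTE is only a stand-in for the ‘dec07’ table so the block runs on its own. The CTE names `hands` and `card_values` are illustrative.

```sql
WITH dec07 AS (  -- stand-in for the real dec07 table
    SELECT *
    FROM (VALUES (1, '32T3K 765'), (2, 'T55J5 684'), (3, 'KK677 28'),
                 (4, 'KTJJT 220'), (5, 'QQQJA 483')) AS t (id, lines)
),
hands AS (       -- one row per card in each hand
    SELECT d.id,
           split_part(d.lines, ' ', 1)      AS hand,
           split_part(d.lines, ' ', 2)::int AS bid,
           c.card,
           c.card_pos
    FROM dec07 AS d
    CROSS JOIN LATERAL string_to_table(split_part(d.lines, ' ', 1), NULL)
         WITH ORDINALITY AS c (card, card_pos)
),
card_values AS ( -- card face to point value
    SELECT *
    FROM (VALUES ('2', 2), ('3', 3), ('4', 4), ('5', 5), ('6', 6), ('7', 7),
                 ('8', 8), ('9', 9), ('T', 10), ('J', 11), ('Q', 12),
                 ('K', 13), ('A', 14)) AS t (card, points)
)
SELECT h.id, h.hand, h.bid, h.card, h.card_pos, v.points
FROM hands AS h
JOIN card_values AS v ON v.card = h.card
ORDER BY h.id, h.card_pos;
```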

Which returns the following truncated results:

Easy enough! There is one more change we can make to improve readability and maintenance of the CTE.

When a CTE is defined we can preset the column name aliases outside of the query, rather than creating aliases in the query itself. This is particularly helpful when transforming data because it saves the effort of aliasing each column name or function output inside the query every time.

CTE column aliases are defined inside parentheses between the name and the ‘AS’ keyword.

Hopefully this demonstrates the main reason many people use CTEs: readability.

Let me say one more time that readable doesn’t mean performant. SQL Server, for instance, doesn’t materialize (cache) the query result of a CTE. Instead, the SQL from the CTE is inlined into the query that references it. In PostgreSQL, CTEs can be inlined (the default starting with PostgreSQL 12 for a non-recursive, side-effect-free CTE that is referenced only once) or materialized for easy and fast reusability. (The planner makes this choice unless you force it with the MATERIALIZED or NOT MATERIALIZED keywords. The PostgreSQL documentation on WITH queries covers this in more detail.)

In many situations writing a complex query as a CTE can be very helpful and probably relatively performant. If you’re doing a lot of querying across large datasets, just consider whether there’s an alternative approach once the CTE query doesn’t perform as it used to.
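A short sketch of the alias list, using the same illustrative names and inlined sample rows as the earlier examples. The columns are named once, after the CTE name, so the inner queries need no `AS` aliases.

```sql
WITH dec07 (id, lines) AS (  -- stand-in for the real dec07 table
    VALUES (1, '32T3K 765'), (2, 'T55J5 684'), (3, 'KK677 28'),
           (4, 'KTJJT 220'), (5, 'QQQJA 483')
),
hand_bids (id, hand, bid) AS (
    -- No aliases needed here; the names come from the list above.
    SELECT id,
           split_part(lines, ' ', 1),
           split_part(lines, ' ', 2)::int
    FROM dec07
)
SELECT id, hand, bid
FROM hand_bids
ORDER BY id;
```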
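To make the PostgreSQL behavior concrete, here is a sketch of the two hints available since PostgreSQL 12. The `readings` data is generated purely for illustration.

```sql
-- MATERIALIZED: compute the CTE once and reuse the stored result.
WITH readings AS MATERIALIZED (
    SELECT n AS reading_id, (n % 3) + 1 AS sensor_id
    FROM generate_series(1, 9) AS g (n)
)
SELECT a.sensor_id, count(*) AS pair_count
FROM readings AS a
JOIN readings AS b ON b.sensor_id = a.sensor_id
GROUP BY a.sensor_id
ORDER BY a.sensor_id;

-- NOT MATERIALIZED: ask the planner to fold the CTE into the outer query,
-- so the outer WHERE clause can be applied while the CTE is evaluated.
WITH readings AS NOT MATERIALIZED (
    SELECT n AS reading_id, (n % 3) + 1 AS sensor_id
    FROM generate_series(1, 9) AS g (n)
)
SELECT reading_id
FROM readings
WHERE sensor_id = 2
ORDER BY reading_id;
```

Comparing the `EXPLAIN` output of each form on your own data is the most reliable way to see which one the planner handles better.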

Easier Debugging

The second reason I personally like to use CTEs when I’m working on a data transformation is that I can always insert a SELECT statement after any CTE to see the output. Oftentimes as I’m building a query through multiple transformations, something doesn’t work, and I have to remind myself what the output of the source CTE is that I’m referencing later in the query. It’s also a great way to teach others how to build a query over multiple steps.

The next step in solving the puzzle from Day 7 is to identify how many cards of the same face exist in each hand. The first hand in our sample data was ‘32T3K’, which has two ‘3’ cards and one each of ‘2’, ‘T’, and ‘K’. These values will be used later to order the final output based on card values and groupings, high to low.

To accomplish this, we’ll add a third CTE called ‘card_counts’.
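One way this step could be sketched, keeping the illustrative names from the earlier examples. Here the count is a window function, so every card row carries the number of cards in its hand that share its face. That is a design choice for this sketch, not necessarily the original approach.

```sql
WITH dec07 (id, lines) AS (  -- stand-in for the real dec07 table
    VALUES (1, '32T3K 765'), (2, 'T55J5 684'), (3, 'KK677 28'),
           (4, 'KTJJT 220'), (5, 'QQQJA 483')
),
hands (id, bid, card, card_pos) AS (
    SELECT d.id, split_part(d.lines, ' ', 2)::int, c.card, c.card_pos
    FROM dec07 AS d
    CROSS JOIN LATERAL string_to_table(split_part(d.lines, ' ', 1), NULL)
         WITH ORDINALITY AS c (card, card_pos)
),
card_values (card, points) AS (
    VALUES ('2', 2), ('3', 3), ('4', 4), ('5', 5), ('6', 6), ('7', 7),
           ('8', 8), ('9', 9), ('T', 10), ('J', 11), ('Q', 12),
           ('K', 13), ('A', 14)
),
card_counts (id, bid, card, card_pos, points, same_face) AS (
    SELECT h.id, h.bid, h.card, h.card_pos, v.points,
           -- how many cards in this hand have the same face as this card
           count(*) OVER (PARTITION BY h.id, h.card)
    FROM hands AS h
    JOIN card_values AS v ON v.card = h.card
)
SELECT * FROM card_counts
ORDER BY id, card_pos;
```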

This will return the following truncated results:

Notice that the final SELECT statement only references the new ‘card_counts’ CTE. That’s because the ‘card_counts’ CTE already references the other two CTEs in its query.

Next, we need to aggregate the card counts with the card values to determine the order of the hands, lowest to highest. Notice, however, that I simply comment out the `SELECT * FROM card_counts;` statement and build the next query after that.
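A sketch of that aggregation, under the same assumptions as the earlier examples. It collects two arrays per hand: the same-face counts sorted high to low, and the card points in their original order. PostgreSQL compares arrays element by element, so sorting by the first array orders the hand types and the second breaks ties. The name `hand_strength` is illustrative.

```sql
WITH dec07 (id, lines) AS (  -- stand-in for the real dec07 table
    VALUES (1, '32T3K 765'), (2, 'T55J5 684'), (3, 'KK677 28'),
           (4, 'KTJJT 220'), (5, 'QQQJA 483')
),
hands (id, bid, card, card_pos) AS (
    SELECT d.id, split_part(d.lines, ' ', 2)::int, c.card, c.card_pos
    FROM dec07 AS d
    CROSS JOIN LATERAL string_to_table(split_part(d.lines, ' ', 1), NULL)
         WITH ORDINALITY AS c (card, card_pos)
),
card_values (card, points) AS (
    VALUES ('2', 2), ('3', 3), ('4', 4), ('5', 5), ('6', 6), ('7', 7),
           ('8', 8), ('9', 9), ('T', 10), ('J', 11), ('Q', 12),
           ('K', 13), ('A', 14)
),
card_counts (id, bid, card_pos, points, same_face) AS (
    SELECT h.id, h.bid, h.card_pos, v.points,
           count(*) OVER (PARTITION BY h.id, h.card)
    FROM hands AS h
    JOIN card_values AS v ON v.card = h.card
)
-- SELECT * FROM card_counts;
, hand_strength (id, bid, groupings, card_points) AS (
    SELECT id, bid,
           array_agg(same_face ORDER BY same_face DESC),  -- e.g. {2,2,1,1,1} for one pair
           array_agg(points ORDER BY card_pos)            -- tie-breaker, in card order
    FROM card_counts
    GROUP BY id, bid
)
SELECT * FROM hand_strength
ORDER BY groupings, card_points;
```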

This returns:

The final step to solving this puzzle is to number the order of rows, multiply the row number by the bid amount, and then sum all those values.
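A sketch of the complete query, continuing with the same illustrative names. `row_number()` assigns each hand its rank from weakest to strongest, and the final SELECT sums rank times bid.

```sql
WITH dec07 (id, lines) AS (  -- stand-in for the real dec07 table
    VALUES (1, '32T3K 765'), (2, 'T55J5 684'), (3, 'KK677 28'),
           (4, 'KTJJT 220'), (5, 'QQQJA 483')
),
hands (id, bid, card, card_pos) AS (
    SELECT d.id, split_part(d.lines, ' ', 2)::int, c.card, c.card_pos
    FROM dec07 AS d
    CROSS JOIN LATERAL string_to_table(split_part(d.lines, ' ', 1), NULL)
         WITH ORDINALITY AS c (card, card_pos)
),
card_values (card, points) AS (
    VALUES ('2', 2), ('3', 3), ('4', 4), ('5', 5), ('6', 6), ('7', 7),
           ('8', 8), ('9', 9), ('T', 10), ('J', 11), ('Q', 12),
           ('K', 13), ('A', 14)
),
card_counts (id, bid, card_pos, points, same_face) AS (
    SELECT h.id, h.bid, h.card_pos, v.points,
           count(*) OVER (PARTITION BY h.id, h.card)
    FROM hands AS h
    JOIN card_values AS v ON v.card = h.card
)
-- SELECT * FROM card_counts;
, hand_strength (id, bid, groupings, card_points) AS (
    SELECT id, bid,
           array_agg(same_face ORDER BY same_face DESC),
           array_agg(points ORDER BY card_pos)
    FROM card_counts
    GROUP BY id, bid
)
-- SELECT * FROM hand_strength ORDER BY groupings, card_points;
, ranked (id, bid, hand_rank) AS (
    SELECT id, bid,
           row_number() OVER (ORDER BY groupings, card_points)  -- 1 = weakest hand
    FROM hand_strength
)
SELECT sum(hand_rank * bid) AS total_winnings
FROM ranked;
```

With the five sample hands, the ranks work out to 765×1 + 220×2 + 28×3 + 684×4 + 483×5, so `total_winnings` is 6440, which matches the answer given for the sample in the puzzle description.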

And the output is just the one value!

Again, notice that I have added SELECT statements after each CTE as I work. Most of the time they’re commented out. However, when something about the output along the way doesn’t seem correct, I can insert (or uncomment) a `SELECT * FROM…` after any CTE to check the results at each step.

Being able to do this kind of incremental debugging in a static query with many derived tables would be almost impossible. CTEs simplify the process.

Conclusion

In this article, I have shown you the basics of how CTEs work and how they make coding easier by letting you format data in one query without storing data step by step.

In the final entry in this series, I will show you how to expand the use of CTEs to allow recursive queries. This allows you to work with data that needs to be iterated over, for example, hierarchies.