Transforming and Analyzing Data in PostgreSQL

This article is part of a three-part series on transforming and analyzing data in PostgreSQL.

In our data-hungry world, knowing how to effectively load and transform data from various sources is a highly valued skill. Over the last couple of years, I’ve learned how the data manipulation functions in PostgreSQL can supercharge your data transformation and analysis process, using just PostgreSQL and SQL.

For the last couple of decades, “Extract Transform Load” (ETL) has been the primary method for manipulating and analyzing data. In most cases, ETL relies on an external toolset to acquire different forms of data, slice and dice it into a form suitable for relational databases, and then insert the results into your database of choice. Once it’s in a destination table with a relational schema, querying and analyzing it is much easier.

There are advantages to using a dedicated transformation tool in your workflow. Typically, a small team learns the software and handles all this work, so licensing is clear, and the best tools often allow reusable components that make it easier to build transformation pipelines. However, there are also drawbacks. Most notably, your ability to manipulate the data is tied to an extra tool, and often one that only the small team knows how to use.

What if you could flip the process around a bit and load the data into PostgreSQL before you transform it, using the power of SQL?

We call this the “Extract Load Transform” (ELT) method. And PostgreSQL happens to be very well suited for doing some pretty complex transformations if you know a few basic principles.

In this series, we’re going to discuss:

  • PostgreSQL functions that can slice and dice raw data to make it easier to query
  • How to use CROSS JOIN LATERAL to unlock the full power of those functions
  • Using CTEs to build complex queries with functions and CROSS JOIN LATERAL

I hope by the end of this series you’ll appreciate the capabilities of PostgreSQL and some of the advantages of doing data transformation within the database. And although I won’t talk much about the challenges of doing ELT in the database, there are a few things to keep in mind.

First, if you import the data into PostgreSQL before transforming it, you’ll need additional storage. Typically, an external tool crunches all of the information and only loads the fully transformed data. You’ll need to plan accordingly for the impact on storage. That said, there are many PostgreSQL extensions that can read text files, file formats like Parquet, other databases, S3 buckets, and more as external tables. This could be one way to limit the amount of extra storage you will need.

Second, there could be a problem similar to using external tools – a lack of developers who know how to develop and maintain the scripts. However, the money saved on licensing an external tool could free up budget to increase the skills and capabilities of the team.

Depending on your situation, there may be other issues you have to consider, but for now, let’s start digging into the fun of transforming data with PostgreSQL.

Preparing Data for Analysis

To demonstrate the techniques covered over the next few articles, I’ll be using data from the Advent of Code 2022 and 2023. This is a yearly coding challenge that runs from December 1-25. With each puzzle, a small sample of test data is provided to verify any solution before processing the full puzzle input. There are typically two methods I use to make the data available in a format that can be queried using SQL.

When I first begin to process data from a text or CSV file, my goal is to make some sample data available in a form that I can build my larger query on top of. This allows me to try different approaches to transforming the data without committing to a final schema for the output. If I first need to design and create staging tables and tables to store the transformed data, I lose some ability to iterate on ideas quickly.

One other thing I should mention. In almost every case that I’ve done this kind of data transformation inside the database, I make sure to add an autoincrementing ID column that keeps the order of the lines as they were inserted. More often than not, it will end up being very helpful to know the order, or at least have a simple unique key.

To demonstrate the two methods, I’ll be using the sample data from Day 7 of the Advent of Code 2023. The puzzle input is lines of characters in a text file, separated by a space. The first part of each line represents a hand of five cards. The second part is a bid value for that hand to be used later in the calculation to solve the puzzle.
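
Here is the five-line sample input for that puzzle, each line holding a hand of cards followed by its bid:

```
32T3K 765
T55J5 684
KK677 28
KTJJT 220
QQQJA 483
```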

Although this might look like five rows of data from a table, it’s just lines from a text file. To query the data with SQL, it first must be made available in a table format of some kind.

Depending on the data and my needs, these are typically the two methods I utilize for turning this sample data into table output that I can query. While there may be other methods, either of these allows me to get to the SQL part very quickly rather than worrying about transforming the data outside of the database first, as you would with traditional ETL.

Note: For the sake of focusing only on the query components, we will assume the format of the data is correct so that our queries will not have to handle the vast number of issues that can arise when dealing with data from a text file. We will assume that the process that creates the files has guaranteed the proper format (something you should only trust if your organization has control over that software!).

Method 1: Query Output as Source Table

I usually copy a few lines into a Common Table Expression (CTE) and then create a table out of that data that I can query. Since we haven’t discussed CTEs yet, I’ll demonstrate this method with a derived table with the use of a subquery. As we move through the next few articles, the goal will be to simplify each iteration of the query as we explore more tools in the PostgreSQL and SQL toolbelt. I want the output of this query to mimic whatever the staging table might look like. In many cases, that might simply be an autoincrementing ID column and a column to store each line from the text file.

To do this, I typically turn to the output of a table function to act as the source table. In PostgreSQL, we can use functions like regexp_split_to_table() or string_to_table() (PostgreSQL 14+) along with the multiline dollar quoting that PostgreSQL supports.

Copying our sample text above, we can place it inside either function and split on the newline character. To add the ID column, we might use something like the ROW_NUMBER() function with an OVER() clause. In later examples, we’ll show a simpler way to return the incremental position without using the ROW_NUMBER() window function.
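
Here is a sketch of that query; the id and line aliases are my own, and the text inside the dollar quotes is the Day 7 sample data:

```sql
SELECT row_number() OVER () AS id,
       line
FROM regexp_split_to_table(
$$32T3K 765
T55J5 684
KK677 28
KTJJT 220
QQQJA 483$$, '\n') AS line;
```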

Executing this code returns five rows: an id column numbered 1 through 5 and a line column holding each line of the sample text.

This output is treated as a table that I can begin to build my larger transformation query on. With this method, it’s easy to copy different sample data into the subselect (or the CTE we’ll show in the next article) without having to truncate a table and reimport new data every time. Additionally, if you later decide it would be better to do some processing of the input data as it’s saved to the staging table you will eventually create for the final process, this first query can be modified quickly. Otherwise, the staging table would need to be altered (or dropped and recreated), which slows the iteration process.

Method 2: COPY the File Contents to a Staging Table

Another approach is to insert the lines of the file into a table from the start. Even if you don’t know what the final staging table might look like, you can just as easily start with a minimum viable schema to get your transformation process started. This has the advantage of beginning your query iterations using a staging table similar to what will be used with the final set of real data. In PostgreSQL, the fastest way to insert lines from a text or CSV file into a table is usually the COPY command.
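
A minimal staging table might look like the following. The table name dec07 is used throughout the rest of this article; the single text column named lines and the autoincrementing id are my own choices, in line with the earlier advice:

```sql
CREATE TABLE dec07 (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    lines text NOT NULL
);
```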

Then, from psql or another tool of your choice, insert the text file into that simple table.
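
For example, using psql’s client-side \copy (the file name is just a placeholder for wherever you saved the sample input):

```sql
-- each line of the file lands in the lines column; id is generated automatically
\copy dec07 (lines) FROM 'day07_sample.txt'
```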

Alternative if you want to follow along

If you want to follow along but aren’t comfortable with COPY or the derived table methods, I’ve provided a simple multi-value insert statement below that you can use. This is only intended to quickly insert this sample data into the ‘dec07’ table. You would not normally transform a text file into multi-valued statements in real ELT situations. Use one of the other methods.

After creating the ‘dec07’ table above, run:
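
Something like this, assuming the lines column from the table definition above:

```sql
INSERT INTO dec07 (lines) VALUES
    ('32T3K 765'),
    ('T55J5 684'),
    ('KK677 28'),
    ('KTJJT 220'),
    ('QQQJA 483');
```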

Regardless of which method you choose, we can now start to query the basic text data using SQL to begin the transformation process.

Table Functions

Method 1 in the previous section referred to table functions, specifically regexp_split_to_table(). Table functions, sometimes referred to as Set Returning Functions (SRF), produce a set of rows and can be used either as an output column in the SELECT clause of a query or as a table, view, or subquery in the FROM clause. Many of these functions can be used to transform data with regex patterns, pivot data into tables, or extract data to arrays. When used as a source table in the FROM clause, there are a few other helpful features unique to PostgreSQL that we’ll discuss later in this article.

Now that we have our sample data being returned as a simple table of data, let’s look at two more sets of functions that can help us transform the pieces further as we work towards a solution: split_part() and the regexp_* (regular expression) functions.

Recall that our data has two different pieces of information on each row that we need to extract: the hand of five cards and the bid amount for that hand.

The split_part() function can be used to split a string on a given character and return an item from a specific position in the resulting data. Using this function twice, we can get the cards from the first position and the bid from the second. Note that most objects in PostgreSQL that can be referenced by position are one-based (1) rather than zero-based (0) like many other programming languages.

As a reminder, each line holds a hand of playing cards in a 5-character string plus a bid for the card game, and we will extract both pieces from that original record format.
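
A sketch of that query against the dec07 staging table (the cards and bid aliases are my own):

```sql
SELECT split_part(lines, ' ', 1) AS cards,
       split_part(lines, ' ', 2) AS bid
FROM dec07;
```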

This returns the cards and the bid as two separate text columns, one row per line of input.

Extracting this data into a table is straightforward. It won’t be the most efficient approach on a large string or many rows of data, but it lets us iterate and move on to the next step of transformation quickly.

An alternative approach is to use one of the regexp functions to extract parts of the string using regular expressions. The regexp_match() function returns the extracted parts as an array, a datatype not often found in other database systems.
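
For example, regexp_match() with two capture groups, one for the cards and one for the bid, returns both pieces as a text array (the pattern shown is just one way to write it):

```sql
SELECT regexp_match(lines, '^(.+) (.+)$') AS split_line
FROM dec07;
```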

This returns a single column containing a two-element text array for each row, with the cards in the first element and the bid in the second.

Since the output from this regexp function is an array, further use of the data would require us to reference each item individually. Therefore, to get the same output as the previous example using split_part(), where the cards and bid had individual columns, we could use this regexp_match query as the source in a derived table (and eventually a CTE once we talk more about them in the next article).
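
A sketch of that derived-table version, indexing into the array (remember that PostgreSQL arrays are one-based):

```sql
SELECT split_line[1] AS cards,
       split_line[2] AS bid
FROM (
    SELECT regexp_match(lines, '^(.+) (.+)$') AS split_line
    FROM dec07
) AS matches;
```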

This returns the same two columns, cards and bid, as the split_part() version.

A slightly more complicated query, but the result is the same. There are in fact many other table functions that could be used to extract the data you need from the input. I won’t cover them here, but the PostgreSQL documentation is a great place to read up on text and JSON specific functions which can be used to extract data. The more you transform data, the more you’ll learn about the various PostgreSQL functions that can help you perform simple and complex transformations.

Transforming the Table Data

Now that the input text data has been transformed into table form with the respective information extracted to columns, we can continue to process the data further as we work toward calculating the eventual solution.

Figuring out the final calculation for this puzzle requires knowing the specific position of the cards in each hand, from left to right. One common way of doing this is to convert the string of text into rows of data, adding the ordinal position of each card to the output.

Once again, some PostgreSQL functions can help us do that.

Using the string_to_table() function, we can take any string input and split it into rows, kind of like a pivot. In this case, we need to split the hand of cards character by character, so we tell the string_to_table() function to split on NULL. If we wanted to split on spaces, hyphens, or another string, simply replace the NULL with the character string that is required for your use case.
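
Here is a sketch that splits each hand into one row per card while carrying the hand and bid along, using the set-returning function directly in the SELECT list:

```sql
SELECT split_part(lines, ' ', 1) AS hand,
       split_part(lines, ' ', 2) AS bid,
       string_to_table(split_part(lines, ' ', 1), NULL) AS card
FROM dec07;
```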

This results in one row per card: five rows for each hand, each repeating the hand and bid alongside a single card.

This gets us part of the way there, but we still need to get the position of each card within each hand. One option would be to make this query the source of a derived table and use ROW_NUMBER() OVER().
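
A sketch of that approach, partitioning by the id column so the numbering restarts for each hand. Note that without an ORDER BY, ROW_NUMBER() relies on the rows arriving in the order the cards were split, which PostgreSQL does not strictly guarantee:

```sql
SELECT id, hand, bid, card,
       row_number() OVER (PARTITION BY id) AS position
FROM (
    SELECT id,
           split_part(lines, ' ', 1) AS hand,
           split_part(lines, ' ', 2) AS bid,
           string_to_table(split_part(lines, ' ', 1), NULL) AS card
    FROM dec07
) AS cards;
```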

This returns the same rows with the id and a new position column added, numbering the cards 1 through 5 within each hand.

With a derived table and a window function, we’re well on our way to having data we can analyze further.

The array version of the source data requires a little more work to break apart the cards in each hand and get their position. That’s because the ROW_NUMBER() window function has to be used on the final table in order to partition by hand. The simplest way to do that is by adding yet another derived table. It achieves the same result, but it’s certainly not as nice to look at.
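
A sketch of that nested version, starting from the regexp_match() array:

```sql
SELECT id,
       split_line[1] AS hand,
       split_line[2] AS bid,
       card,
       row_number() OVER (PARTITION BY id) AS position
FROM (
    SELECT id, split_line,
           string_to_table(split_line[1], NULL) AS card
    FROM (
        SELECT id, regexp_match(lines, '^(.+) (.+)$') AS split_line
        FROM dec07
    ) AS matches
) AS cards;
```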

The output is the same as before: one row per card with its hand, bid, and position.

While both of these queries “work”, I don’t like how complicated they are becoming just to add the position of the card within each hand. Fortunately, the table functions in PostgreSQL can help us even further, ultimately simplifying the SQL.

For that, we first need to learn about CROSS JOIN LATERAL.

CROSS JOIN LATERAL

If you’re familiar with SQL, the concept of a CROSS JOIN is probably familiar to you. When you CROSS JOIN two (or more) tables, the result is the Cartesian product of the tables: every row from the left table is matched with every row from the right table. We can see a simple example of this using the generate_series() table function to CROSS JOIN two generated tables.
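
For example (gs1 and gs2 are simply the aliases I’ve given the two generated sets):

```sql
SELECT gs1, gs2
FROM generate_series(1, 3) AS gs1
CROSS JOIN generate_series(1, 2) AS gs2;
```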

You can see in the output that each value in gs1 is matched with every row created in gs2, producing six rows in total.

This SQL can be simplified a little bit more, however. Anywhere you see CROSS JOIN (LATERAL), it can be replaced with a comma to get the same result.
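
The same query written with a comma in place of the CROSS JOIN keywords:

```sql
SELECT gs1, gs2
FROM generate_series(1, 3) AS gs1,
     generate_series(1, 2) AS gs2;
```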

This query will produce the exact same data as above.

Although everyone has their preferences, this is how I typically write queries when I want to CROSS JOIN tables. It’s succinct and uses fewer keystrokes. If you want to always provide clarity for others, keep using the explicit CROSS JOIN keywords.

But there’s more! In PostgreSQL, that comma between two tables actually represents a CROSS JOIN LATERAL, and we can use this to our advantage when processing data.

A JOIN LATERAL in PostgreSQL (otherwise known as a LATERAL subquery) means that any table or function on the right can reference result data, row by row, from any table on the left. When used in conjunction with table functions, we can perform data transformation for each row and return the results as a table to be queried directly in the SELECT clause rather than as part of a derived table.

To demonstrate, let’s go back to the first query that we used as the source of a derived table, which allowed us to use the ROW_NUMBER() window function to get the position of each card.

We’re going to modify this query just slightly so that we can simplify the final query even more in the next step.
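
Here is a sketch of the modified query, moving string_to_table() into a CROSS JOIN LATERAL (again assuming the dec07 table and my column names from earlier):

```sql
SELECT id,
       split_part(lines, ' ', 2) AS bid,
       card
FROM dec07
CROSS JOIN LATERAL string_to_table(split_part(lines, ' ', 1), NULL) AS card;
```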

The truncated output is one row per card, repeating the id and bid for each card in the hand.

Because the string_to_table() function is on the right side of the CROSS JOIN LATERAL, it can reference the ‘lines’ column from the ‘dec07’ table on the left and return the rows of each card for the SELECT clause. At first it might not seem like much of a win. After all, we lost the ordinal position number of each card in the hand.

But as you probably guessed, PostgreSQL has another trick up its sleeve to help get that value back more efficiently.

WITH ORDINALITY

When a table function is used as part of the FROM clause, the WITH ORDINALITY clause can be added to return the ordinal position of each row that the function returns. This means we don’t need to use the ROW_NUMBER() window function later in the query. In many cases this can be a major win because the window function requires that the entire result set be iterated a second time to generate the row number column with the proper PARTITION BY and ORDER BY.

Rather than the first version of this query, we can remove the derived table and add WITH ORDINALITY after the string_to_table() function. When we give the output of the table function an alias, there are now two columns to account for, the result of the function and the ordinal value – `WITH ORDINALITY t(card, position)`.
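
A sketch of that version, with the same hypothetical dec07 table and aliases as before:

```sql
SELECT id,
       split_part(lines, ' ', 2) AS bid,
       t.card,
       t.position
FROM dec07
CROSS JOIN LATERAL string_to_table(split_part(lines, ' ', 1), NULL)
     WITH ORDINALITY t(card, position);
```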

Going one step further, we can also “hide” the other split_part() function by making it the source of another CROSS JOIN LATERAL. This function can also reference the output of any table to its left – in this case, the same ‘lines’ column from the ‘dec07’ table. Can you see how the SELECT clause has been cleaned up by using the table alias?
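
Here is a sketch of that final form, with both functions moved into the FROM clause:

```sql
SELECT id, bid, card, position
FROM dec07
CROSS JOIN LATERAL split_part(lines, ' ', 2) AS bid
CROSS JOIN LATERAL string_to_table(split_part(lines, ' ', 1), NULL)
     WITH ORDINALITY t(card, position);
```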

This returns one row per card, with its position, 1 through 5, within each hand.

Looking at the result, you can see that using WITH ORDINALITY after any table function has a similar effect to using the ROW_NUMBER() window function. In this case the string_to_table() function emits one row per card, per hand in the string, each with the ordinal position for that hand. Once the next row from the data table is processed, the ordinal position resets because it’s a new execution of the table function.

Summary

In this first part of a three-part series, I have started down the path of extracting structured data from raw text. While this type of operation is not supposed to be common in a normalized database, it is a valuable skill to have for importing data from flat files.

We covered several functions used to extract data from a string, such as split_part(), string_to_table(), and regexp_split_to_table(). We also covered the concept of a LATERAL JOIN, which can be used to apply values from one set to another in a join, with the second set most often being a table-valued function.

In the next parts of this three-part series, I will extend these concepts using common table expressions to transform the data in more interesting ways.