5 Strategies to Refactor SQL Code

Comments 0

Share to social media

Code refactoring is a common process when developing in procedural languages – and essential to developing high-quality code – yet somehow often gets overlooked in SQL. In this article, we’ll explain what refactoring is, how it helps, and give concrete examples on how it can make your code more readable, reliable, and maintainable.

What is Refactoring?

Over time the term “refactoring” has expanded and is sometimes used to mean code quality improvement in general, but here we are using it with its original meaning: condensing and eliminating redundant segments of code. Like factoring a number in math, we break the code into smaller blocks, identify any repeated elements, then replace them with a single reference. The graphic below illustrates the basic concept of what refactoring tries to accomplish:

A graphic illustrating the basic concept of what refactoring tries to accomplish.

Refactoring can be considered a case of the DRY (Don’t Repeat Yourself) principle of software development. Redundant code is harder to read, more prone to bugs, and significantly more expensive to maintain.

Usually, refactoring doesn’t help performance; it just improves the code quality.  But on shorter queries where parsing is a substantial portion of the total execution time, refactoring can improve performance too.

We’ll start with some tips on refactoring single queries, then move on to strategies for refactoring across your entire code base. We’ll demonstrate using short examples, but real-world queries can benefit even more. One of the most extreme cases I encountered, refactoring reduced a 1,200 line query down to just 90 lines – without, of course, altering the results at all.

1. Use a Derived Table

Even novice SQL developers are usually adept at using subqueries in a WHERE clause or as a column in a SELECT. But a subquery that returns a table-valued object can also be the target of the FROM clause, meaning we can use the result just as we would a base table. Such derived tables are ideal for creating intermediate values for further processing. The basic syntax looks like this:

Let’s turn these words into an actual example. Imagine a company that awards a pay modifier of 1% for every month an employee has worked beyond their first twelve months. Our task is a query that calculates for each employee the value of this modifier, along with modified values for their base and bonus pay. Note: we’re using MS SQL Server date functions. The syntax for PostgreSQL or MySQL is slightly different.

Our employee table contains the following columns:

The query to calculate the pay values is straightforward, though very redundant – we have to repeat the date arithmetic three times:

But if we move that date arithmetic into a subquery, the redundancy vanishes:

Note: a version of this with PostgreSQL-format date arithmetic is in Appendix I.

Remember that a derived table is a table – we can join other tables to it directly, just as if it was a base table. So we can use this technique to simplify much larger, more complex queries than this example.

2. Lateral Join

A lateral join is similar to a correlated subquery: both are evaluated once for every row in the main query. But where a subquery can only return a single value, a lateral join can return an entire row of values, or even multiple rows. This makes it an obvious choice for refactoring queries that contain multiple identical or similar subqueries. Again, let’s illustrate this with a concrete example.

Using our employees table from above, let’s write a query to pair each employee with the name and date of the first company “team building” event occurring after they were hired. We have a new table of team events:

Matching the right team event to each employee is easy enough: find event dates greater than the employee hire date, order by event_date, then use TOP 1 (or LIMIT 1, for MySQL/PostgreSQL) to match a single row. The problem is we can’t use a normal join because of that pesky one-row limit. So you might be tempted to use subqueries:

If you had to select three or more columns in such a manner, you can see this quickly becomes a mess. But a lateral join handily removes the redundancy:

(Note: SQL Server uses the ‘APPLY’ keyword for lateral joins; see Appendix I)

A Left Join is used to ensure all employees are listed, even if no matching event exists. Notice that our join condition is just the “true” flag. This is because we encapsulate the join condition within the lateral subquery.

3. Common Table Expression (CTE)

The word “common” in ‘Common Table Expression’ is a hint on how useful these can be for removing redundancy in queries. Our first two tips are limited on where the refactored code can be used. But CTEs are much more flexible: define it at the start of your query and use it anywhere you like, as often as you like. That makes CTEs the ‘Swiss army knife’ of SQL code refactoring.

To illustrate, let’s expand our last example of team building events. These events hold a competition that matches each employee to everyone else present, and we want a query to reflect this. To generate this “round robin” result, we use the query from above to match employees to events, then, using the same query, self-join the result table to itself. The resulting query is confusing and hard to read – the redundant blocks are highlighted to identify them:

Using a CTE, we can condense the two redundant blocks into one, making the code shorter and much easier to understand:

Here we name our CTE “emp_event” to indicate its joining employees to events. While we use the CTE twice here, a CTE can be used three, four, or more times in a single query. Even if you use it just once, a CTE can still be useful. It doesn’t shorten the query, but separating out a block of code and naming it to indicate purpose can help organize a large query.

4. Defining a View

All our examples so far have involved a single query. Let’s look more globally: refactoring across an entire database. Often a code pattern – especially those involving table joins – appears throughout a system over and over again. One of the easiest ways to eliminate this redundancy is by defining a view.

For our example, imagine a database tracking security alerts. The alert description is based on its type, and comes from one lookup table, while the alert priority is based on its source, and comes from another lookup table. So, whenever we access an alert, we must join the same three tables in the same manner:

The details of this query aren’t terribly important. The point is that it’s a block of code that’s likely to be repeated throughout an application, whenever an alert is accessed. On production systems, it’s common to see such join patterns – by themselves or as blocks in larger queries – repeated dozens or even hundreds of times.

Defining a view allows us to remove the redundancy, centralizing the business logic within the view, and allowing us to treat these three tables as if they were a single unit. This is as simple as prefacing the code in question with a CREATE VIEW statement:

We can of course query this view directly. But we can also treat it as a table, including it as part of a larger query, for example:

or:

If our business logic involving alert definition changes, we update one view, rather than performing edits scattered across multiple queries.

5. Generated Columns

Generated columns (known as computed columns in SQL Server) allow us to define a repetitive calculation inside a table definition, to be done automatically whenever any query or application accesses the table. This not only eliminates redundancy, but it centralizes business logic at the DDL (data definition layer) – preventing any query or application from overriding or altering it.

In our employee table, let’s add a column with a code to indicate employee type: ‘P’ for permanent, ‘S’ for temporary seasonal help, and ‘I’ for interns. To determine how many days per year of paid time off they’re eligible for, the following rule is used:

  • Permanent employees receive 10 days.
  • Interns receive 0 days.
  • Seasonal help hired in December receive 1 day, otherwise they receive 3.

Calculating this value can be done with a single CASE statement, but a large system might use that value in dozens of forms, screens and queries across multiple applications. Repeating the code to calculate it is wasteful and prone to error and, if the code ever changes, updating can be a nightmare. A generated (computed) column is a better approach.

In SQL Server, the code is straightforward (only our two new columns are included here):

In MySQL, the code is similar, except it requires us to explicitly state the column’s data type:

In both cases, we’ve defined the column as virtual: it’s calculated on the fly when we query its value. Both SQL Server and MySQL have a second option, which physically stores the calculated value in the table itself.  One benefit of storing the generated column is that it allows you to index on it. When a query needs to order by or quickly search for the results of a calculation, this is a very handy feature.

Note: At the time of writing, PostgreSQL didn’t support virtual generated columns – the only option was store the value – but as of PostgreSQL 18 (released in September), this functionality was added.

Final Thoughts & Next Steps

Like any other programming language, SQL benefits from the compact, readable, and well-structured code that refactoring creates. These five techniques aren’t the only ways to get there, but they’re all useful and should be part of your daily toolkit.

Please scroll to the bottom of the article for the full sample code.

Frequently Asked Questions (FAQ)

What is SQL code refactoring?

Refactoring SQL code means improving the structure, readability and maintainability of SQL queries and database objects without altering their external behavior.

Why refactor SQL code?

Over time, SQL scripts can accumulate technical debt – making them harder to read, brittle to schema changes and expensive to maintain. Refactoring improves clarity and lowers risk.

When should you refactor SQL code?

Consider refactoring when you see messy formatting, SELECT * wildcards, dynamic SQL constructs, duplicated logic, or frequent schema changes breaking queries.

What are key strategies for refactoring SQL code?

  • Apply consistent formatting and naming conventions.
  • Replace wildcards (SELECT *) with explicit column lists.
  • Replace dynamic filters and dynamic ORDER BY with static SQL when possible.
  • Decompose complex logic into variables or simpler statements.
  • Rename stored procedures, functions and objects to meaningful names.

What risks should be managed during SQL refactoring?

Changing code structure may affect performance, break dependencies, or introduce bugs if tests aren’t in place. Ensure you have good regression tests.

How should you approach refactoring in practice?

Start small, refactor incrementally, test thoroughly before and after each change, and use tools where available. Keep old and new code supported when transitioning.

What benefits result from well-executed SQL refactoring?

You’ll get cleaner, more maintainable code, better resilience to schema changes, faster onboarding for new developers, and lower long-term maintenance costs.

Appendix I: Sample Code

Part 1: Create tables and load test data

Part 2: Examples 1-3 in PostgreSQL format

Example 1, unfactored

Example 1, factored

Example 2, unfactored

Example 2, factored

Example 3, unfactored

Example 3, factored

Part 3: Examples in SQL Server format

Example 1, unfactored

Example 1, factored

Example 2, unfactored

Example 2, factored

Example 3, unfactored

Example 3, factored

Article tags

Load comments

About the author

Lee is a 20-year industry veteran who has worked with IBM, AT&T, and many other firms. He holds a degree in physics, and more than 30 software-related patents. He currently works as a project consultant, specializing in database design and optimization. He can be reached at leeasher@myyahoo.com.