Overlapping Ranges in Subsets in PostgreSQL

Comments 0

Share to social media

In Part I of this series, we explored overlap queries and the problems that accompany them, along with ways to improve their performance. We concluded by defining an overlap query in terms of timestamp ranges, which allowed us to use a new index type – known as GiST – that drastically sped up these types of queries. In Part II, we’ll boost overlap query performance even further, but first let’s cover some other features of range types.

A Range For All Occasions

Our Part I query used the following WHERE clause:

The “tsrange()” functions return timestamp ranges. But overlap queries aren’t limited to timestamps; they can be constructed from integers and floating-point values too. Imagine an arbitrage database that tracks the minimum and maximum price paid for a commodity. If you wish to find all customers who’ve paid between $1.35 and $1.65 for pork bellies, you would construct a numeric range using the “numrange()” function; the query WHERE clause would be:

For integer and bigint ranges, there are the functions int4range() and int8range(). Date-only chronological intervals (without time), can be constructed with daterange(), and there’s even a tstzrange() function, which takes into account timezone on the timestamp values involves. (But remember: like database columns defined using “with time zone”, the actual time zones aren’t stored, just a flag indicating that the timestamps should be converted to local time on the client.)

Preventing Overlapping Table Rows

A common need for overlap queries is to prevent rows containing overlapping time periods from being inserted into a table, similar to a UNIQUE restriction. For instance, a conference room booking list that doesn’t allow simultaneous reservations, or a podcast schedule that allows only one episode playing at any given time. Business logic like this can be done within the database using a trigger or stored procedure. But using GiST, there’s a simpler way.

To illustrate, consider a basic table for upcoming podcast episodes, holding the episode identifier, and two timestamps, indicating the planned start and stop times. To prevent any two rows from overlapping in time, use the “exclude” operator as shown below:

We can verify this by trying to insert two rows that conflict with each other:

Since episode 2 starts before episode 1 ends, the second insert fails with an error that starts like this:

ERROR: conflicting key value violates exclusion constraint …

Like a UNIQUE constraint, EXCLUDE creates the underlying index to support the constraint, meaning you don’t need to follow the table definition with a separate index creation statement.

Note: an actual business scenario would likely require additional criteria to determine an overlap error. A hotel reservation system would need the room number, for instance, or our podcast example above might support multiple channels, each broadcasting simultaneously. This can be done in an EXCLUDE statement, but it requires adding B-tree support to GiST, as described below.

Modifying Range Semantics

In Part I, we noted that standard SQL semantics define ranges as “half-open” — the start point of a range is inclusive (included in the range), but not the end point is exclusive (not included.) (Joe Celko noted in this article the idea of half open high, and half open low as well). This method of storing ranges simplifies storage, but tends to complicate overlap calculations: an interval that ends at 5pm will not overlap with one that begins precisely at that same time (though it will overlap with one that begins even a millisecond sooner). All range construction functions (tsrange(), daterange(), etc.) follow these same semantics.

You can override this behavior easily. The constructor functions accept a third parameter: a two-character string indicating the status of each endpoint: a square bracket indicates inclusion; a parenthesis denotes exclusion. The following chart lists the possibilities:

Parameter

Behavior

‘()’

exclude both endpoints

‘[]’

include both endpoints

‘[)’

include start point; exclude end. (default)

‘(]’

exclude start point; include end.

To illustrate, let’s return to our podcast episodes example. In it, we created a “no overlap” constraint using the default “half open” semantics. This allows episodes that begin at exactly the same time as a previous episode ends. For instance, the following two rows don’t overlap, even though the end of one episode matches the start of the other:

But when creating the constraint, if we instruct the tsrange() function to include both endpoints like this:

The second INSERT would fail due to the constraint.

Boosting Performance: a Different Shade of GiST

Now let’s return to our main point: boosting performance. The GiST index type is a good choice for general-purpose overlap queries. But often, time-series data is distributed fairly evenly across a range of dates, without a great deal of overlapping values. A good example is our sample data in Part I: it’s randomly spread across a 20-year interval, and any given row overlaps with only a small fraction of the other rows in the table. For data like this, a space-partitioned GiST index can yield better performance. PostgreSQL also supports this type of index – called “SP-GiST”.

The statement to create an SP-GiST index is nearly the same as regular GiST. Using our “patrons” example table from Part I, we created a normal GiST index like this:

For SP-GiST, we change the index type to “spgist” and specify an operator class:

Here, the built-in “range_ops” operator (unsurprisingly) means we want to perform range operations on ou r index.

How much of a performance increase can we expect from our space-partitioned version of GiST? On our Part I test data, replacing the old GiST index with SP-GiST reduced the query time from 5.2s down to 3.7s – a healthy 30% boost. In a best-case scenario (say a table in which overlapping rows were disallowed), you might expect as much as a 50% increase. But on datasets with a high degree of overlap, SPGiST can perform worse. Test against your own data!

The SP-GiST index type has some drawbacks. For one, it doesn’t support multi-column indexes, meaning the next technique we describe below can’t be used with them.

Adding B-Tree support to GiST

Usually, an overlap query has criteria beyond the ranges themselves. A restaurant reservation doesn’t overlap with another unless they’re for the same table; a hospital can schedule multiple visits at the same time-period if the patients are different, etc.

Returning to our sample code in Part I which used an example of patrons and security outages at one museum: what if the data included multiple sites? Now an “overlap” only occurs if the patron and outage occurred at the same location. How does this change things?

Let’s recreate our tables, this time adding a location_id foreign key:

Our query for overlapping values now has a join condition, rather than a CROSS JOIN:

Since this is randomly generated data, you may get back more or less rows, so your performance may vary.

If you run this query using the same GiST index we created in Part I, performance will be from bad to abysmal, depending on how many different location values exist. Let’s assume there are 20 possible locations and see if we can again improve performance.

Our natural instinct for a two-condition query like this is a multi-column index, one for each condition. And, since the timestamp range is much more specific than the location_id (in technical terms, its cardinality is higher), it should be the first column indexed. That leads us to try this command:

Our instincts are good, but on a default configuration of PostgreSQL, this statement displays an error.

ERROR: data type integer has no default operator class for access method "gist"

As powerful as they are, GiST indexes don’t support simple operations like direct comparisons of strings, dates, or (in this case) integers. Luckily, PostgreSQL ships with an extension that adds B-Tree type operations to GiST. All we need to do is install it:

Now we can create the index above without error, and a similar index on outages:

This not only restores the performance in the query from Part I that lacked location_id:

But betters it slightly (4.9s on my test system, again with randomly generated data, as you can see in the full script to populate the test table, create indexes, and perform the query in Appendix I, below.

Note: Our location_id is a low-cardinality value (only 20 possible values, because I used TRUNC(RANDOM()*20) to generate them).

If there were many more locations (e.g. 100,000+), the index might likely be better structured with location_id as the first column; but in that case, performance would likely be no better than a default B-Tree index, as that’s the scenario these indexes are designed for.

Again: test with your own data!

In Part III of this series, we’ll explore uses of GiST beyond overlap-type queries.

Appendix I : Sample SQL Script

Here are some useful scripts for working with the example code in the article:

For doing some of the examples, particularly those where timing is important, you can use this script to generate data.

The following will create the indexes mentioned in this article. They will require the BTREE_GIST extension to execute.

Load comments

About the author

Lee is a 20 year industry veteran who has worked with IBM, AT&T, and many other firms. He holds a degree in physics, and more than 30 software-related patents. He currently works as a project consultant, specializing in database design and optimization. He can be reached at leeasher@myyahoo.com.