Finding overlapping ranges of data

This week, I had a problem where I needed to find and eliminate from the results of my query, data with overlapping ranges. I have written about this topic before, in my database design book book, in regards to building a trigger to avoid overlapping ranges. But even though I have written on the topic there, I still use Google just like you to get quick help (I use books when I want to learn, or expand my knowledge on a topic in depth, blogs when I need a simple answer to a simple or complex question.)

The problem at hand is most often associated with date based data, such as effective dates for a row in a temporal/type 2 dimension table, or other cases like appointment times, etc. But the algorithm is the same with numbers and is a bit easier to read since we don’t have the same issues with roundoff and decimal places (the query is complex enough on its own to show in a blog post). From a progression of start and end values in each row, we are going to look at how to check to make sure that there are no two rows that are in conflict (no range should contain another ranges value at all).

In the following sample set, I included a group value, to make the example more typical, as usually you need to do this will tens of thousands or rows, looking for overlaps in subsets of that data. For example, consider a doctor’s office appointment system. You have offices, doctors, and assistants to assign to be in a location for particular times. You do not want to allow (or at least want to know about), where any of these items to have overlapping work scheduled. (The problem that prompted me to write this blog was a similar problem and used the same algorithm with ~50 million rows, taking just minutes on a very moderately spec’d SQL Server. I will hit a bit on performance a the end of the blog, after establishing the algorithm).

For the example, we will use the following data set:

GroupValue   Start  End
------------ ------ ------
X            1      2
X            3      4
X            5      6
X            8      10
X            9      15
X            11     14
X            16     16
Y            9      20
Y            22     24

There are two ways to implement a range of data, either with an endpoint that is in the range (inclusive), or one that is out of it (exclusive). Take the first two rows, they mean that the first range is from 1-2, inclusively, and 3-4.

GroupValue   Start  End
------------ ------ ------

X            1      2
X            3      4

For integers this is simple. But if you were doing numeric(10,3), the endpoint would need to be 2.999 and 4.999. Dates are more weird, particularly if you are still needing to deal with the datatime datatype, where the fractional time range goes from .997 to .000. 

This type of implementation is what will naturally occur if you are working with “real” data. Like if you want to know the range of values for a given group, you might say:

How you define the rest of the query is important to the problem you are solving, but the point here is that the max_value will end up inclusive by its nature, and adding a value to it to make it just outside of the range often makes the data more confusing to work with as you won’t be able to intuitively join to the row(s) that match the max_value anymore.

The other way to go is with an end value that is the smallest value that is out of the range. For our example, 1 to 2 would be represented as 1 and 3:

GroupValue   Start  End
------------ ------ ------

X            1      3
X            3      5

So in queries, instead of using BETWEEN, you would use >= Start and < End. This is a LOT easier to implement when you are coding a system to add a new row and expire the previous since you don’t have to deal with any precision of the datatype used, Just copy the expiring row’s end value to be the start of the new row. 

To demonstrate, I will start with the following table, and data, which is the data in the original table, with an identity primary key added, and unique constraint on the GroupValue, StartValue, and EndValue to avoid one very common overlapping scenario.

The algorithm for finding overlapping ranges is a matter of looking to see if one of two conditions exists in your data. We will define the conditions here in terms of our sample set:

1. Is a StartValue in one row in a Group is located between the StartValue and EndValue of another, in the same Group
2. Same as 1, but see if the EndValue of one row is between the StartValue and EndValue

If one range completely engulfs another; such as a row with values 10,20 would be engulfed by 1,100; then both of these criteria will be met. The query will look like this, with a fairly messy looking CASE expression to allow you to see what the difference is between the rows. Note that I don’t use BETWEEN in the query, as you could, because when you need to convert to the non-inclusive type of implementation, the only real difference is in the magnitude comparison operators and so you can simply change a few characters and it works (which we will do a bit later on).  Where the two conditions are implemented are tagged with /* # */.

For our data, it returns the following:

GroupValueRangeId GroupValue StartValue  EndValue    Conflict
----------------- ---------- ----------- ----------- --------
4                 X          8           10          -------> ...
5                 X          9           15          -------> ...
5                 X          9           15          -------> ...

GroupValueRangeId GroupValue StartValue  EndValue Issue
----------------- ---------- ----------- -------- ---------------
5                 X          9           15       Start Between
4                 X          8           10       End Between
6                 X          11          14       Engulfed

You can see in the set that when only one condition is met, the rows will show up twice, since we joined the table to itself. When both conditions are met, the row will only show up once, because the start and end values will be between the other copy of the data’s start and end value, but not vice versa.

Looking at the rows in output, you can see for the first row, that for X,9,15, the 9 is in the range of X,8,10, so it is returned. Naturally, the opposite is true, in that the endpoint of X,8,10 is between the start and end of X,9,15. This query is good enough if your goal is to determine if there is an error in your data. For example if you are trying to write a trigger to prevent overlaps, the query can simply be put in an IF EXISTS (the query) THROW construct. If the query returns data, there is an issue. But if you want a list of all of the rows that have issues, the duplication of data can be confusing. So we can change the query to return one more row by looking for the inverse of the engulfed criteria:

Now the output is all of the rows that have an issue with overlapping values, without the data that they are in conflict with. 

GroupValueRangeId GroupValue StartValue  EndValue    Issue
----------------- ---------- ----------- ----------- ---------------
4                 X          8           10          Start Between
5                 X          9           15          End Between
5                 X          9           15          Engulfs
6                 X          11          14          Engulfed

This sort of output would be useful in cases where you just want to not use any of the bad rows, but do want to use all of the rows that are okay. So you could use this query (probably without the issue column,) in a CTE, and in our query just exclude these rows using a NOT IN expression (if you have a surrogate key, it would likely fit into the output to make that exclusion easier.

The problem of overlapping rows when you are dealing with inclusive ranges is not a complex one, but it is tricky. I never can quite remember the basic algorithm right off the top of my head, and adding the output as to where the issues is messy.  With exclusive ranges, the main difference is that we have to change a few mathematical comparison operators, but it takes a bit of thought. Taking our example table, I will add an ExclusiveExclusiveEndValue column as a computed column by just adding 1 to the value, since for integers, the next value that is not in range is as simple as that:

Taking a look at the data:

You can see it is the same data, with just an extra column:

GroupValue StartValue  EndValue ExclusiveEndValue
---------- ----------- -------- ---------------------------
X          1           2        3
X          3           4        5
X          5           6        7
X          8           10       11
X          9           15       16
X          11          14       15
X          16          16       17
Y          9           20       21
Y          22          24       25

For the StartValue test, we need to do >= for the comparison set’s StartValue, but < for their EndValue. For the EndValue test (or in the code, ExclusiveEndValue), we do the opposite > the StartValue, and <= the EndValue; since we are comparing in two different directions. So the code changes to:

Not significant changes at all, but certainly enough to trip you up (I got it wrong several times when I was putting this together, for sure.)  Now if I (and hopefully you, the reader) am faced with this kind of issues, I can simply take the query and cut and paste the names of the table and columns from the table and use the code directly.

As an example that is fairly common with data warehousing folks, a Type 2 dimension will have a time range that the data is valid. So I will adapt this range query to the Employee Dimension in the WideWorldImportersDW database, which you can find here. This will be something that, if you have ever worked with Type 2, slowly changing dimensions, you will likely have need to do to make sure your table is correct.

You can see the similarities to the previous examples immediately, only in a table that has the dreaded “names with spaces”, so you will see a lot more square brackets than I usually like to include:

This returns the following rows, which show that the end points are inclusive. These Valid To and From values are used to determine which row to apply to a fact table at a given time period. (note, decimal points removed for clarity) :

Now I just replace the corresponding table and column names, and get the following query:

This returns:

You can check to see if this data is actually conflicting by running the following code:

This returns 2 rows, which should never occur in a dimension, but I am assuming this could be in there for some sort of demo. Either way, it was fortuitous for this example that the version of WideWorldImportersDW I downloaded had this issue.

Lastly, one minor point about performance. The algorithm’s performance is largely dependent on the number rows per grouping value. Take our original set of data:

GroupValue   Start  End
------------ ------ ------
X            1      2
X            3      4
X            5      6
X            8      10
X            9      15
X            11     14
X            16     16
Y            9      20
Y            22     24

When you do the query, you are joining all of the rows on the group value, which effectively leaves the cross product of the rows to be processed in a non-equi-join criteria. If you have a tremendous number of values per group, the process may be very slow. For example, I tried to adapt the query to the Warehouse.ColdRoomTemperatures_Archive table in the WideWorldImporters database (which you can get here). This table is the temporal archive table for the Warehouse.ColdRoomTemperatures table, which has 4 rows in it. The archive as over 3 million rows (which I didn’t notice when I started the query on Sunday night on my Surface, and found out when I stopped the query on Monday morning. I also didn’t check for indexes, which it also didn’t have.) 

It will behoove you to have  indexes on the GroupValue, the primary key (usually in the clustered index), and include the start and end times, particularly if this is a wide table.

Summary

Overlapping ranges are not something you need to deal with very often, but when you do, the way they are validated is something I frequently don’t remember. Just remember that you are looking for the start or end of one row in a group to be inside the start and end of another row, and the process is relatively easy.