Calculating Gaps Between Overlapping Time Intervals in SQL

There are a number of real-life reporting tasks in SQL that require a 'gaps and islands' analysis. There are a number of techniques around that work, but finding ones that scale well makes for a tougher, but interesting, challenge.

When building a web site that requires login access it is normal practice, and even a statutory requirement in some places, that you keep a record of user logins.  This log can be invaluable for high-usage web sites in order to identify periods when the web site is infrequently used, so that a suitable time for maintenance such as operating system patching can be scheduled.

Since you have no control over when authorized users log into your web sites, you’ll always have periods when more than one are logged in at the same time.  Login and logout times will be staggered, and so the intervals when users are logged in will be overlapping.  Today we’ll ultimately be showing you how to find the gaps of inactivity, but to do so we will also show you a neat way to identify those periods where there is activity.

Although this may seem like a slightly contrived problem, there are plenty of times that you are faced with the sort of reporting where special SQL techniques are required. For example:

  • If you have a fleet of vehicles where events represent departure and return, you may wish to ask during what periods are at least one of my vehicles out and when are they all not being used?
  • In a manufacturing plant with multiple assembly lines where events are line starts and line stops, you can find out during what periods at least one line is in use or the periods when all lines are idle.
  • By adding a bit of PARTITIONing to these queries, you can handle the case of a hotel that has multiple types of rooms.  Then you can ask the questions, during what times is at least one room of a type occupied or when are all rooms of a particular type unoccupied.

New features in SQL 2012 have been developed to make short work of some of these gaps problems, so we will need SQL 2012 to run many of the examples, specifically those that relate to data clean-up.  These solutions are not limited to login/logout intervals; actually any defined interval between specific events will do.

The accompanying resource file: Sample Queries (2012).sql contains all of the sample queries shown in the following sections.  All SQL resource files are appended with either (2008) or (2012) to indicate which version of SQL they can be run in.

Problem Setup and Explanation

First we’ll create a table with some basic set-up data that represents three user logins.  The login periods will overlap and contain a gap to illustrate the essentials of what we are trying to do.

The raw sample data displayed by the final SELECT are as follows:

We can see that logins ([Event]=0) and logouts ([Event]=1) are evenly matched, meaning the logout time was recorded for each login.  Let’s take a look graphically at these four events.

1942-clip_image002.jpg

Each login ID is represented by a different color.  You can see that there are overlaps between the grouping of logins on the afternoon of 01 Sep and the single login by UserID=1 on 02 Sep.  From the data, we ultimately seek to expose the gap between the logout of UserID=3 (blue) and the second login of UserID=1 (2013-09-01 19:10 to 2013-09-02 11:05).

Pairing of Different, Sequential Events

The following solutions assume that concurrent logins by a single account are not allowed.  When your login/logout events are equally paired, i.e., there are no missing logout endpoints, a solution such as I’ll be describing will group these events so that your query can place start/end times of the events on a single row.

Sample Query #1: How to Pair up Logins with Logouts

Results from Sample Query #1:

Note how, in the derived table, we’ve created a grouping column as (ROW_NUMBER()+1)/2, which is a well-known trick for pairing consecutive rows.  Here are the results in rn for the first four rows:

Many times, usually due to incomplete implementation, web sites will not record or generate a logout event.  This clutters the data with spurious, unpaired logins.  Let’s add a couple of these to our data.

The sample data now appears as follows, where IDs 1 and 10 represent a login that is not paired with a corresponding logout event.

In order to pair logins with logouts in this case a different approach is required.  Note the use of OUTER APPLY (rather than a CROSS APPLY) is required to bring into the results set the final, unpaired login for UserID=2.

Sample Query #2: Pairing Logins with no Corresponding Logout

Since the execution plan of this query shows two Clustered Index Scans, compared to the execution plan of the first showing a single Clustered Index Scan, we’d expect the second query to be a bit slower than the first.  So this data clean-up comes at a cost. 

Results from Sample Query #2:

Unfortunately, further testing of this pairing method using a large row set identified that it is quite slow.  To find a better alternative, we’ll turn to the new LEAD analytic function in SQL 2012.  You’ll find that the following query returns the exact same row set as the prior, but ultimately performs much better (in fact it takes about 10% of the time to run).

Sample Query #3: Using LEAD to Improve Performance (where some logins have no logout)

It is quite easy to eliminate the rows with missing LogoutTimes in either of the above queries by adding a check for NULL LogoutTime to the WHERE clause.

It would also be possible to implement other business rules for cases where no logout time is recorded for a login event, for example:

  • Use the average duration of that user’s logins to predict an approximate LogoutTime.
  •  Assume the user is still logged in, so substitute GETDATE() for the LogoutTime.

We will leave these as exercises for the interested reader.

Packing the Overlapping Login Intervals

Sometime back, a good friend of mine pointed me to an article that describes a way to pack overlapping intervals.  That article is called Packing Intervals and it is by none other than SQL MVP and all-around guru, Mr. Itzik Ben-Gan.  When I first read that article, I was impressed by the elegantly simple explanation for how this highly imaginative query works, so I won’t try to inadequately reproduce it here.  What I can tell you is that it took me all of about three milliseconds (coincidentally the resolution of the SQL DATETIME data type) to realize that this was something I needed to retain in my tool box.  Let’s look at how we would apply this technique to our sample data, by first putting our Example Query #3 (eliminating the spurious logins) into a Common Table Expression and then applying the Packing Intervals technique.

Sample Query #4: Pack the Overlapping Intervals down to non-overlapping Events

It is instructive to see how each successive CTE (ignoring the first because we’ve already seen what it does) returns a row set that brings us closer to our target of non-overlapping intervals.  Rather than being repetitive, I’ll refer you to the comments at each step (above).

Results from C1:

Results from C2:

Results from C3:

The results produced by this fascinating method are shown below.

Final Results from Sample Query #4:

If we return to look at Sample Query Results #2, we find that these represent the packed intervals across all of the UserIDs.  You can also think of these as “islands” of time where users are logged into our system.

Before we move on to finding the gaps in these islands, let’s take a closer look at what’s happening in CTEs GroupedLogins and C1.  The only reason we have GroupedLogins to begin with is to remove the spurious logins.  Looking at the work being done by C1, it is ungrouping that which we just grouped.  This leads us to believe that if the spurious logins were not present it might be possible to simplify the query, so let’s try to do this after first deleting the two rows we added to our original sample data.

Sample Query #5: Simplify the Packing Method by using our logins/logouts directly

Results from Sample Query #5:

These correspond to the packed intervals in our original data, including the depiction in our diagram.  Note that it is slightly different than the packed results we got from Sample Query #4, because the “spurious logins” that we removed resulted in slightly different overlapping intervals.

Converting Packed Intervals to the Corresponding Gaps

Once we have our packed intervals (“islands”) it is simple enough to add just a little bit of code to convert these into gaps.  We draw upon a previous Simple Talk article called The SQL of Gaps and Islands in Sequence Numbers and look for the CROSS APPLY VALUES method to convert islands to gaps.  We will add this technique to Sample Query #5 to get to:

Sample Query #6: Convert Packed Intervals into Gaps Between them

Results from Sample Query #6:

Obviously, if we needed to handle spurious logins, the little bit of code we added after C4 could be applied just as easily to Sample Query #4.

Building a Test Harness

Now that we have a way to calculate gaps between overlapping intervals, we want to make sure that the code we’ve built will scale to large row sets in a reasonable fashion.  To do this, we must build a test harness with a realistic number of rows.

We can simulate the duration of a login by starting the login at a particular time of day and then assume that the user was logged in for a random period of time.  In SQL, we can generate uniform random numbers like this:

This will generate a uniformly-distributed random integer between 10 and 130, which could for example represent the minutes the user was logged in.  But we’re going to have a little fun and say that our logins are not based on a uniform distribution, rather the time logged in will behave more like a normal distribution where the mean is 240 seconds and the standard deviation is 60 seconds.  To do this, we need a way to generate normally distributed random numbers.

The function presented above does this.  Generating Non-uniform Random Numbers with SQL has an empirical proof that it does, and you will also find several other random number generators for other distributions contained therein.  To use this function, you must pass it two uniform random numbers on the interval 0 to 1.

The test data (1,000,000 rows) can be generated based on 500 days and 1000 UserIds as follows:

Each UserID logs in once per day.  Rows resulting from the query can be increased in increments of 1,000,000 rows by changing @Days to 1000, 1500 and 2000.  Using Sample Query #6 with the above table generates about 30,000 packed intervals and consequently nearly the same number of gaps (actually exactly one less).  This can be demonstrated by running the SQL code stored in the resource file: Check Test Harness (2008).sql.

Since each of our queries is built upon a series of steps, we’ll break each step down and get timings on each.  Two additional test harness SQL files are provided:

  • Test Harness 1 (2008).sql – tests 2 queries which are like Sample Query #5 plus the additional code to convert those islands to gaps (Sample Query #6).
  • Test Harness 2 (2012).sql – tests 3 queries which are like Sample Query #3 (removing the logins with missing logouts), Sample Query #4 (to calculate the packed intervals) and finally with the additional code to convert those islands into gaps. 

The DELETE statement below appears in Test Harness 2 (2012).sql to remove a few logout events.

This deletes about 1,667 logout rows out of 1,000,000 (about 0.17%).

Performance Results

Using Test Harness 1 (2008).sql (run in SQL 2008) we can demonstrate that the additional time for CPU and Elapsed run times when adding the CROSS APPLY VALUES Islands to Gaps method to the interval packing is nearly insignificant.  Together the solutions both scale quite linearly and are fairly reasonable when there are no missing logouts.

1942-clip_image004.gif

The result below at 4M rows for elapsed time (Interval Gaps less than Pack Intervals) can only be explained by an increasing amount of parallelism introduced by SQL Server into the query.

1942-clip_image006.gif

We see a similar story when reviewing the results when there are missing logouts (using Test Harness 2 (2012).sql (run in SQL 2012).  Simply needing to remove the missing logouts takes slightly more than 50% of the CPU, but when we layer calculating the gaps on top of the packing algorithm, that additional work is quite minimal.

1942-clip_image008.gif

Elapsed times tell a similar story except that the amount of time for removing the open-ended logins does not comprise nearly as large a percentage of the overall time.  Nonetheless, both interval packing and calculating the gaps between them scales quite linearly.

1942-clip_image010.gif

Note that test results for the cases of no vs. missing logouts were run on different machines due to the need for SQL 2012 for the latter, so the CPU/Elapsed times recorded may not be comparable in terms of raw magnitude.  Test machines:

  • No Missing Logouts: Windows 7 Professional and SQL 2008 R2 (both 64 bit), Dell Inspiron laptop w-Intel Core i5-2430M @ 2.4 GHz, 8GB RAM
  •  Missing Logouts: Windows 7 Enterprise and SQL 2012 (both 64 bit), Lenovo laptop w-Intel Core i5-3230M @ 2.60 GHz, 12GB RAM

Conclusion

The important concepts we have demonstrated in this article are:

  • Combining rows where login and logout events are paired.
  •  A fast solution using the SQL 2012 analytic function LEAD for combining rows where login and logout events are paired, but some of the logout events are missing.
  • The excellent solution proposed by SQL MVP Itzik Ben-Gan for packing intervals, with modifications to support both of the above cases.
  •  Using the CROSS APPLY VALUES approach to convert islands (packed intervals) to gaps (the periods of no login activity).
  •  Finally, and very importantly, that these solutions seem linearly scalable to multi-million row sets.

Further Reading

Dealing with event intervals is a fascinating and challenging problem in SQL.  If you are interested in pursuing this further, I strongly recommend that you take a look at some of the links below.  They provide even more advanced, high-performing techniques for dealing with time intervals.

Thank you for your attention and we hope you found the information in this article interesting and ultimately useful.