{"id":1606,"date":"2013-03-18T00:00:00","date_gmt":"2013-03-18T00:00:00","guid":{"rendered":"https:\/\/test.simple-talk.com\/uncategorized\/solving-complex-t-sql-problems-step-by-step\/"},"modified":"2021-09-29T16:21:45","modified_gmt":"2021-09-29T16:21:45","slug":"solving-complex-t-sql-problems-step-by-step","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/t-sql-programming-sql-server\/solving-complex-t-sql-problems-step-by-step\/","title":{"rendered":"Solving Complex T-SQL Problems, Step-By-Step"},"content":{"rendered":"<div class=\"article-content\">\n<p class=\"start\">Repeatedly on the SQL Server forums, I see people struggling to solve common, complex business problems in T-SQL. More to the point, I see them struggling to solve them in a way that performs well and scales gracefully to large numbers of rows. My systematic approach to writing &#8216;difficult&#8217; T-SQL is always the same:<\/p>\n<ol>\n<li>Examine the data set carefully, and note down any particular characteristics, especially appearance of <code>NULL<\/code> values, use of dates, and so on.<\/li>\n<li>Make sure I fully understand the expected results and validate them, manually.<\/li>\n<li>Break the problem down into a series of simple steps, write the T-SQL, view the results, adjust if necessary, and move on to the next step.<\/li>\n<\/ol>\n<p>The key at each step is to understand and accept that the first, most intuitive solution is unlikely to be the fastest or most scalable. As I explained in my previous series, <a href=\"https:\/\/www.simple-talk.com\/sql\/performance\/writing-efficient-sql-set-based-speed-phreakery\/\">Writing Efficient SQL<\/a>, we need to look for opportunities to use set-based techniques that will help avoid non-scalable row-by-row processing, work with the smallest data set for any expensive calculations, scan the data as few times as possible, and so on. If my solution still fails to meet my expectations for performance, I try to &#8220;pair&#8221; with a co-worker, or ask advice from a T-SQL guru in the community, to help me find a better solution.<\/p>\n<p>I recently helped solve a business problem that illustrates this process, and these ideas, perfectly. Based on data from a time clock, the challenge was to calculate based on a series of entry and exit times for employees, the amount of time between each entry and exit recorded, for each person. I present first a solution that will work on SQL Server 2008 and SQL Server 2008 R2, but has to scan the data a number of times to arrive at the result. Since that solution was not acceptable, I enlisted the help of a T-SQL guru, <a href=\"https:\/\/www.simple-talk.com\/author\/peter-larsson\/\">Peter Larsson<\/a> (known as &#8220;SwePeso&#8221; or &#8220;Peso&#8221; on the various community forums) who changed the approach to the problem, and found a better solution.<\/p>\n<p>Finally, I&#8217;ll demonstrate how new analytic functions in SQL Server 2012, <code>LAG<\/code> and <code>LEAD<\/code>, make it possible to find a solution that performs just as well, and requires a little less lateral thinking.<\/p>\n<h1>Understanding the Business Requirements and Data<\/h1>\n<p>Listing 1 provides the code to create the <code>StaffHours<\/code> table and populate it with data that represents fairly the time clock data set with which I worked. It uses row constructors, introduced with SQL Server 2008, and so the code will not run in any earlier versions (see the code download for the 2005 version).<\/p>\n<pre class=\"lang:tsql theme:ssms2012\">--2008 &amp; 2012 version\r\nIF EXISTS ( SELECT * \r\n      FROM sys.tables \r\n      WHERE name = 'StaffHours' ) \r\n   DROP TABLE StaffHours; \r\n   \r\nCREATE TABLE StaffHours\r\n   ( \r\n      StaffMember CHAR(1) , \r\n      EventDate DATE , \r\n      EventTime TIME , \r\n      EventType VARCHAR(5) \r\n   ); \r\n\r\nINSERT INTO StaffHours\r\n   ( StaffMember, EventDate, EventTime, EventType ) \r\nVALUES \r\n   ( 'A', '2013-01-07', '08:00', 'Enter' ), \r\n   ( 'B', '2013-01-07', '08:10', 'Enter' ), \r\n   ( 'A', '2013-01-07', '11:30', 'Exit' ), \r\n   ( 'A', '2013-01-07', '11:35', 'Exit' ), \r\n   ( 'A', '2013-01-07', '12:45', 'Enter' ), \r\n   ( 'B', '2013-01-07', '16:45', 'Exit' ) \r\n   ( 'A', '2013-01-07', '17:30', 'Exit' )\r\n   ( 'A', '2013-01-07', '1:00', 'Exit' )\r\n   ( 'C', '2013-01-07', '08:33', 'Enter' ), \r\n   ( 'C', '2013-01-07', '17:33', 'Exit' ), \r\n   ( 'C', '2013-01-07', '17:35', 'Exit' );<\/pre>\n<p class=\"caption\">Listing 1: Creating and Populating the <code>StaffHours<\/code> table<\/p>\n<p>Figure 1 shows the &#8220;time clock&#8221; data set, essentially a series of entry and exit times from each member of staff.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/1767-1-86d61e9f-53bd-461d-8a98-18e674fcad01.png\" alt=\"1767-1-86d61e9f-53bd-461d-8a98-18e674fca\" \/><\/p>\n<p class=\"caption\">Figure 1: The original time clock data set<\/p>\n<p>The requirement is to calculate the amount of time in hours, minutes and seconds between each <strong>Enter<\/strong> and <strong>E<\/strong><strong>xit<\/strong>, for each staff member. The first point to note is that while there is usually a matching <strong>Exit<\/strong> row for each <strong>Enter<\/strong>, sometimes there is an <strong>Enter<\/strong> with no <strong>Exit<\/strong> or an <strong>Exit<\/strong> with no <strong>Enter<\/strong>. Here are the business rules:<\/p>\n<ul>\n<li>If there are consecutive <strong>Enter<\/strong> rows, use the first one.<\/li>\n<li>If there are consecutive <strong>Exit<\/strong> rows, use the last one.<\/li>\n<li>Ignore any <strong>Exit<\/strong> rows with no matching <strong>Enter<\/strong> row (i.e. if the first row for an employee is an <strong>Exit<\/strong>, then ignore it).<\/li>\n<li>Ignore any <strong>Enter<\/strong> rows with no matching <strong>Exit<\/strong> row (i.e. if the last row for an employee is an <strong>Enter<\/strong>, then ignore it).<\/li>\n<li>The <strong>Exit<\/strong> for an <strong>Enter<\/strong> could be a subsequent day.<\/li>\n<\/ul>\n<p>We can validate our expected results, manually. Employee A starts a work session at 08:00 on Jan 7. There follows two Exit times and, according to our business rules, we use the last one (11:35), for a work session duration of 3 hours 35 minutes. Employee A enters again at 12:45 on the same day and again we have two Exit entries, the latter falling on the next day, to give 12:15:00 duration. In a similar fashion, we can hand-calculate the full result set, as shown in Figure 2.<\/p>\n<div>\n<table>\n<tbody>\n<tr>\n<td>\n<p>StaffMember<\/p>\n<\/td>\n<td>\n<p>EnterDateTime<\/p>\n<\/td>\n<td>\n<p>ExitDateTime<\/p>\n<\/td>\n<td>\n<p>WorkTime<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>A<\/p>\n<\/td>\n<td>\n<p>2013-01-07 08:00:00<\/p>\n<\/td>\n<td>2013-01-07 11:35:00<\/p>\n<p>&nbsp;<\/p>\n<\/td>\n<td>\n<p>03:35:00<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>A<\/p>\n<\/td>\n<td>\n<p>2013-01-07 12:45:00<\/p>\n<\/td>\n<td>\n<p>2013-01-08 01:00:00<\/p>\n<\/td>\n<td>\n<p>12:15:00<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>B<\/p>\n<\/td>\n<td>\n<p>2013-01-07 08:10:00<\/p>\n<\/td>\n<td>\n<p>2013-01-07 16:45:00<\/p>\n<\/td>\n<td>\n<p>08:35:00<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>C<\/p>\n<\/td>\n<td>\n<p>2013-01-07 08:33:00<\/p>\n<\/td>\n<td>\n<p>2013-01-07 17:35:00<\/p>\n<\/td>\n<td>\n<p>09:02:00<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p class=\"caption\">Figure 2: Expected results<\/p>\n<h1>A Pre-SQL Server 2012 Solution: pairing events using row numbers<\/h1>\n<p>Listing 2 shows my first solution, which uses <strong>row numbers<\/strong>and self-joins on matching rows, based on the row numbers. It takes a very systematic approach, using five levels of CTEs, as follows:<\/p>\n<ul>\n<li><code>Level1<\/code> &#8211; partition the data by <code>StaffMember<\/code> and apply <code>ROW_NUMBER<\/code><\/li>\n<li><code>Level2<\/code> &#8211; perform two self-joins to find, for each row, the value of the <code>EventType<\/code> for the previous and then the subsequent rows, in each partition<\/li>\n<li><code>Level3<\/code> &#8211; filter out unneeded rows and reapply row numbers<\/li>\n<li><code>Level4<\/code> &#8211; perform another self-join to pair up <code>ENTER<\/code> and <code>EXIT<\/code> rows<\/li>\n<li><code>Level5<\/code> &#8211; use <code>DATEDIFF<\/code> to calculate work session durations for each employee<\/li>\n<\/ul>\n<pre class=\"lang:tsql theme:ssms2012\">--SET STATISTICS IO ON;; \r\nWITH    Level1\r\n           AS (-- apply row numbers \r\n               SELECT StaffMember , \r\n                     CAST(EventDate AS DATETIME) \r\n                     +   CAST(EventTime AS DATETIME) AS EventDateTime , EventType , \r\n                     ROW_NUMBER() OVER ( PARTITION BY StaffMember \r\n                                  ORDER BY EventDate, EventTime ) AS RowNum \r\n              FROM StaffHours \r\n              ), \r\n        LEVEL2\r\n           AS (-- find the last and next event type for each row \r\n              SELECT   A.StaffMember , \r\n                       A.EventDateTime , \r\n                       A.EventType , \r\n                       COALESCE(LastVal.EventType, 'N\/A') AS LastEvent , \r\n                       COALESCE(NextVal.EventType, 'N\/A') AS NextEvent \r\n              FROM Level1 A \r\n                   LEFT JOIN Level1 LastVal \r\n                      ON A.StaffMember = LastVal.StaffMember \r\n                         AND A.RowNum - 1 = LastVal.RowNum \r\n                   LEFT JOIN Level1 NextVal \r\n                      ON A.StaffMember = NextVal.StaffMember \r\n                         AND A.RowNum + 1 = NextVal.RowNum ), \r\n        Level3 \r\n           AS (-- reapply row numbers to row-eliminated set \r\n              SELECT  StaffMember , \r\n                      EventDateTime , \r\n                      EventType , \r\n                      LastEvent , \r\n                      NextEvent , \r\n                      ROW_NUMBER() OVER ( PARTITION BY StaffMember \r\n                          ORDER BY EventDateTime ) AS RowNBr \r\n              FROM Level2 \r\n              WHERE   NOT ( EventType = 'Enter' \r\n                           AND LastEvent = 'Enter' \r\n                          ) \r\n                     AND NOT ( EventType = 'Exit' \r\n                              AND NextEvent = 'Exit' \r\n                              ) \r\n                     AND NOT ( EventType = 'Enter' \r\n                              AND NextEvent = 'N\/A' \r\n                              ) \r\n                     AND NOT ( EventType = 'Exit' \r\n                              AND LastEvent = 'N\/A' \r\n                              ) \r\n              ), \r\n        Level4\r\n           AS (-- pair enter and exit rows. \r\n               SELECT A.StaffMember , \r\n                      A.EventDateTime , \r\n                      B.EventDateTime AS ExitDateTime \r\n               FROM Level3 A \r\n                    JOIN Level3 B ON A.StaffMember = B.StaffMember \r\n                                     AND A.RowNBr + 1 = B.RowNBr \r\n               WHERE A.EventType = 'Enter' \r\n                     AND B.EventType = 'Exit' \r\n               ), \r\n        LEVEL5 \r\n           AS (--Calculate the work session duration \r\n               SELECT StaffMember , \r\n                      DATEDIFF(second, EventDateTime, ExitDateTime) \r\n                               AS Seconds , \r\n                      EventDateTime , \r\n                      ExitDateTime \r\n               FROM Level4\r\n              ) \r\n   SELECT StaffMember , \r\n          EventDateTime , \r\n          ExitDateTime , \r\n          seconds , \r\n          RIGHT('0' +   CAST(Seconds \/ 3600 AS VARCHAR(2)), 2) + ':' \r\n          + RIGHT('0' +   CAST(Seconds % 3600 \/ 60 AS VARCHAR(2)), 2) + ':' \r\n          + RIGHT('0' +   CAST(Seconds % 3600 % 60 AS VARCHAR(2)), 2) \r\n                               AS WorkTime \r\n   FROM Level5;<\/pre>\n<p class=\"caption\">Listing 2: Calculating work session duration using row numbers, CTEs and self-joins<\/p>\n<p>I spent quite a bit of time working on this query, taking it a step at a time, checking the results and moving on to the next level. However, as soon as I saw the <code>STATISTICS IO<\/code> information, I knew there was a problem.<\/p>\n<pre class=\"lang:tsql theme:ssms2012\">(4 row(s) affected) \r\n... \r\nTable 'StaffHours'. Scan count 6, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.\r\n    <\/pre>\n<p>The query scanned the table six times! Of course, with a small amount of data, this was not a problem, but the solution will never scale. Figure 3 visualizes the problem<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/1767-SixScans-4ed13d38-240f-4646-add8-1b32acbbfb61.png\" alt=\"1767-SixScans-4ed13d38-240f-4646-add8-1b\" \/><\/p>\n<p class=\"caption\">Figure 3: Why we need six scans to arrive at the result<\/p>\n<p>While my solution is intuitive, I&#8217;d fallen into the trap of thinking about what I needed to do with <em>each row<\/em>, instead of considering what I could do by looking at the <em>data as a set<\/em>. This skill does not come easily to most of us, especially those with a background in programming that teaches an iterative approach.<\/p>\n<h1>A Better Pre-SQL Server 2012 Approach: Islands and Gaps<\/h1>\n<p>When I wrote a series of articles on the <em>SQL Server Speed Phreak<\/em> competitions (see <a href=\"https:\/\/www.simple-talk.com\/author\/kathi-kellenberger\/\">https:\/\/www.simple-talk.com\/author\/kathi-kellenberger\/<\/a>), I discussed the compromise between solutions that were simple to understand, and &#8220;fast enough&#8221;, and the absolute fastest solutions possible. I spent time trying to explain how the latter worked, for the benefit of those with only medium-sized SQL brains.<\/p>\n<p>On several occasions, when dissecting the fastest solution, the author was Peter Larsson. I turned to him for help on this problem, and he devised a very elegant solution that returns the answer in just one table scan instead of six. However, even Peter didn&#8217;t conjure his best solution straight off the bat. His first attempt, show in in Listing 3, reduces the scan count from six to three.<\/p>\n<pre class=\"lang:tsql theme:ssms2012\">; \r\nWITH cteSource ( StaffMember, EventDateTime, EventType,  rn ) \r\n        AS ( SELECT   StaffMember , \r\n                      CAST(EventDate AS DATETIME) \r\n                      +   CAST(EventTime AS DATETIME) AS EventDateTime , \r\n                      EventType , \r\n                      ROW_NUMBER() OVER ( PARTITION BY StaffMember \r\n                              ORDER BY EventDate, EventTime ) AS  rn \r\n             FROM dbo.StaffHours \r\n           ), \r\n     cteGrouped ( StaffMember, EventDateTime, grp ) \r\n        AS ( SELECT c.StaffMember , \r\n                    c.EventDateTime , \r\n                    ROW_NUMBER() OVER ( PARTITION BY c.StaffMember \r\n                                 ORDER BY c.EventDateTime ) \r\n                    - 1 AS grp \r\n             FROM cteSource AS  c \r\n                  LEFT JOIN cteSource AS p \r\n                     ON p.StaffMember = c.StaffMember \r\n                        AND p.rn = c.rn - 1 \r\n                  LEFT JOIN cteSource AS n \r\n                     ON n.StaffMember = c.StaffMember \r\n                        AND n.rn = c.rn + 1 \r\n             WHERE ( COALESCE(p.EventType, 'Exit') = 'Exit' \r\n                     AND c.EventType = 'Enter' \r\n                   ) \r\n                   OR ( c.EventType = 'Exit' \r\n                        AND COALESCE(n.EventType, 'Enter') = 'Enter' \r\n                       ) \r\n            ) \r\n   SELECT StaffMember , \r\n          MIN(EventDateTime) AS EventDateTime , \r\n          MAX(EventDateTime) AS ExitDateTime , \r\n          DATEDIFF(SECOND, MIN(EventDateTime), MAX(EventDateTime)) \r\n                                                         AS Seconds ,\r\n          CONVERT(CHAR(8), DATEADD(SECOND, \r\n                               DATEDIFF(SECOND, MIN(EventDateTime),\r\n                                  MAX(EventDateTime)), \r\n                                     '00:00:00'), 8) AS WorkTime \r\n   FROM cteGrouped \r\n   GROUP BY StaffMember , \r\n            grp \/ 2 \r\n   HAVING COUNT(*) = 2;<\/pre>\n<p class=\"caption\">Listing 3: Peter&#8217;s First Solution Eliminates One Self-Join<\/p>\n<p>Peter&#8217;s initial solution contains the two self-joins to determine the previous and next event types just as my solution does, but he manages to eliminate the final self-join (level 4 in my solution), and with it half the scans.<\/p>\n<p>He achieved this by creating groups of rows representing each <code>Enter<\/code> and <code>Exit<\/code> pair, using row numbers generated in the <code>cteGrouped<\/code> level. At this level, he also used a simplified <code>WHERE<\/code> clause to eliminate most of the invalid rows, keeping every <code>Enter<\/code> row that had a previous <code>Exit<\/code> row and every <code>Exit<\/code> row with a subsequent <code>Enter<\/code> row. Figure 4 shows how the data looks at the <code>cteGrouped<\/code> stage.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/1767-1-ad625675-fde7-4bcf-86bc-9f0e959beafb.png\" alt=\"1767-1-ad625675-fde7-4bcf-86bc-9f0e959be\" \/><\/p>\n<p class=\"caption\">Figure 4: The values from <code>cteGrouped<\/code><\/p>\n<p>Notice that the value of <code>grp<\/code> for all of the remaining <code>Enter<\/code> rows are even, and the remaining <code>Exit<\/code> rows are odd. Integer division throws away remainders, so when the <code>grp<\/code> values are divided by two we end up with group 0 for the first enter\/exit pair, group 1 for the second enter\/exit pair and so on. This observation is important because it means we can now find each enter\/exit pair using that grouping, in the final step. Use of <code>Having COUNT(*) = 2<\/code> ensures that we are working with pairs of rows and not any orphaned <code>Exit<\/code> events at the beginning, or <code>Enter<\/code> events at the end.<\/p>\n<p>The other point to remember, at this stage, is that the <code>Enter<\/code> date and time is the <code>MINIMUM<\/code> for the group, and the <code>Exit<\/code> date and time is the <code>MAXIMUM<\/code> for the group. Peter takes advantage of this in the final step to match up the <code>Enter<\/code> and <code>Exit<\/code> times without doing another self-join. Remember that the final self-join, if performed, doubles the work with six scans of the data.<\/p>\n<p>In the final step, the query performs the grouping as well as the calculations. Peter uses an interesting technique to do the calculation.<\/p>\n<pre class=\"lang:tsql theme:ssms2012\">... \r\n   CONVERT(CHAR(8), \r\n      DATEADD(SECOND, DATEDIFF(SECOND, MIN(EventDateTime), \r\n         MAX(EventDateTime)), \r\n            '00:00:00'), 8) AS WorkTime ...<\/pre>\n<p>The <code>DATEDIFF<\/code> calculates the seconds between the <code>MIN(EventDateTime)<\/code>, or <code>ENTER<\/code> time, and the <code>MAX(EventDateTime)<\/code>, or <code>Exit<\/code>. He then adds those seconds back to time zero (<code>00:00:00<\/code>). He uses the style 8 (<code>hh:mm:ss<\/code>) to format the value returned. Often, people use the <code>CAST<\/code> function in preference to the <code>CONVERT<\/code> function, but the benefit of the latter is that you can specify a style that formats the date.<\/p>\n<p>However, not happy with three scans of the table, Peter tried again to improve the performance and arrived at a masterful &#8220;Islands and Gaps&#8221; solution that eliminated the initial self-joins, thus reducing the table scans to just one!<\/p>\n<pre class=\"lang:tsql theme:ssms2012\">; \r\nWITH cteIslands ( StaffMember, EventType, EventDateTime, grp ) \r\n       AS ( SELECT StaffMember , \r\n                 EventType , \r\n                 CAST(EventDate AS DATETIME) \r\n                 +   CAST(EventTime AS DATETIME) AS EventDateTime , \r\n                 ROW_NUMBER() \r\n                    OVER ( ORDER BY StaffMember, EventDate, EventTime ) \r\n                 - ROW_NUMBER() \r\n                    OVER ( ORDER BY EventType, StaffMember, \r\n                                    EventDate, EventTime ) AS grp \r\n             FROM dbo.StaffHours \r\n           ), \r\n     cteGrouped ( StaffMember, EventType, EventDateTime ) \r\n      AS ( SELECT StaffMember , \r\n                  MIN(EventType) , \r\n                  CASE WHEN MIN(EventType) = 'Enter' \r\n                       THEN MIN(EventDateTime) \r\n                       ELSE MAX(EventDateTime) \r\n                  END AS EventDateTime \r\n           FROM cteIslands \r\n           GROUP BY StaffMember , \r\n                    grp \r\n         ), \r\n     cteResult ( StaffMember, EventType, EventDateTime, grp ) \r\n       AS ( SELECT StaffMember , \r\n                   EventType , \r\n                   EventDateTime , \r\n                   ROW_NUMBER() OVER ( PARTITION BY StaffMember ORDER BY EventDateTime ) \r\n                   - 1 AS grp \r\n            FROM cteGrouped \r\n          ) \r\n   SELECT StaffMember , \r\n          MIN(EventDateTime) AS EventDateTime , \r\n          MAX(EventDateTime) AS ExitDateTime , \r\n          DATEDIFF(SECOND, MIN(EventDateTime), MAX(EventDateTime)) AS Seconds ,\r\n          CONVERT(CHAR(8), DATEADD(SECOND, \r\n                           DATEDIFF(SECOND, MIN(EventDateTime),\r\n                                    MAX(EventDateTime)), '00:00:00'), 8) AS WorkTime \r\n   FROM cteResult \r\n   GROUP BY StaffMember , \r\n            ( grp - CASE WHEN EventType = 'Exit' \r\n                              AND Grp = 0 THEN NULL \r\n                         ELSE 0 \r\n                    END ) \/ 2 \r\n   HAVING COUNT(*) = 2<\/pre>\n<p class=\"caption\">Listing 4: The Islands and Gaps Pre-2012 Solution<\/p>\n<p>The first CTE, <code>cteIslands<\/code>, applies row numbers to the data sorted in two different ways and subtracts the two values. In order to see what is going on more easily, let&#8217;s extract and execute the query from <code>cteIslands<\/code> and look at the results it returns.<\/p>\n<pre class=\"lang:tsql theme:ssms2012\">SELECT StaffMember , \r\n       EventType , \r\n       CAST(EventDate AS DATETIME) +   CAST(EventTime AS DATETIME) AS EventDateTime , \r\n       ROW_NUMBER() OVER ( ORDER BY StaffMember, EventDate, EventTime ) AS ByTime , \r\n       ROW_NUMBER() OVER ( ORDER BY EventType, StaffMember, EventDate, EventTime ) AS ByType , \r\n       ROW_NUMBER() OVER ( ORDER BY StaffMember, EventDate, EventTime ) \r\n       - ROW_NUMBER() OVER ( ORDER BY EventType, StaffMember, EventDate, EventTime) AS grp \r\nFROM dbo.StaffHours;<\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/1767-5c4b990a-c3bb-4fe0-b4af-77a7ea677de1.png\" alt=\"1767-5c4b990a-c3bb-4fe0-b4af-77a7ea677de\" \/><\/p>\n<p class=\"caption\">Listing 5: Viewing the Islands<\/p>\n<p>Listing 5 applies the <code>ByTime<\/code> row numbers in order of <code>StaffMember<\/code>, and then <code>EventDateTime<\/code>. It applies the <code>ByType<\/code> row numbers in order of <code>EventType<\/code>, <code>StaffMember<\/code> and then <code>EventDateTime<\/code>. If an <code>Exit<\/code> event follows another <code>Exit<\/code> event, they will end up in the same <code>GRP<\/code> (<code>ByTime - ByType<\/code>). The same goes for <code>Enter<\/code> events.<\/p>\n<p>If the data were perfect, an <code>Exit<\/code> event would follow every <code>Enter<\/code> event. If that had been the case, all the <code>ByTime<\/code> values would have odd numbers for the <code>Enter<\/code> rows, and even numbers for the <code>Exit<\/code> rows and so the difference between the <code>ByTime<\/code> and <code>ByType<\/code> values would have increased by one for each <code>Enter<\/code> row, and increased by one for each <code>Exit<\/code> row, thus giving us no duplicate <code>GRP<\/code> numbers. Figure 6 shows how the results of Listing 5 would look assuming ideal data.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/1767-1-f8678a92-0c9c-4750-9715-4258df364d75.png\" alt=\"1767-1-f8678a92-0c9c-4750-9715-4258df364\" \/><\/p>\n<p><strong>Figure 6: <\/strong><strong>Results of Listing 5, if<\/strong><strong> the data <\/strong><strong>were<\/strong><strong> perfect<\/strong><\/p>\n<p>In our imperfect data, because two adjacent <code>Exit<\/code> events will increase both <code>ByTime<\/code> and <code>ByType<\/code> by one (Figure 5), the difference between <code>ByTime<\/code> and <code>ByType<\/code> for the two rows are equivalent, and those two adjacent rows become a group (an &#8220;island&#8221;). This enables us to use <code>MIN<\/code> for <code>Enter<\/code> rows and <code>MAX<\/code> for <code>Exit<\/code> rows to find the valid row for each <code>GRP<\/code>.<\/p>\n<p>The remainder of this solution uses the same techniques as Listing 3 to pair <code>Enter<\/code> and <code>Exit<\/code> rows, and to calculate the difference in time.<\/p>\n<div class=\"tips\">\n<p><em>Note that Peter submitted a further iteration to his solution, which uses the same basic algorithm but with refinements, including small bug fixes for two edge conditions. It is available as part of the code download at the bottom of the article.<\/em><\/p>\n<\/div>\n<h1>SQL Server 2012-only Solution: New Analytic Functions<\/h1>\n<p>While this solution performs far better than my original attempt and will perform well on all SQL Server versions, up to and including SQL Server 2012, a solution that is easier to develop, and has equal performance, takes advantage of some new functions available in SQL Server 2012.<\/p>\n<p>SQL Server 2012 brings us several new functions, three of which will be very helpful here: <code>LAG<\/code>, <code>LEAD<\/code>, and <code>TIMEFROMPARTS<\/code>. <code>LAG<\/code> and <code>LEAD<\/code> are two new analytic window functions. <code>TIMEFROMPARTS<\/code> is one of several new functions used to build a particular date and time with the parts.<\/p>\n<p>In our previous solution, we created partition &#8220;windows&#8221; based on <code>StaffMember<\/code>, and used the <code>ROW_NUMBER<\/code> function to number sequentially each row in each partition (Listing 2). However, we then had to perform two extra table scans to get the event type values for the previous and next rows. The new <code>LAG<\/code> and <code>LEAD<\/code> functions allow us to retrieve these event type values from each &#8220;window&#8221; without needing to apply row numbers first. The <code>LAG<\/code> function returns the previous value from a particular column without rescanning the table. With <code>LEAD<\/code>, we can find the next value. Instead of rescanning, SQL Server performs the operation on data already in memory. This means fewer scans of the data and fewer logical reads.<\/p>\n<p>The <code>LAG<\/code> and <code>LEAD<\/code> functions each accept three parameters. The first parameter specifies the column for the value we wish to retrieve. The second parameter is optional and is the number of rows to go back or forward with the default of one. The third parameter, also optional, defines a default value in case a <code>NULL<\/code> is returned.<\/p>\n<p>Listing 6 shows the code using these functions to return the previous and next values of the event type. Notice that we also return <code>N\/A<\/code> instead of <code>NULL<\/code> and so render redundant the <code>COALESCE<\/code> logic we previously added for this task.<\/p>\n<pre class=\"lang:tsql theme:ssms2012\">SELECT StaffMember , \r\n       EventType , \r\n       CAST(EventDate AS DATETIME) + CAST(EventTime AS DATETIME) AS EventDateTime , \r\n       LAG(EventType, 1, 'N\/A') OVER ( PARTITION BY StaffMember ORDER BY EventDate, EventTime ) AS LastEvent , \r\n       LEAD(EventType, 1, 'N\/A') OVER ( PARTITION BY StaffMember ORDER BY EventDate, EventTime ) AS NextEvent \r\nFROM StaffHours;\r\nGO<\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/1767-1-63bbb995-b857-4a61-9dd5-65aa955d4030.png\" alt=\"1767-1-63bbb995-b857-4a61-9dd5-65aa955d4\" \/><\/p>\n<p class=\"caption\">Listing 6: Using <code>LAG<\/code> and <code>LEAD<\/code> to find the previous and next event type values<\/p>\n<p>Critically, we&#8217;ve achieved this with only one scan of the table, instead of three. The next step is to filter out the unneeded rows. SQL Server processes the <code>WHERE<\/code> clause <em>before<\/em> the <code>LAG<\/code> and <code>LEAD<\/code> functions are processed, so we&#8217;ll have to use the CTE strategy again. Listing 7 shows the query for this step.<\/p>\n<pre class=\"lang:tsql theme:ssms2012\">; WITH Level1 \r\n    AS ( -- Use LAG and LEAD to find the previous and next event type \r\n         -- values (Listing 6) \r\n        ) \r\n        SELECT StaffMember , \r\n               EventType , \r\n               EventDateTime , \r\n               LastEvent , \r\n               NextEvent \r\n        FROM Level1 \r\n        WHERE NOT ( EventType = 'Enter' \r\n                    AND LastEvent = 'Enter' \r\n                   ) \r\n          AND NOT ( EventType = 'Exit' \r\n                    AND NextEvent = 'Exit' \r\n                   ) \r\n          AND NOT ( EventType = 'Enter' \r\n                    AND NextEvent = 'N\/A' \r\n                   ) \r\n          AND NOT ( EventType = 'Exit' \r\n                    AND LastEvent = 'N\/A' \r\n                   ); \r\nGO<\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/1767-1-d90fffe5-1189-4e9a-874d-a1af5f7126f9.png\" alt=\"1767-1-d90fffe5-1189-4e9a-874d-a1af5f712\" \/><\/p>\n<p class=\"caption\">Listing 7: Filter out unneeded rows<\/p>\n<p>Crucially, we are still at one table scan. We now have the rows we need, but we can&#8217;t do the time calculations because the <code>EventDateTime <\/code>values are on different rows. We can use the <code>LEAD<\/code> function again to access the <code>EventDateTime<\/code> value on the next row. Because we can nest functions, we can use the results of the <code>LEAD<\/code> function inside the <code>DATEDIFF<\/code> function to calculate seconds. Listing 8 shows the additional code for this step.<\/p>\n<pre class=\"lang:tsql theme:ssms2012\">;\r\nWITH Level1 \r\n       AS ( \r\n            -- Using LAG and LEAD to find the previous and next event type \r\n            -- values (Listing 6) \r\n           ), \r\n     Level2 \r\n       AS ( \r\n            -- Filter out unneeded rows (Listing 7) \r\n           ) \r\n  SELECT StaffMember , \r\n         EventType , \r\n         EventDateTime , \r\n         DATEDIFF(second, EventDateTime, \r\n                  LEAD(EventDateTime) \r\n                    OVER ( PARTITION BY StaffMember \r\n                       ORDER BY EventDateTime )) AS Seconds \r\n  FROM Level2; \r\nGO<\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/1767-1-4572ce3a-30b5-4f23-ba5d-ec4a395a5843.png\" alt=\"1767-1-4572ce3a-30b5-4f23-ba5d-ec4a395a5\" \/><\/p>\n<p class=\"caption\">Listing 8: Calculate work session duration in seconds<\/p>\n<p>The final step is to turn Seconds into hours, minutes, and seconds. We can use the new <code>TIMEFROMPARTS<\/code> functions to simplify the formula. Each date or time data type in SQL Server 2012 has a corresponding <code>*FROMPARTS<\/code> function. These allow us to build a date from its component parts. There is no casting or appending. Finally, we filter to return just the <code>Enter<\/code> rows. Listing 9 shows the final 2012-only code.<\/p>\n<pre class=\"lang:tsql theme:ssms2012\">; \r\nWITH Level1 \r\n       AS ( SELECT StaffMember , \r\n                   EventType , \r\n                   CAST(EventDate AS DATETIME) \r\n                   +   CAST(EventTime AS DATETIME) AS EventDateTime , \r\n                   LAG(EventType, 1, 'N\/A') \r\n                      OVER ( PARTITION BY StaffMember \r\n                         ORDER BY EventDAGate, EventTime ) AS LastEvent , \r\n                   LEAD(EventType, 1, 'N\/A') \r\n                      OVER ( PARTITION BY StaffMember \r\n                         ORDER BY EventDate, EventTime ) AS NextEvent \r\n            FROM StaffHours ), \r\n     Level2 \r\n       AS ( SELECT StaffMember , \r\n                   EventType , \r\n                   EventDateTime , \r\n                   LastEvent , \r\n                   NextEvent \r\n            FROM   Level1 \r\n            WHERE  NOT ( EventType = 'Enter' \r\n                         AND LastEvent = 'Enter' \r\n                       ) \r\n               AND NOT ( EventType = 'Exit' \r\n                         AND NextEvent = 'Exit' \r\n                       ) \r\n               AND NOT ( EventType = 'Enter' \r\n                         AND NextEvent = 'N\/A' \r\n                       ) \r\n               AND NOT ( EventType = 'Exit' \r\n                         AND LastEvent = 'N\/A' \r\n                       ) \r\n          ), \r\n     Level3 \r\n       AS ( SELECT StaffMember , \r\n                   EventType , \r\n                   EventDateTime , \r\n                   DATEDIFF(second, EventDateTime, \r\n                            LEAD(EventDateTime) \r\n                               OVER ( PARTITION BY StaffMember \r\n                                  ORDER BY EventDateTime )) AS Seconds \r\n            FROM Level2 \r\n           ) \r\n   SELECT StaffMember , \r\n          EventDateTime , \r\n          TIMEFROMPARTS(Seconds \/ 3600, Seconds % 3600 \/ 60, \r\n                        Seconds % 3600 % 60, 0, 0) AS WorkTime \r\n   FROM Level3 \r\n   WHERE EventType = 'Enter';<\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/1767-1-7f0a843b-581b-44e5-8558-277c94710ecb.png\" alt=\"1767-1-7f0a843b-581b-44e5-8558-277c94710\" \/><\/p>\n<p class=\"caption\">Listing 9: The final SQL 2012-only code<\/p>\n<p>Here are the results, just as expected. Thanks to the new functions in SQL Server 2012, the query is simpler and easier to write and is much more efficient, scanning the data just one time. If you are working with SQL Server 2012, keep the new functions in mind to help you solve those tricky queries. If you&#8217;re interested in seeing Peter&#8217;s SQL 2012-only solution, which is similar to mine but with some refinements, it&#8217;s available in the code download file.<\/p>\n<h1>Summary<\/h1>\n<p>Writing T-SQL queries to solve complex problems is often difficult. Frequently, I will assist others, either on forums or by direct request, in solving those problems. My methodology involves breaking the problem down and solving it systematically. At each step, I examine the results, adjust if necessary, and move on to the next step. I don&#8217;t expect to get the right answer right off the bat; writing a complex query takes time. I hope that the individual asking for help will not just take the answer, but will learn something and be better able to handle the next challenging query that comes along. The same applies to me, of course. Peter managed to find a more efficient pre-2012 solution and I can take what I learned from that and apply to future problems.<\/p>\n<p>Luckily for SQL Server developers, T-SQL functionality has improved immensely starting with the introduction of the ranking functions and the <code>OVER<\/code> clause in 2005, and continuing with the introduction of very useful analytic functions, such as <code>LAG<\/code> and <code>LEAD<\/code>, in SQL Server 2012. If you haven&#8217;t kept up with these improvements, you owe it to yourself to spend some time learning about them. I found that once I started working with such functions, and Common Table Expressions, I relied on them more and more.<\/p>\n<p>In particular, if you are fortunate enough to run SQL Server 2012, be sure to learn about the new analytic and other functions. The improved performance in many queries may be the tipping point to get your organization to upgrade!<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>What should you do if your first, most intuitive solution to a problem ends up scanning the data more than is necessary, resulting in poor performance? Have you missed a new SQL Server feature that can remove inefficiency from your technique? Alternatively, do you need a little help, and some lateral thinking, to open the path to a different approach? Sometimes, the answer is &#8220;both&#8221;.&hellip;<\/p>\n","protected":false},"author":110218,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143531],"tags":[4150,4151,4183,4252],"coauthors":[11292],"class_list":["post-1606","post","type-post","status-publish","format-standard","hentry","category-t-sql-programming-sql-server","tag-sql","tag-sql-server","tag-t-sql","tag-t-sql-programming"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/1606","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/110218"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=1606"}],"version-history":[{"count":4,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/1606\/revisions"}],"predecessor-version":[{"id":76220,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/1606\/revisions\/76220"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=1606"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=1606"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=1606"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=1606"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}