{"id":101783,"date":"2024-05-27T19:24:25","date_gmt":"2024-05-27T19:24:25","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=101783"},"modified":"2025-02-05T11:23:44","modified_gmt":"2025-02-05T11:23:44","slug":"snake-draft-sorting-in-sql-server-part-4","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/t-sql-programming-sql-server\/snake-draft-sorting-in-sql-server-part-4\/","title":{"rendered":"Snake Draft Sorting in SQL Server Part 4: Coordinating Parallel Job Completion"},"content":{"rendered":"<div style=\"color: black; padding: 4px 8px; margin: 4px 0 12px 0; border: 1px solid #e00; background: #f7f7f7; font-size: 0.9rem;\">Part of a series: [\u00a0 <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/t-sql-programming-sql-server\/snake-draft-sorting-in-sql-server-part-1\" rel=\"noopener\">Part 1<\/a> \u00a0|\u00a0 <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/t-sql-programming-sql-server\/snake-draft-sorting-in-sql-server-part-2\" rel=\"noopener\">Part 2<\/a> \u00a0|\u00a0 <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/t-sql-programming-sql-server\/snake-draft-sorting-in-sql-server-part-3\" rel=\"noopener\">Part 3<\/a> \u00a0|\u00a0 <b style=\"color: #999;\">Part 4<\/b> \u00a0]<\/div>\n<p>In the previous posts in this series, I described how I have optimized a long-running set of routines by processing databases, tables, and even subsets of tables in parallel. This leads to many separate jobs that all kick off at roughly the same time, but because it is difficult to get a perfect balance of work across the jobs, the end times are jagged and unpredictable:<\/p>\n<p><img decoding=\"async\" style=\"width: 90%; margin: -5px 10px; min-width: 320px;\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/03\/snake-4-1.png\" \/><\/p>\n<p>Imagine you&#8217;ve split up this chunk of work so that, collectively, it finishes much faster. When all of the threads are done, you want to kick off the next round of work. For example, if the threads are writing data to flat files, when they&#8217;re <em>all<\/em> done, you might want to then kick off a process that zips them, or moves them, or bulk loads them into a different system, or all of the above.<\/p>\n<p>When you have a <em>single<\/em> job that depends on the completion of a <em>single<\/em> previous job, it is quite easy to chain them. The last step of the first job simply calls <code style=\"font-family: consolas; background: #e4e4e4; padding: 3px 4px 1px 4px; border-radius: 3px;\">sp_start_job<\/code> for the subsequent job (or they&#8217;re simply steps in a single job).<\/p>\n<p>When you have one of more jobs that depend on the completion of <em>multiple<\/em> previous jobs, it can be a little harder to synchronize everything. Many people accomplish this by looking at past runtimes and scheduling in a buffer. For example, if the jobs typically take two hours to complete, they&#8217;ll schedule the next job(s) to start three hours after the start of the first set.<\/p>\n<p><img decoding=\"async\" style=\"width: 90%; margin: 0 10px; min-width: 320px;\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/03\/snake-4-2.png\" \/><\/p>\n<p>You might notice that this introduces two problems:<\/p>\n<ol>\n<li><b>Wasted time<\/b>. If the last job in the first round finishes at the expected time (or earlier), then there is at least an hour of idle time:<img decoding=\"async\" style=\"width: 82%; margin: 0 10px; min-width: 320px;\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/03\/snake-4-3.png\" \/><\/li>\n<li><b>Conflict<\/b>. If any job in the first round runs <em>longer<\/em> than expected, the next job(s) will start too early:<img decoding=\"async\" style=\"width: 82%; margin: 0 10px; min-width: 320px;\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/03\/snake-4-4.png\" \/><\/li>\n<\/ol>\n<p>Both of these scenarios are problematic. Obviously we want the second job to finish as early as possible, and that can&#8217;t happen if it&#8217;s sitting around waiting for nothing. And if the second job <em>starts<\/em> too early, it could be operating on data that isn&#8217;t ready &#8211; or even present.<\/p>\n<h3>A localized solution<\/h3>\n<p>I&#8217;m not going to pretend to have a magic, comprehensive answer for a complex job management solution, but I can show you exactly how I solved this problem for my scenario. My goal was simple: to ensure the next round of work starts right when it should &#8211; no sooner, no later.<\/p>\n<p><img decoding=\"async\" style=\"width: 90%; margin: 0 10px 0 0; min-width: 320px;\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/03\/snake-4-5.png\" \/><\/p>\n<p>The first thought I had was this: why not just have job 2 check if all the jobs in the first round have finished (successfully, of course)? The problem with that is, if the answer is no, then what? Sleep for 10 seconds and try again? This is not a great solution &#8211; even if it uses no resources to loop over and over again until the answer is yes, it looks to DBAs and monitoring tools like the job is running, while at least one of the previous jobs is <em>also<\/em> running. It also skews metrics involving job runtimes, if you use that information for anything (I do).<\/p>\n<p>A more logical approach is this: have each job in the first round check, as their last step, if the other jobs in the first round have also finished. If the answer is no, quit with success. If the answer is yes, run <code style=\"font-family: consolas; background: #e4e4e4; padding: 3px 4px 1px 4px; border-radius: 3px;\">sp_start_job<\/code>, and quit with success. This way, the second job won&#8217;t start until the <em>last<\/em> job in the first round has finished.<\/p>\n<p><img decoding=\"async\" style=\"width: 90%; margin: 0 10px 0 0; min-width: 320px;\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2024\/03\/snake-4-6.png\" \/><\/p>\n<p>Before I share the code to handle this, there are some caveats here. We need to:<\/p>\n<ul>\n<li>&#8230;<b>check for <em>success<\/em><\/b>, not just completion. If one of the other jobs <em>failed<\/em>, we probably want to raise an alarm.<\/li>\n<li>&#8230;<b>check<\/b> that the successful job started and finished <b>within a reasonable time<\/b>. If it completed yesterday or last week, or ran for five days, we probably want to sound an alarm.<\/li>\n<li>&#8230;<b>handle ties<\/b>. If the last two jobs finish at the exact same time, they might both try to start the second round job, but only one will win. We need to fail gracefully and <em>not<\/em> sound an alarm in this case.<\/li>\n<li>&#8230;<b>wait and verify<\/b> that the second job was successfully started, and that it wasn&#8217;t already running <em>before<\/em> we got to this step.<\/li>\n<li>&#8230;<b>have <a href=\"https:\/\/learn.microsoft.com\/en-us\/sql\/ssms\/agent\/use-tokens-in-job-steps?view=sql-server-ver16\" target=\"_blank\" rel=\"noopener\">SQL Server Agent tokens<\/a> enabled<\/b>. This isn&#8217;t <em>critical<\/em>, but I use those to identify the current job dynamically, without having to hard-code different names in each job.<\/li>\n<\/ul>\n<p>I&#8217;m going to share the code in pieces, with some additional context, then the entire job step code in a single block at the end.<\/p>\n<p>We&#8217;ll start with some control variables &#8211; how many threads, what &#8220;recent&#8221; means, the names of the first round jobs, the name of the second round job, and the current SQL Server Agent session (to ignore abandoned sessions since the last service restart that <em>look like<\/em> they&#8217;re still running). I&#8217;m also assuming your server is running in UTC &#8211; if not, you can use <code style=\"font-family: consolas; background: #e4e4e4; padding: 3px 4px 1px 4px; border-radius: 3px;\">GETDATE()<\/code>, but this can get complicated if your server also observes Daylight Saving Time.<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\"> DECLARE @ThreadCount               int = 4,\n         @JobStartThresholdMinutes  int = 720, \/* 12 hours *\/\n         @JobFinishThresholdMinutes int = 60,  \/* 1 hour   *\/\n         @CurrentAgentSession       int,\n         @now                 datetime      = GETUTCDATE(),\n         @NextJobName         nvarchar(128) = N'Job 2',\n         @ThisJobName         nvarchar(128) = N'$(ESCAPE_SQUOTE(JOBNAME))',\n         @JobNamePattern      nvarchar(128) = N'Job 1[a-d]';\n\n DECLARE @start datetime = DATEADD(MINUTE, -@JobStartThresholdMinutes,  @now),\n         @end   datetime = DATEADD(MINUTE, -@JobFinishThresholdMinutes, @now);\n\n SELECT @CurrentAgentSession = MAX([session_id]) FROM msdb.dbo.syssessions;<\/pre>\n<p><em style=\"color: #bbb;\">You could store some of these control variables in a table somewhere, instead of having them in each job step. I&#8217;m just putting them in the step for brevity.<\/em><\/p>\n<p>Next, let&#8217;s put the jobs into a table variable both to simplify joins and also as an early sniff test to make sure the right number of jobs even exist.<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\"> DECLARE @jobs table\n (\n   job_id uniqueidentifier, \n   [name] nvarchar(128)\n );\n \n INSERT @jobs(job_id, name) \n   SELECT job_id, name\n     FROM msdb.dbo.sysjobs\n     WHERE name LIKE @JobNamePattern\n       AND name &lt;&gt; @ThisJobName;\n \n IF @@ROWCOUNT &lt;&gt; @ThreadCount - 1\n BEGIN\n   RAISERROR('Wrong number of jobs found.', 11, 1);\n END<\/pre>\n<p>Next, we check how many of the jobs from the same round are currently running. If the count is not zero, raise an error and don&#8217;t continue:<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\"> ELSE\n BEGIN\n   DECLARE @RunningCount int;\n \n   SELECT @RunningCount = COUNT(*)\n     FROM @jobs AS j\n     WHERE EXISTS\n     ( \n       SELECT 1 FROM msdb.dbo.sysjobactivity\n         WHERE start_execution_date &gt;= @start\n           AND stop_execution_date IS NULL\n           AND [session_id] = @CurrentAgentSession \n           AND job_id = j.job_id\n     );\n \n   IF @RunningCount &gt; 0\n   BEGIN\n     RAISERROR('%d job(s) still running.', 11, 1, @RunningCount);\n   END<\/pre>\n<p>After that, we check to make sure that the other jobs that <em>have<\/em> completed have done so successfully &#8211; and in the time range we expect. If the data doesn&#8217;t jive, raise an error and don&#8217;t continue:<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\">   ELSE\n   BEGIN\n     DECLARE @CompletedCount int;\n \n     SELECT @CompletedCount = COUNT(*)\n       FROM @jobs AS j\n       INNER JOIN \n       (\n         SELECT job_id, instance_id = MAX(instance_id)\n           FROM msdb.dbo.sysjobhistory\n           WHERE run_status = 1\n             AND step_id = 0\n           GROUP BY job_id\n       ) AS h\n       ON j.job_id = h.job_id\n       INNER JOIN msdb.dbo.sysjobactivity AS ja\n       ON j.job_id = ja.job_id\n         AND ja.job_history_id = h.instance_id\n       WHERE ja.start_execution_date &gt;= @start\n         AND ja.stop_execution_date &gt;= @end;\n     \n     IF @CompletedCount &lt;&gt; @ThreadCount - 1\n     BEGIN\n       RAISERROR('Only %d job(s) completed.', 11, 1, @CompletedCount);\n     END<\/pre>\n<p>If we passed all those tests, it&#8217;s time to start our next job, wait a few seconds, and check to make sure it is running:<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\">     ELSE\n     BEGIN  \n       EXEC msdb.dbo.sp_start_job @job_name = @NextJobName;\n       \n       WAITFOR DELAY '00:00:10';\n\n       IF NOT EXISTS \n       (\n         SELECT 1 \n           FROM msdb.dbo.sysjobs AS j\n           INNER JOIN msdb.dbo.sysjobactivity AS ja\n              ON ja.job_id = j.job_id\n           WHERE j.name = @NextJobName\n             AND start_execution_date &gt;= DATEADD(MINUTE, -5, @now)\n             AND stop_execution_date IS NULL\n       )\n       BEGIN\n         RAISERROR('%s is not running!', 11, 1, @NextJobName);\n       END\n     END\n   END\n END<\/pre>\n<p>That&#8217;s it! Now my threaded jobs can all complete at their own pace, and the subsequent job waits &#8211; not firing until immediately after the last job finishes. There are probably some edge cases lingering here that I haven&#8217;t handled yet, but I&#8217;ll deal with those as they come.<\/p>\n<hr \/>\n<h3>Appendix: The whole code block<\/h3>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\"> DECLARE @ThreadCount               int = 4,\n         @JobStartThresholdMinutes  int = 720, \/* 12 hours *\/\n         @JobFinishThresholdMinutes int = 60,  \/* 1 hour   *\/\n         @CurrentAgentSession       int,\n         @now                 datetime      = GETUTCDATE(),\n         @NextJobName         nvarchar(128) = N'Job 2',\n         @ThisJobName         nvarchar(128) = N'$(ESCAPE_SQUOTE(JOBNAME))',\n         @JobNamePattern      nvarchar(128) = N'Job 1[a-d]';\n\n DECLARE @start datetime = DATEADD(MINUTE, -@JobStartThresholdMinutes,  @now),\n         @end   datetime = DATEADD(MINUTE, -@JobFinishThresholdMinutes, @now);\n\n SELECT @CurrentAgentSession = MAX([session_id]) FROM msdb.dbo.syssessions;\n\n DECLARE @jobs table\n (\n   job_id uniqueidentifier, \n   [name] nvarchar(128)\n );\n \n INSERT @jobs(job_id, name) \n   SELECT job_id, name\n     FROM msdb.dbo.sysjobs\n     WHERE name LIKE @JobNamePattern\n       AND name &lt;&gt; @ThisJobName;\n \n IF @@ROWCOUNT &lt;&gt; @ThreadCount - 1\n BEGIN\n   RAISERROR('Wrong number of jobs found.', 11, 1);\n END\n ELSE\n BEGIN\n   DECLARE @RunningCount int;\n \n   SELECT @RunningCount = COUNT(*)\n     FROM @jobs AS j\n     WHERE EXISTS\n     ( \n       SELECT 1 FROM msdb.dbo.sysjobactivity\n         WHERE start_execution_date &gt;= @start\n           AND stop_execution_date IS NULL\n           AND [session_id] = @CurrentAgentSession \n           AND job_id = j.job_id\n     );\n \n   IF @RunningCount &gt; 0\n   BEGIN\n     RAISERROR('%d job(s) still running.', 11, 1, @RunningCount);\n   END\n   ELSE\n   BEGIN\n     DECLARE @CompletedCount int;\n \n     SELECT @CompletedCount = COUNT(*)\n       FROM @jobs AS j\n       INNER JOIN \n       (\n         SELECT job_id, instance_id = MAX(instance_id)\n           FROM msdb.dbo.sysjobhistory\n           WHERE run_status = 1\n             AND step_id = 0\n           GROUP BY job_id\n       ) AS h\n       ON j.job_id = h.job_id\n       INNER JOIN msdb.dbo.sysjobactivity AS ja\n       ON j.job_id = ja.job_id\n         AND ja.job_history_id = h.instance_id\n       WHERE ja.start_execution_date &gt;= @start\n         AND ja.stop_execution_date &gt;= @end;\n     \n     IF @CompletedCount &lt;&gt; @ThreadCount - 1\n     BEGIN\n       RAISERROR('Only %d job(s) completed.', 11, 1, @CompletedCount);\n     END\n     ELSE\n     BEGIN  \n       EXEC msdb.dbo.sp_start_job @job_name = @NextJobName;\n       \n       WAITFOR DELAY '00:00:10';\n\n       IF NOT EXISTS \n       (\n         SELECT 1 \n           FROM msdb.dbo.sysjobs AS j\n           INNER JOIN msdb.dbo.sysjobactivity AS ja\n              ON ja.job_id = j.job_id\n           WHERE j.name = @NextJobName\n             AND start_execution_date &gt;= DATEADD(MINUTE, -5, @now)\n             AND stop_execution_date IS NULL\n       )\n       BEGIN\n         RAISERROR('%s is not running!', 11, 1, @NextJobName);\n       END\n     END\n   END\n END<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Part 4 of the snake-draft sorting series &#8211; coordinating parallel SQL Agent jobs when work units finish at different times. Covers thread-completion signalling, handling the last worker finishing, and preventing idle time when some queues empty before others. Builds on the serpentine assignment pattern from parts 1-3.&hellip;<\/p>\n","protected":false},"author":341115,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143527,53,143531],"tags":[4151],"coauthors":[158980],"class_list":["post-101783","post","type-post","status-publish","format-standard","hentry","category-database-administration-sql-server","category-featured","category-t-sql-programming-sql-server","tag-sql-server"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/101783","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/341115"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=101783"}],"version-history":[{"count":30,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/101783\/revisions"}],"predecessor-version":[{"id":105452,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/101783\/revisions\/105452"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=101783"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=101783"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=101783"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=101783"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}