Red Gate forums :: View topic - Intermittent Job Failure alert for stopped jobs
Return to www.red-gate.com RSS Feed Available

Search  | Usergroups |  Profile |  Messages |  Log in  Register 
Go to product documentation
SQL Monitor 3
SQL Monitor 3 forum

Intermittent Job Failure alert for stopped jobs

Search in SQL Monitor 3 forum
Post new topic   Reply to topic
Jump to:  
Go to page 1, 2  Next
Author Message
fifthrace



Joined: 08 Jun 2012
Posts: 12
Location: Indianapolis, IN

PostPosted: Wed Aug 08, 2012 7:46 pm    Post subject: Intermittent Job Failure alert for stopped jobs Reply with quote

We have a series of SSIS SQL Agent jobs that are running. Every hour another job runs that recycles the SSIS jobs by running the following:
Code:

USE msdb ;
GO
EXEC dbo.sp_stop_job N'JOB_NAME' ;
GO

USE msdb ;
GO
EXEC dbo.sp_start_job N'JOB_NAME' ;
GO


Looking at the job history for the SSIS jobs, I see this in the step details:
Executed as user: DOMAIN\User. The step was cancelled (stopped) as the result of a stop job request.

I can look at the history and compare that to my "Job failed" alerts in SQL Monitor and there are numerous examples of the monitoring software not showing a failure. This seems to happen in waves. For instance, one day I'll get a bunch of alerts and the next I won't get any.

Is SQL Monitor supposed to log "stopped" jobs as failures? If so, is there a way to change the default behavior? If not, I'd be happy to supply more information as needed to help troubleshoot this further.

Thanks all,
Robbie
Back to top
View user's profile Send private message
Brian Donahue



Joined: 23 Aug 2004
Posts: 6641

PostPosted: Fri Aug 10, 2012 4:33 pm    Post subject: Reply with quote

SQL Monitor records the job_status from the job history. sp_stop_job should end the job gracefully and allow the current jobstep to complete, however, MS warns that if the job was doing some CmdExec or PowerShell command, the outcome could be unpredictable...
http://msdn.microsoft.com/en-us/library/ms182793.aspx

This may be the reason for the inconsistent behavior.
Back to top
View user's profile Send private message
fifthrace



Joined: 08 Jun 2012
Posts: 12
Location: Indianapolis, IN

PostPosted: Fri Aug 10, 2012 5:55 pm    Post subject: Reply with quote

Thanks for the reply.

I really don't see anything different between the runs that show up in sql monitor and those that don't from a job status perspective. Could you tell me about how the software queries for this data?

I'm just looking at the sysjobhistory table and seeing a 3 for run_status. Is the software looking at something different than this for it's information.

Thanks for all your help,

Robbie
Back to top
View user's profile Send private message
fifthrace



Joined: 08 Jun 2012
Posts: 12
Location: Indianapolis, IN

PostPosted: Fri Aug 17, 2012 2:01 pm    Post subject: Reply with quote

Brian,
Are you guys taking a look at this any more?

Thanks,
Robbie
Back to top
View user's profile Send private message
priyasinha



Joined: 03 Jan 2007
Posts: 530

PostPosted: Thu Aug 23, 2012 4:23 pm    Post subject: Reply with quote

Hi Robbie,

I apologize for delay in reply.

SQL Monitor retrieves information from msdb.dbo.sysjobhistory. Then it looks at run_status value and raises an alert for all run_status other than 1. In other words, it alerts you for failed, retry and cancelled jobs.

Is this the behaviour you are seeing in your environment?

Regarding default behaviour, unfortunately there is no way to change this behaviour.

Thanks,
Priya
Back to top
View user's profile Send private message
fifthrace



Joined: 08 Jun 2012
Posts: 12
Location: Indianapolis, IN

PostPosted: Thu Aug 23, 2012 7:08 pm    Post subject: Reply with quote

Priya,
Thanks for the reply. If what you are saying is true, then SQL Monitor is misbehaving for me. As stated in my original post, we are stopping the jobs via another job. This should result in a run_status of 3 for the jobs that were stopped. When I use this query I get 188 rows returned:
Code:

select step_name, run_status, run_date, run_time
from sysjobhistory
where run_status <> 1 and run_date = '20120822'
   and step_name <> '(Job outcome)'

SQL Monitor only registered 22 job failures yesterday (22nd of August). I can look at some of the jobs that SQL Monitor is showing as failed and see that the run_status was indeed 3....however, there are many jobs with a run_status = 3 that did not get alerts yesterday.

When SQL Monitor does report on a run_status = 3, it tends to report on all the jobs that were just set to that status as a result of the last stop jobs run. Most of the time SQL Monitor only reports on jobs that have a run_status = 0 (not sure on 2, since we don't get a lot of those).

Is there anything in the code that would explain this behavior? Could this be a timing issue of some sort?

Thanks for your help,
Robbie
Back to top
View user's profile Send private message
priyasinha



Joined: 03 Jan 2007
Posts: 530

PostPosted: Fri Aug 24, 2012 2:35 pm    Post subject: Reply with quote

Hi Robbie.

Thanks for the reply.

Could you please run the following query on Monitor database for the job which is causing problem?

Code:
select count(*) from [SqlMonitorDatabase].[data].[Cluster_SqlServer_Agent_Job_History_Instances_View] where Cluster_Name = '[MachineName]' and Cluster_SqlServer_Name = '[SQLServerInstanceName]' and Cluster_SqlServer_Agent_Job_Name = '[JobName]' and Cluster_SqlServer_Agent_Job_History_RunDate_DateTime > '2012-08-21'


Then run your query (bit modified):

Code:
select  count(*)
from msdb.dbo.sysjobhistory
where run_date > '20120821'
   and step_name <> '(Job outcome)' and job_id = 'jobid'


Could you let us know how many rows each query return?

Thanks,
Priya
Back to top
View user's profile Send private message
fifthrace



Joined: 08 Jun 2012
Posts: 12
Location: Indianapolis, IN

PostPosted: Fri Aug 24, 2012 3:11 pm    Post subject: Reply with quote

I changed your 2nd query a little bit. Changed the > to >=. If I ran your original code I wasn't getting any records from 8/21.
Code:

select count(*)
from msdb.dbo.sysjobhistory
where run_date >= '20120821'
   and step_name <> '(Job outcome)' and job_id = 'jobid'


Anyway, with the revised code, I got 38 for both queries. So, they seem to match at least.

I dug into the data a bit. For this particular job there were 37 times that it had a run_status of 3 (stopped) and 1 time run_status of 0 (failed). Both msdb and the SQL Monitor database match on these values.

I don't know how to dig into the alert.Alert table in the SQL Monitor database, but I did look through the UI. I only received 3 alerts for the job in question.

Thanks for digging into this with me.

Robbie
Back to top
View user's profile Send private message
priyasinha



Joined: 03 Jan 2007
Posts: 530

PostPosted: Fri Aug 24, 2012 3:39 pm    Post subject: Reply with quote

Hi Robbie,

What is your purge setting?

Thanks,
Priya
Back to top
View user's profile Send private message
fifthrace



Joined: 08 Jun 2012
Posts: 12
Location: Indianapolis, IN

PostPosted: Fri Aug 24, 2012 3:42 pm    Post subject: Reply with quote

Basic machine data: 2 weeks
Windows process data: 1 week
Basic SQL Server data: 2 weeks
SQL process data: 1 week
Top 10 queries data: 1 week
Database performance counter data: 2 weeks
Custom metric data: 1 week
Basic alert data: 1 month
SQL Trace Data: 3 days
Back to top
View user's profile Send private message
priyasinha



Joined: 03 Jan 2007
Posts: 530

PostPosted: Wed Sep 05, 2012 3:44 pm    Post subject: Reply with quote

Hi Robbie,

Apologies for delay in reply. Could you please run the following query in your environment?

Counts the number of unsuccessful job runs in Monitor database since yesterday

Code:
select COUNT(*) from [data].[Cluster_SqlServer_Agent_Job_History_Instances_View] where Cluster_Name = [MachineName] and Cluster_SqlServer_Name = [InstanceName] and Cluster_SqlServer_Agent_Job_History_RunDate_DateTime > = '2012-09-04' and Cluster_SqlServer_Agent_Job_History_RunStatus <> 1


Counts the number of unsuccessful runs in sysjobhistory since yesterday. This is to make sure that we have collected the job history correctly. You may have to verify whether SQL Server enters the date in this table in UTC or local date time. If it is local date time then you may have to change query to accommodate your time zone changes. Everything in Monitor database is in UTC .

Code:
   select count(*) from msdb.dbo.sysjobhistory
where run_status <> 1 and run_date >= '20120904'
   and step_name <> '(Job outcome)'


Will show the number of each type of alerts which has been raised since yesterday grouped by server.

Code:
WITH   target_obj
              AS ( SELECT  [c].Name AS [clustername] ,
                                         '7:Cluster,1,4:Name,s'
                                         + CAST(LEN([c].[Name]) AS VARCHAR(10)) + ':'
                                         + [c].[Name] + ',' AS to_Cluster ,
                                         ss.Name AS [servername] ,
                                         '9:SqlServer,1,4:Name,s'
                                         + CAST(LEN([ss].[Name]) AS VARCHAR(10)) + ':'
                                         + [ss].[Name] AS to_SQLServer
                       FROM        [settings].[Clusters] AS c
                                         INNER JOIN [settings].[SqlServers] AS ss ON [c].[Id] = [ss].[ParentId]
                     )
       SELECT [t2].[clustername] AS [Cluster] ,
                     t2.[servername] AS [SQLServer] ,
                     ISNULL(cad.[Name], at2.[Name]) AS [Alert Name] ,
                     COUNT(*) AS [Alert Count]
       FROM   [alert].[Alert_Current] AS a
                     INNER JOIN [alert].[Alert_Type] AS at2 ON [a].[AlertType] = [at2].[AlertType]
                     LEFT JOIN [settings].[CustomAlertDefinitions] AS cad ON [a].[SubType] = [cad].[Id]
                     CROSS  APPLY ( SELECT      [t].[clustername] ,
                                                              [t].[to_Cluster] ,
                                                              [t].[servername] ,
                                                              [t].[to_SQLServer] ,
                                                              [t].[to_Cluster] + [t].[to_SQLServer] AS to_cluster_sqlserver
                                           FROM        [target_obj] AS t
                                           WHERE       LEFT([a].[TargetObject],
                                                                     LEN([t].[to_Cluster]
                                                                           + [t].[to_SQLServer])) COLLATE database_default = [t].[to_Cluster]
                                                              + [t].[to_SQLServer]
                                           
                                         ) AS t2
       WHERE a.Raised >= utils.DateTimeToTicks('2012/09/04')
       GROUP BY [t2].[clustername] ,
                     t2.[servername] ,
                     ISNULL(cad.[Name], at2.[Name])
       ORDER BY [t2].[clustername] ,
                     t2.[servername] ,
                     ISNULL(cad.[Name], at2.[Name])


Could you please let us know the result of each query?

Thanks,
Priya
Back to top
View user's profile Send private message
fifthrace



Joined: 08 Jun 2012
Posts: 12
Location: Indianapolis, IN

PostPosted: Thu Sep 06, 2012 2:48 pm    Post subject: Reply with quote

Here are the results (I hope I got the UTC stuff figured out right, but can't guarantee it. Either way the number of alerts is quite a bit lower.)
The number of unsuccessful job runs in Monitor database since 9/4
390

The number of unsuccessful runs in sysjobhistory since 9/4
447

The number of each type of alerts which has been raised since 9/4 grouped by server. SERVER3 is the one in question here:
SERVER1 Long-running query 23
SERVER1 Schema modified - ISDB01 34
SERVER1 Instance1 Long-running query 18
SERVER1 Instance1 Schema modified - ISDB01 3
SERVER1 Instance2 Long-running query 19
SERVER1 Instance2 Schema modified - ISDB01 13
SERVER2 Long-running query 3
SERVER3 Blocked process 27
SERVER3 Job failed 22
SERVER3 Long-running query 71
SERVER3 SQL Server error log entry 2
SERVER3 SQL Server instance unreachable 1

SERVER3 Instance1 Long-running query 17
SERVER4 Deadlock 2
SERVER4 Long-running query 40
SERVER4 Monitoring error (SQL Server data collection) 1
SERVER5 Deadlock 31
SERVER5 Instance1 Long-running query 3
SERVER6 Blocked process 1
SERVER6 Deadlock 1
SERVER6 Log backup overdue 1
SERVER6 Long-running query 1
SERVER6 Instance1 Deadlock 1
SERVER6 Instance1 Long-running query 15
SERVER7 Backup overdue 46
SERVER7 Deadlock 17
SERVER7 Job failed 8
SERVER7 Long-running query 3
Back to top
View user's profile Send private message
priyasinha



Joined: 03 Jan 2007
Posts: 530

PostPosted: Thu Sep 06, 2012 4:45 pm    Post subject: Reply with quote

Hello,

Have you just got one job on this server? Also, is any 'job failure' alert disabled in Alert Settings?

I notice that you have one 'SQL Server instance unreachable' alert on this server. How long this alert was live for?

Thanks,
Priya
Back to top
View user's profile Send private message
fifthrace



Joined: 08 Jun 2012
Posts: 12
Location: Indianapolis, IN

PostPosted: Thu Sep 06, 2012 6:39 pm    Post subject: Reply with quote

There are more than 20 jobs running on this particular server.

The "SQL Server instance unreachable" lasted for 14 seconds and has only occurred one other time for this server. That was on August 7th. I think this is probably a coincidence and not related to this issue at all. The number of job failures seems to be significantly higher than the number of alerts I'm seeing.

None of the alert settings are disabled from what I can tell. There are a couple jobs that I manually changed the recipient of the email that gets sent out.

Thanks for continuing to look into this.

Robbie
Back to top
View user's profile Send private message
priyasinha



Joined: 03 Jan 2007
Posts: 530

PostPosted: Fri Sep 07, 2012 3:32 pm    Post subject: Reply with quote

Hi Robbie,

I have emailed you directly.

Thanks,
Priya
Back to top
View user's profile Send private message
Display posts from previous:   
Reply to topic All times are GMT + 1 Hour
Go to page 1, 2  Next
Page 1 of 2

 
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group