Detect long Running Jobs Store Proc

We already have a query to detect long running jobs:

http://www.cookingsql.com/2017/02/detect-long-running-jobs-store-proc/

So the next logical step is to encapsulate the code on your DBA database and run it all the time. If a job is considered "long running", we should alert the DBA team.

There are several things we need to take in consideration:

  • Define what a long running job is.
  • Define how much data we are going to use to determine what a long running job is.
  • Ensure that we don't send emails too often.

A long running job is defined as one that's been running longer than two standard deviation above the average.

You can modify this definition within the code, but it's not really recommended.

The question of how much data we are going to use has to do with ensuring the average is meaningful.  Within the code we are taking the average of all data points within the last 30 days, and only consirer jobs that have at least @MinHistExecutions executions (with 10 the default).  Additionally, we are only interested on jobs that their duration takes longer than @MinAvgSecsDuration (with 5 mins the default).  The idea of these parameters is to reduce excessive notifications.

Since we don't want to flood the DBA team with notfications, we have within the code some logic that prevents from sending emails more often than 30 mins.

In order to keep track, we are forced to use a table, which should be created within the DBA team's database.

Here is the script to create the table:


USE [DBA]
GO

/****** Object:  Table [dbo].[Jobs_Running_Long]    Script Date: 1/27/2017 3:16:15 PM ******/

IF OBJECT_id('dbo.Jobs_Running_Long') IS NOT NULL
DROP TABLE [dbo].[Jobs_Running_Long]
GO

/****** Object:  Table [dbo].[Jobs_Running_Long]    Script Date: 1/27/2017 3:16:15 PM ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[Jobs_Running_Long](
	[job_id]	 uniqueidentifier	NOT NULL,
	[JobName]	 sysname			NOT NULL,
	[State]		 varchar(29)		NOT NULL,
	[ExecutionDate]	datetime		NOT NULL,
	[AverageSecs] numeric(38, 6)	NOT NULL,
	[DurationSecs] float			NOT NULL,
	[capture_time] datetime			NULL,
	[email_done]   BIT				NULL

) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

After we create the table, we can create the Store Procedure, which is the one that does all the work:


USE DBA
GO
IF OBJECT_ID('dbO.SP_DETECT_LONG_RUNNING_JOBS') IS NOT NULL
drop procedure SP_DETECT_LONG_RUNNING_JOBS

go


CREATE PROCEDURE SP_DETECT_LONG_RUNNING_JOBS 
   @MinHistExecutions int   = 10.0
  ,@MinAvgSecsDuration int  = 600.0
  ,@JOB_STATE INT = 1
  ,@skip_email INT = 0
AS
BEGIN
/*=============================================
  File: long_running_jobs.sql
 
  Author: Thomas LaRock, http://thomaslarock.com/contact-me/
 
 MODIFIED AND ADAPTED BY MIGUEL QUINTANA 01/27/2017

  Summary: This script will check to see if any currently
			running jobs are running long. 
Sample:
exec SP_DETECT_LONG_RUNNING_JOBS @MinHistExecutions=5, @MinAvgSecsDuration =600 , @JOB_STATE =4 , @skip_email = 1

 
  Variables:
	@MinHistExecutions - Minimum number of job runs we want to consider 
	@MinAvgSecsDuration - Threshold for minimum duration we care to monitor
        @HistoryStartDate - Start date for historical average
        @HistoryEndDate - End date for historical average
 
        These variables allow for us to control a couple of factors. First
        we can focus on jobs that are running long enough on average for
        us to be concerned with (say, 30 seconds or more). Second, we can
        avoid being alerted by jobs that have run so few times that the
        average and standard deviations are not quite stable yet. This script
        leaves these variables at 1.0, but I would advise you alter them
        upwards after testing.
 
  Returns: One result set containing a list of jobs that
	are currently running and are running longer than two standard deviations 
        away from their historical average. The "Min Threshold" column
        represents the average plus two standard deviations. 
 
  Date: October 3rd, 2012
 
  SQL Server Versions: SQL2005, SQL2008, SQL2008R2, SQL2012
 
  You may alter this code for your own purposes. You may republish
  altered code as long as you give due credit. 
 
  THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY
  OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
  LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR
  FITNESS FOR A PARTICULAR PURPOSE.
 
=============================================*/

SET NOCOUNT ON

DECLARE   
	@HistoryStartDate datetime 
   ,@HistoryEndDate datetime  
   
SET @HistoryStartDate = DateAdd(dd,-30,GETDATE()) --'19000101'
SET @HistoryEndDate = GETDATE()

Declare @DBATeam nvarchar(100) = 'DBASpt@cir2.com'



/***** DO NOT MODIFY BELOW *********/
DECLARE @RESULTS TABLE (
	[job_id] [uniqueidentifier] NOT NULL,
	[JobName] [sysname] NOT NULL,
	[State] [varchar](29) NOT NULL,
	[ExecutionDate] [datetime] NULL,
	[AverageSecs] [numeric](38, 6) NULL,
	[DurationSecs] [float] NULL,
	[capture_time] [datetime] NOT NULL
)

 
DECLARE @currently_running_jobs TABLE (
    job_id UNIQUEIDENTIFIER NOT NULL
    ,last_run_date INT NOT NULL
    ,last_run_time INT NOT NULL
    ,next_run_date INT NOT NULL
    ,next_run_time INT NOT NULL
    ,next_run_schedule_id INT NOT NULL
    ,requested_to_run INT NOT NULL
    ,request_source INT NOT NULL
    ,request_source_id SYSNAME NULL
    ,running INT NOT NULL
    ,current_step INT NOT NULL
    ,current_retry_attempt INT NOT NULL
    ,job_state INT NOT NULL
    ) 
 
--capture details on jobs
INSERT INTO @currently_running_jobs
EXECUTE master.dbo.xp_sqlagent_enum_jobs 1,''
 
;WITH JobHistData AS
(
  SELECT job_id
	,date_executed=msdb.dbo.agent_datetime(run_date, run_time)
	,secs_duration=run_duration/10000*3600
                      +run_duration%10000/100*60
                      +run_duration%100
  FROM msdb.dbo.sysjobhistory
  WHERE step_id = 0   --Job Outcome
  AND run_status = 1  --Succeeded
)
,JobHistStats AS
(
  SELECT job_id
        ,AvgDuration = AVG(secs_duration*1.)
        ,AvgPlus2StDev = AVG(secs_duration*1.) + 2*stdevp(secs_duration)
  FROM JobHistData
  WHERE date_executed >= DATEADD(day, DATEDIFF(day,'19000101',@HistoryStartDate),'19000101')
  AND date_executed < DATEADD(day, 1 + DATEDIFF(day,'19000101',@HistoryEndDate),'19000101')   
  GROUP BY job_id   HAVING COUNT(*) >= @MinHistExecutions
  AND AVG(secs_duration*1.) >= @MinAvgSecsDuration
)

INSERT INTO  @RESULTS 
SELECT JHS.job_id	
		,j.name AS [JobName]
      ,case crj.job_state
		when 0 then 'not idle or suspended'
		when 1 then	'Executing'
		when 2 then	'Waiting for thread'
		when 3 then	'Between retries'
		when 4 then	'Idle'
		when 5 then	'Suspended'
		when 7 then	'Performing completion actions'
		else ''
 	   end as [State]
	  ,MAX(act.start_execution_date) AS [ExecutionDate]
      ,AvgDuration AS [AverageSecs]
      ,AvgPlus2StDev AS [DurationSecs]
	  ,GETDATE() AS [email_Done]
FROM JobHistData jd
JOIN JobHistStats jhs on jd.job_id = jhs.job_id
JOIN msdb..sysjobs j on jd.job_id = j.job_id
JOIN @currently_running_jobs crj ON crj.job_id = jd.job_id
JOIN msdb..sysjobactivity AS act ON act.job_id = jd.job_id
AND act.stop_execution_date IS NULL
AND act.start_execution_date IS NOT NULL

WHERE 1=1
AND secs_duration > AvgPlus2StDev
AND DATEDIFF(SS, act.start_execution_date, GETDATE()) > AvgPlus2StDev

-- THIS SECTION EXCLUDE JOBS ALREADY IN THE SYSTEM FOR LESS THAN 30 MINS
AND JHS.job_id NOT IN (SELECT A.JOB_ID FROM DBA.dbo.JOBS_Running_Long A WHERE DATEDIFF(mi,A.CAPTURE_TIME,GETDATE()) < 30)


AND crj.job_state = @JOB_STATE


GROUP BY  JHS.job_id, j.name, AvgDuration, AvgPlus2StDev, crj.job_state


IF @SKIP_EMAIL = 1

BEGIN

SELECT JobName, State, ExecutionDate,
convert(decimal(8,1),AverageSecs/60) 'Average in Mins' ,
convert(decimal(8,1),DurationSecs/60) 'Duration in Mins'
FROM @RESULTS

END


IF @skip_email = 0 AND @JOB_STATE = 1
BEGIN


INSERT INTO DBA.dbo.Jobs_Running_Long
SELECT [job_id]
      ,[JobName]
      ,[State]
      ,[ExecutionDate]
      ,[AverageSecs]
      ,[DurationSecs]
	  ,[capture_time]
	  ,NULL 
FROM @RESULTS


-- /*  Builds an Email for the DBAs

DECLARE @SUBJECT NVARCHAR(100)
DECLARE @BODY NVARCHAR(MAX)
DECLARE @xml NVARCHAR(MAX)

SET @SUBJECT = 'LONG RUNNING QUERY DETECTED ON '+CAST(SERVERPROPERTY('SERVERNAME') AS VARCHAR(15))
SET @xml = CAST((
SELECT
td =					  [JobName],'',
[td/@align]='right', td = convert(decimal(10,2),[AverageSecs]/60),'',
[td/@align]='right', td = convert(decimal(10,2),[DurationSecs]/60),''
FROM  DBA.DBO.JOBS_RUNNING_LONG
WHERE email_done is null 

FOR XML PATH('tr'), ELEMENTS ) AS NVARCHAR(MAX))

SET @body =
'

Long running Job Detected

One or more jobs were found to be running longer than average, please investigate. Please note, the job might have stopped by the time you receive this message. Check the table dba.dbo.jobs_running_long for details

' SET @body = @body + @xml +'
Job Name AverageDuration (Mins) Duration_Mins
' IF (SELECT COUNT(*) FROM DBA.DBO.JOBS_RUNNING_LONG WHERE email_done is null)>0 begin EXEC MSDB.DBO.SP_SEND_DBMAIL @RECIPIENTS = @DBATeam, @SUBJECT = @SUBJECT, @BODY = @BODY, @body_format ='HTML' end UPDATE DBA.DBO.JOBS_RUNNING_LONG SET email_done = 1 WHERE email_done IS NULL END -- SKIP EMAIL ELSE PRINT '@SKIP EMAIL = 0 MUST BE USED WITH @JOB_STATE = 1. Aborting Email' END

It is important to note that the store proc uses the default mail profile.

After all of this is created, the DBA team can run the following command to check the status of jobs:

-- For running jobs
exec SP_DETECT_LONG RUNNING_JOBS @MinHistExecutions=5, @MinAvgSecsDuration =600 , @JOB_STATE =1 , @skip_email = 1
-- For idle jobs
exec SP_DETECT_LONG_RUNNING_JOBS @MinHistExecutions=5, @MinAvgSecsDuration =600 , @JOB_STATE =4 , @skip_email = 1

After the store proc is ready, We can create a job that runs every 15 mins with one of the statements above.

 

.

Leave a Reply

Your email address will not be published. Required fields are marked *