We already have a query to detect long running jobs:
http://www.cookingsql.com/2017/02/detect-long-running-jobs-store-proc/
So the next logical step is to encapsulate the code on your DBA database and run it all the time. If a job is considered "long running", we should alert the DBA team.
There are several things we need to take in consideration:
-
Define what a long running job is.
-
Define how much data we are going to use to determine what a long running job is.
-
Ensure that we don't send emails too often.
A long running job is defined as one that's been running longer than two standard deviation above the average.
You can modify this definition within the code, but it's not really recommended.
The question of how much data we are going to use has to do with ensuring the average is meaningful. Within the code we are taking the average of all data points within the last 30 days, and only consirer jobs that have at least @MinHistExecutions executions (with 10 the default). Additionally, we are only interested on jobs that their duration takes longer than @MinAvgSecsDuration (with 5 mins the default). The idea of these parameters is to reduce excessive notifications.
Since we don't want to flood the DBA team with notfications, we have within the code some logic that prevents from sending emails more often than 30 mins.
In order to keep track, we are forced to use a table, which should be created within the DBA team's database.
Here is the script to create the table:
USE [DBA]
GO
/****** Object: Table [dbo].[Jobs_Running_Long] Script Date: 1/27/2017 3:16:15 PM ******/
IF OBJECT_id('dbo.Jobs_Running_Long') IS NOT NULL
DROP TABLE [dbo].[Jobs_Running_Long]
GO
/****** Object: Table [dbo].[Jobs_Running_Long] Script Date: 1/27/2017 3:16:15 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_PADDING ON
GO
CREATE TABLE [dbo].[Jobs_Running_Long](
[job_id] uniqueidentifier NOT NULL,
[JobName] sysname NOT NULL,
[State] varchar(29) NOT NULL,
[ExecutionDate] datetime NOT NULL,
[AverageSecs] numeric(38, 6) NOT NULL,
[DurationSecs] float NOT NULL,
[capture_time] datetime NULL,
[email_done] BIT NULL
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
After we create the table, we can create the Store Procedure, which is the one that does all the work:
USE DBA
GO
IF OBJECT_ID('dbO.SP_DETECT_LONG_RUNNING_JOBS') IS NOT NULL
drop procedure SP_DETECT_LONG_RUNNING_JOBS
go
CREATE PROCEDURE SP_DETECT_LONG_RUNNING_JOBS
@MinHistExecutions int = 10.0
,@MinAvgSecsDuration int = 600.0
,@JOB_STATE INT = 1
,@skip_email INT = 0
AS
BEGIN
/*=============================================
File: long_running_jobs.sql
Author: Thomas LaRock, http://thomaslarock.com/contact-me/
MODIFIED AND ADAPTED BY MIGUEL QUINTANA 01/27/2017
Summary: This script will check to see if any currently
running jobs are running long.
Sample:
exec SP_DETECT_LONG_RUNNING_JOBS @MinHistExecutions=5, @MinAvgSecsDuration =600 , @JOB_STATE =4 , @skip_email = 1
Variables:
@MinHistExecutions - Minimum number of job runs we want to consider
@MinAvgSecsDuration - Threshold for minimum duration we care to monitor
@HistoryStartDate - Start date for historical average
@HistoryEndDate - End date for historical average
These variables allow for us to control a couple of factors. First
we can focus on jobs that are running long enough on average for
us to be concerned with (say, 30 seconds or more). Second, we can
avoid being alerted by jobs that have run so few times that the
average and standard deviations are not quite stable yet. This script
leaves these variables at 1.0, but I would advise you alter them
upwards after testing.
Returns: One result set containing a list of jobs that
are currently running and are running longer than two standard deviations
away from their historical average. The "Min Threshold" column
represents the average plus two standard deviations.
Date: October 3rd, 2012
SQL Server Versions: SQL2005, SQL2008, SQL2008R2, SQL2012
You may alter this code for your own purposes. You may republish
altered code as long as you give due credit.
THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR
FITNESS FOR A PARTICULAR PURPOSE.
=============================================*/
SET NOCOUNT ON
DECLARE
@HistoryStartDate datetime
,@HistoryEndDate datetime
SET @HistoryStartDate = DateAdd(dd,-30,GETDATE()) --'19000101'
SET @HistoryEndDate = GETDATE()
Declare @DBATeam nvarchar(100) = 'DBASpt@cir2.com'
/***** DO NOT MODIFY BELOW *********/
DECLARE @RESULTS TABLE (
[job_id] [uniqueidentifier] NOT NULL,
[JobName] [sysname] NOT NULL,
[State] [varchar](29) NOT NULL,
[ExecutionDate] [datetime] NULL,
[AverageSecs] [numeric](38, 6) NULL,
[DurationSecs] [float] NULL,
[capture_time] [datetime] NOT NULL
)
DECLARE @currently_running_jobs TABLE (
job_id UNIQUEIDENTIFIER NOT NULL
,last_run_date INT NOT NULL
,last_run_time INT NOT NULL
,next_run_date INT NOT NULL
,next_run_time INT NOT NULL
,next_run_schedule_id INT NOT NULL
,requested_to_run INT NOT NULL
,request_source INT NOT NULL
,request_source_id SYSNAME NULL
,running INT NOT NULL
,current_step INT NOT NULL
,current_retry_attempt INT NOT NULL
,job_state INT NOT NULL
)
--capture details on jobs
INSERT INTO @currently_running_jobs
EXECUTE master.dbo.xp_sqlagent_enum_jobs 1,''
;WITH JobHistData AS
(
SELECT job_id
,date_executed=msdb.dbo.agent_datetime(run_date, run_time)
,secs_duration=run_duration/10000*3600
+run_duration%10000/100*60
+run_duration%100
FROM msdb.dbo.sysjobhistory
WHERE step_id = 0 --Job Outcome
AND run_status = 1 --Succeeded
)
,JobHistStats AS
(
SELECT job_id
,AvgDuration = AVG(secs_duration*1.)
,AvgPlus2StDev = AVG(secs_duration*1.) + 2*stdevp(secs_duration)
FROM JobHistData
WHERE date_executed >= DATEADD(day, DATEDIFF(day,'19000101',@HistoryStartDate),'19000101')
AND date_executed < DATEADD(day, 1 + DATEDIFF(day,'19000101',@HistoryEndDate),'19000101')
GROUP BY job_id HAVING COUNT(*) >= @MinHistExecutions
AND AVG(secs_duration*1.) >= @MinAvgSecsDuration
)
INSERT INTO @RESULTS
SELECT JHS.job_id
,j.name AS [JobName]
,case crj.job_state
when 0 then 'not idle or suspended'
when 1 then 'Executing'
when 2 then 'Waiting for thread'
when 3 then 'Between retries'
when 4 then 'Idle'
when 5 then 'Suspended'
when 7 then 'Performing completion actions'
else ''
end as [State]
,MAX(act.start_execution_date) AS [ExecutionDate]
,AvgDuration AS [AverageSecs]
,AvgPlus2StDev AS [DurationSecs]
,GETDATE() AS [email_Done]
FROM JobHistData jd
JOIN JobHistStats jhs on jd.job_id = jhs.job_id
JOIN msdb..sysjobs j on jd.job_id = j.job_id
JOIN @currently_running_jobs crj ON crj.job_id = jd.job_id
JOIN msdb..sysjobactivity AS act ON act.job_id = jd.job_id
AND act.stop_execution_date IS NULL
AND act.start_execution_date IS NOT NULL
WHERE 1=1
AND secs_duration > AvgPlus2StDev
AND DATEDIFF(SS, act.start_execution_date, GETDATE()) > AvgPlus2StDev
-- THIS SECTION EXCLUDE JOBS ALREADY IN THE SYSTEM FOR LESS THAN 30 MINS
AND JHS.job_id NOT IN (SELECT A.JOB_ID FROM DBA.dbo.JOBS_Running_Long A WHERE DATEDIFF(mi,A.CAPTURE_TIME,GETDATE()) < 30)
AND crj.job_state = @JOB_STATE
GROUP BY JHS.job_id, j.name, AvgDuration, AvgPlus2StDev, crj.job_state
IF @SKIP_EMAIL = 1
BEGIN
SELECT JobName, State, ExecutionDate,
convert(decimal(8,1),AverageSecs/60) 'Average in Mins' ,
convert(decimal(8,1),DurationSecs/60) 'Duration in Mins'
FROM @RESULTS
END
IF @skip_email = 0 AND @JOB_STATE = 1
BEGIN
INSERT INTO DBA.dbo.Jobs_Running_Long
SELECT [job_id]
,[JobName]
,[State]
,[ExecutionDate]
,[AverageSecs]
,[DurationSecs]
,[capture_time]
,NULL
FROM @RESULTS
-- /* Builds an Email for the DBAs
DECLARE @SUBJECT NVARCHAR(100)
DECLARE @BODY NVARCHAR(MAX)
DECLARE @xml NVARCHAR(MAX)
SET @SUBJECT = 'LONG RUNNING QUERY DETECTED ON '+CAST(SERVERPROPERTY('SERVERNAME') AS VARCHAR(15))
SET @xml = CAST((
SELECT
td = [JobName],'',
[td/@align]='right', td = convert(decimal(10,2),[AverageSecs]/60),'',
[td/@align]='right', td = convert(decimal(10,2),[DurationSecs]/60),''
FROM DBA.DBO.JOBS_RUNNING_LONG
WHERE email_done is null
FOR XML PATH('tr'), ELEMENTS ) AS NVARCHAR(MAX))
SET @body =
'
Long running Job Detected
One or more jobs were found to be running longer than average, please investigate. Please note, the job might have stopped by the time you receive this message. Check the table dba.dbo.jobs_running_long for details
Job Name | AverageDuration (Mins) | Duration_Mins |
'
SET @body = @body + @xml +'
'
IF (SELECT COUNT(*) FROM DBA.DBO.JOBS_RUNNING_LONG WHERE email_done is null)>0
begin
EXEC MSDB.DBO.SP_SEND_DBMAIL
@RECIPIENTS = @DBATeam,
@SUBJECT = @SUBJECT,
@BODY = @BODY,
@body_format ='HTML'
end
UPDATE DBA.DBO.JOBS_RUNNING_LONG SET email_done = 1 WHERE email_done IS NULL
END -- SKIP EMAIL
ELSE
PRINT '@SKIP EMAIL = 0 MUST BE USED WITH @JOB_STATE = 1. Aborting Email'
END
It is important to note that the store proc uses the default mail profile.
After all of this is created, the DBA team can run the following command to check the status of jobs:
-- For running jobs
exec SP_DETECT_LONG RUNNING_JOBS @MinHistExecutions=5, @MinAvgSecsDuration =600 , @JOB_STATE =1 , @skip_email = 1
-- For idle jobs
exec SP_DETECT_LONG_RUNNING_JOBS @MinHistExecutions=5, @MinAvgSecsDuration =600 , @JOB_STATE =4 , @skip_email = 1
After the store proc is ready, We can create a job that runs every 15 mins with one of the statements above.
.