The Usual Tech Ramblings

SQL Job Monitoring

We have a weird issue at work where an import task will occasionally hang waiting for nothing. This can sometimes be harmless, and othertimes disastrous. When it’s disastrous, it causes the main table in our database to be locked, denying other services/users from performing any inserts, or updates, or even selects. This is a big problem… obviously.

When we were hosted at an AT&T data center, this wasn’t too much of an issue as they had the facilities to monitor, and alert on a whole range of SQL monitors. We’d get alerted as soon as a job would be running too long, or if there were too many blocked processes.

Now we’ve moved to the corporate data center, those tools aren’t available any more, and they don’t have the facilities to implement them. The predicted time frame is end of Q2, to the middle of Q3, before being able to expand any of these features. So that leaves me. I had to some how figure out a way to monitor a long running SQL job, and generate alerts based on that information. This lead me on a path into the depths of the msdb database that I’d rather not have to follow… ever.

There are a couple of stored procedures, that Microsoft provide, which give some insight into the status, and information, about a job. The first was sp_help_job. This handy little procedure provides information such as last start time, job status, and such. This is, what I thought, would be my savior in the search for the long running jobs. According to the SQL documentation, the last start time was set as soon as a job started, and as I had the running status, I could easily do some calculations based on these. So the first step was to put it somewhere I could handle it. I was thinking of a temporary table, so I attempted the standard code:

insert into ##tmpJobs
  exec msdb.dbo.sp_help_job

This failed, and generated the following error message:

Server: Msg 8164, Level 16, State 1, 
  Procedure sp_get_composite_job_info, Line 67
An INSERT EXEC statement cannot be nested.

This was entirely baffling, I’ve never seen this before. So I started Googling around, and stumbled across this little item. Apparently the sp_help_job code executes an INSERT EXEC statement. This means you cannot do a secondary insert using the output data. However, the page goes on to explain a better resolution.

That better resolution comes in the form of xp_sqlagent_enum_jobs. This is an extended stored procedure found in the msdb database. The site then goes on to show a nice example on usage, which goes something like this:

create table #enum_job ( 
    Job_ID uniqueidentifier, 
    Last_Run_Date int, 
    Last_Run_Time int,
    Next_Run_Date int, 
    Next_Run_Time int, 
    Next_Run_Schedule_ID int, 
    Requested_To_Run int, 
    Request_Source int, 
    Request_Source_ID varchar(100), 
    Running int, 
    Current_Step int, 
    Current_Retry_Attempt int, 
    State int 

insert into #enum_job 
     exec master.dbo.xp_sqlagent_enum_jobs 1,garbage  
select * from #enum_job

So now I have a table called #enum_job that contains a list of all the jobs. Using the State column, we know if its running or not, and using the Last_Run_Date, and Last_Run_Time we can find the last start time. This is where a row of issues flooded out.

The first issue was relating to Microsoft not using a simple datetime field for their last run date, and last run time fields. In fact, they used a fairly generic format. For the date, they used the format yyyymmdd, and for the time, they used hhmmss. Oddly, the convert function could handle the date in the format that was stored, but didn’t know what to do with the time format. As they were both stored as integers, if your job executed any time before 12pm, the length of the value would be truncated. For example, and job that executed at 04:15:00, would look like 41500. A job executed at 12:15:01am would look like 1501. So some math was required. This was beginning make my head hurt. So I had to think about the conversions. Did I mention yet that I was given 4 hours to complete this in? No? Well I was… So with more playing with math, I ended up building the following custom function:

create function [dbo].[fn_ConvJobDateTime2DateTime] 
  (@jobdate int, @jobtime int)
returns datetime
  declare @date datetime,
    @year int,
    @month int,
    @day int,
    @hour int,
    @min int,
    @sec int,
    @datestr nvarchar(40)

    select @year = (@jobdate / 10000)
    select @month = (@jobdate - (@year * 10000)) / 100
    select @day = (@jobdate - (@year * 10000) - (@month * 100))

    select @hour = (@jobtime / 10000)
    select @min = (@jobtime - (@hour * 10000)) / 100
    select @sec = (@jobtime - (@hour * 10000) - (@min * 100))

    select @datestr = convert(nvarchar(4), @year) + N'-' +
                  convert(nvarchar(2), @month) + N'-' +
                  convert(nvarchar(4), @day) + N' ' +
                  replace(convert(nvarchar(2), @hour) + N':' +
                          convert(nvarchar(2), @min) + N':' +
                          convert(nvarchar(2), @sec), ' ', '0')
  select @date = convert(datetime, @datestr)

  return @date

This custom function will take the crappy formatted date and time strings from the job format, and output a true datetime format. This means that I can now take the start time/date, and do some real date math on it, such as datediff This is where issue number 2 stumbled in…

Apparently, Microsoft’s documentation is a little off. While the documentation says the last run date/time is updated on the start of a job, it is in reality, not updated until the job ends. Saying that however, the state is updated. So I was one step closer. After a little thinking, I ended up with 2 permanent tables, and a handful of temporary tables. The 2 permanent tables consisted of a job state, and a job threshold. The job threshold is fairly self-explanatory. We set a threshold for the jobs before an alert is triggered. The table consists of two fields, the job_id, and the threshold. We used the history of the jobs to gauge an average execution time, and bumped the threshold a few minutes higher. The second table is a permanent temporary table. It basically contains the job_id, the time the job was last run, and the datetime the record was created. If the job record exists already, a comparison is first done against the last run date/time. If they are not the same, then the record is updated with the new data, and a new “creation” time. If the last run times are the same, a date comparison is done between the record time, and getdate(). This is then compared against the threshold time for that job. Once a job exceeds the threshold, it is dropped into a temporary table which is used to generate an email. This email then contains a list of all the running jobs.

This took me 2 hours to write, and has been running since Friday of last week. So far, we’ve not gone past the threshold (I’ve verified a few of the jobs), so it seems all is well.

While I was writing this up, I tried searching for the original item I had used to figure out the solution to the INSERT EXEC error, I stumbled across this. This article demonstrates a different method based on the login time of the process that is executing the job. I will have to explore that as another option, but I think in our environment, it might cause false positives as some jobs call other jobs to run, and I’m not sure how that’d tie in with connections.

All in all, this has been an interesting exploration of the SQL Jobs information. One I really don’t wish to do on a frequent basis. Now it’s beer time…