Using SSIS to monitor SQL Server Databases

Rodney Landrum is a DBA who manages over eighty SQL Servers. It is a job that requires a great deal of ingenuity. He needed a way of knowing if there were signs of trouble anywhere in his servers. Here he describes how the use of SSIS saved him a great deal of time in getting a monitoring system in place and working.

Analyzing SQL Server Connections Over Time Using SSIS

I wanted to know if there were any long-running jobs or processes on any of the eighty-odd SQL Servers that I manage, since these are a sign of a problem that has to be investigated and put right. I’d developed a simple query to get the process information from each server, but I wasn’t sure whether running it once every hour in a SQL Agent job would interfere with server performance: if so, would it be worth it?

I soon realized that this could do much more than isolate long-running processes: I found that I could now dive deep into the data for trend analysis, capacity planning, concurrency issues, and even potential security breaches. In this article I will demonstrate the simple SSIS package that I use to collect the process information for these servers. I’ll also show you the queries that I use to analyze the hundreds of thousands of records I collected in just the first week.

Connection Collection

I’ve just said that the query to return the results that I needed was simple. Listing 1 should back me up on that. Notice, though, that even though it is simple, almost everything that you could want is returned. Sure, there are other fields that I could have used, such as a “blocked_by” column like the “BlkBy” column that sp_who2 provides. Also, it would have been useful to know the actual query that was being executed at the time, but this is not a server trace, after all. To use a drummer’s cliché, “Less is more.” What I needed was a way to identify potential problems and then use other tools to get the details.

Listing 1: Simple Connections Query
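In essence, the query runs along the following lines. This is a sketch: treat the exact column list as illustrative rather than definitive, since the join and the Run_Date and @@SERVERNAME columns are the important parts.

-- Sketch of the connections query; the precise column list is
-- illustrative, but everything here exists in sysprocesses and
-- runs on both SQL Server 2000 and 2005.
SELECT  @@SERVERNAME  AS [Server],
        sp.spid,
        sp.status,
        sp.loginame,
        sp.hostname,
        sp.blocked,
        sd.name       AS DBName,
        sp.cmd,
        sp.cpu,
        sp.physical_io,
        sp.login_time,
        sp.last_batch,
        sp.program_name,
        GETDATE()     AS Run_Date
FROM    master.dbo.sysprocesses sp
JOIN    master.dbo.sysdatabases sd
        ON sp.dbid = sd.dbid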

Notice the one join from the sysprocesses table in the master database to the sysdatabases table. This join provides us with the database name, not just the DBID field. Other than that, and the Run_Date and @@SERVERNAME columns, the query is a straightforward result set from sysprocesses. The query will run against both SQL Server 2000 and 2005.

The real trick is to execute the query against all of the SQL Servers and store the information centrally. SSIS accomplishes this nicely. If the query to retrieve the process information is so simple, the SSIS package to run it should be just as simple, right? Absolutely. Figure 1 shows a snapshot of the SSIS package that executes the query for each server, one at a time, and combines the results in a single table, called “SQL_Conn”, in a repository database.

Figure 1: The SSIS package that executes the connections query for each server.

The SSIS package consists of only four objects. We will go through each object individually, pointing out any interesting bits along the way. The solution relies on a repository database called DBA_Rep, consisting, in this case, of two tables; I’ll explain why in a moment. This connection-collection task is a subset of a much larger DBA repository that I use to gather daily information about my SQL Server infrastructure.

Let’s take a closer look at each object to see how it fits into the overall SSIS package execution timeline.

When the package executes, the “Populate ADO Variable” task runs first. It runs only once, and its sole purpose in life is to fill a variable with SQL Server names derived from an underlying table in the DBA_Rep database. The query in Listing 2 is not much to look at, but it is all-important.

Listing 2: Query to Return Server List.
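A sketch of the query, assuming the ServerList_SSIS column names described in a moment:

-- Return the trimmed server names of every server we should connect to.
SELECT  LTRIM(RTRIM([Server])) AS [Server]
FROM    dbo.ServerList_SSIS
WHERE   Connect = 1            -- skip servers flagged as retired or unreachable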

The use of the LTRIM and RTRIM functions in the query stems, I hate to say, from my laziness in not wanting to deal with trimming the names when initially loading the servers into the driving table, ServerList_SSIS. I used “sqlcmd /Lc” to capture the servers on my network and populate the ServerList_SSIS table initially. I use the Server field extensively in joins to other tables. The “Connect” and “DMZ” fields are Boolean flags that I can set to control authentication and connection mechanisms. For example, if I know that the server called AAA has been retired, I set its Connect value to 0 so that it is ignored when the package executes, thereby avoiding a connection error. Likewise, the DMZ flag controls how I connect to each server: via Windows authentication or SQL authentication. In DMZs, as you may know, SQL authentication is sometimes all that is available.

The variable that the query populates is of type “System.Object“. This variable type is required in order to store a result set of more than one server in an array. The variable, SQL_RS, is assigned to be populated by the Execute SQL Task on the Result Set tab that you can see in Figure 2.

Figure 2: The Result Set tab of the Execute SQL Task.

The SQL_RS object variable must then be converted to a string that the next object, the Foreach Loop Container, can use to “roll” through the returned servers. To use the current value of the variable dynamically, the ServerName property must be set via an expression in the Connection Manager object, MultiServer. In the Foreach Loop Container properties, shown in Figure 3, set the Enumerator to be an ADO object source variable: the SQL_RS variable (which syntactically will be “User::SQL_RS”) that we populated earlier.

Figure 3: The Foreach Loop Container properties, with the Enumerator set to an ADO object source variable.

On the Variable Mappings tab, map the object variable to a string variable named “SRV_Conn“. Figure 4 shows the Variable Mappings tab of the Foreach Loop Container object. Notice that the Index is set to 0.

Figure 4: The Variable Mappings tab of the Foreach Loop Container.

The MultiServer connection object in the SSIS package, whose properties can be viewed in Figure 5, uses the SRV_Conn string variable in an expression. By setting the ServerName property of the MultiServer connection to @[User::SRV_Conn], each iteration of the Foreach Loop Container picks up the next server name from the variable and alters the connection string appropriately.

Figure 5: The properties of the MultiServer connection, with the ServerName property driven by an expression.

 

The ‘Get Server SPIDs’ data flow task inside the Foreach Loop Container connects to each server in turn, executes the original connection query and then inserts the results into a single table, “SQL_Conn”, in the repository destination database. Figure 6 shows the Data Flow task with the source and destination objects, as well as the destination properties.

Figure 6: The Get Server SPIDs Data Flow task, showing the source and destination objects and the destination properties.

The final object in the package is the “Write Error Log” script task. It, too, is rather simple. If there is an error in the data flow, the server name, along with a very generic error message, is written to a log file. I use this method instead of an SMTP event handler that would send me mail upon error, mainly because I figure I get enough mail in a day; and since I have set the package to be resilient to up to 100 connection errors, I do not mind if I miss one or two servers occasionally. If there is a connection error, for example, the step will error and continue with the next server rather than failing the entire package. The code that writes the variable out to the log file, which I can check at my leisure, is in Listing 3.

Listing 3: Script to Write to Error Log

So, now that we know the package and what it is going to do, putting it into action is as simple as loading it into SSIS and setting up a SQL Agent job to run it on a set schedule. Since there are no stored credentials in this package, this process should be straightforward if you already know how to schedule an SSIS package to run via SQL Agent. If you don’t, I’ve listed several places at the end of this article where you can find out. Figure 7 shows the job history of the Long_Running_Jobs package in my infrastructure. You’ll see that the average run time for over 80 servers is less than one minute.

Figure 7: Job history of the Long_Running_Jobs package.

I have chosen to run it on a schedule of once every hour, beginning at 4:00 AM and ending at 7:00 PM. For my environment, this pulls many thousands of records per day. Querying this much data obviously necessitates having the appropriate indexes defined on the SQL_Conn table (a sample index follows this paragraph). The code download includes all the scripts that will build the DBA_Rep database and all the tables and indexes required for this solution to work for you. Once you have the solution deployed and running automatically, it is time to start sifting through this mountain of data. As I said at the outset, I had intended this solution for a single use: to show long-running jobs. The queries presented in the next section, however, show that much more information can be gleaned.
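As a sketch, an index along the following lines supports the time-range and Program_Name filters used by the analysis queries below; the exact key and included columns here are my suggestion, not necessarily what ships in the download.

-- One possible supporting index for SQL_Conn. The INCLUDE clause
-- requires SQL Server 2005 or later on the repository server.
CREATE NONCLUSTERED INDEX IX_SQL_Conn_Run_Date
    ON dbo.SQL_Conn (Run_Date)
    INCLUDE ([Server], Program_Name, blocked)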

Data Analyst Hat

The first query that I use to interrogate long-running jobs uses a surprisingly large threshold value of 5 hours, to demonstrate that in the real world some SQL Agent jobs do indeed take quite a while to complete. This is most obvious for re-indexing jobs on slower servers. However, there are some cases where jobs hang for other reasons and must be manually released. On occasion, we have seen backup jobs get hung up, or other processes get log-jammed behind jobs waiting for locks to release. In these circumstances, it is important to know about the issue as soon as possible. In lieu of setting up SQL alerts on each server to trigger on these events, we decided to let SSIS do the work. Granted, some manual investigation is required for now, but setting up reports to be sent when threshold values are exceeded is not a difficult proposition, and is more easily administered when all of the data is centrally housed. Listing 5 shows a query that displays long-running jobs from the “SQL_Conn” table. You’ll see from the WHERE clause that, in this case, we only want processes whose Program_Name contains “job” or “agent”. Also, the query only returns rows that were gathered in the past 15 minutes: GETDATE() - 0.01048, where 0.01048 days is roughly 15 minutes.

Listing 5: Long-Running Jobs Query
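The query runs along these lines. Be warned that the SUBSTRING offsets (assuming the hex job_id begins at character 30 of Program_Name and is 34 characters long) and the Jobs table column names are assumptions that you may need to adjust for your own repository:

-- Sketch of the long-running jobs query, with the job-name lookup.
-- fn_varbintohexstr renders the stored job_id as the same hex string
-- that SQL Agent embeds in Program_Name. The offsets and the Jobs
-- table column names (job_id, name) are assumptions.
SELECT  c.[Server],
        j.name AS Job_Name,
        c.Login_Time,
        c.Last_Batch,
        c.Run_Date
FROM    dbo.SQL_Conn c
JOIN    dbo.Jobs j
  ON    UPPER(master.dbo.fn_varbintohexstr(CAST(j.job_id AS varbinary(16))))
      = UPPER(SUBSTRING(c.Program_Name, 30, 34))
WHERE  (c.Program_Name LIKE '%job%' OR c.Program_Name LIKE '%agent%')
AND     DATEDIFF(hh, c.Login_Time, c.Run_Date) >= 5    -- the 5-hour threshold
AND     c.Run_Date >= GETDATE() - 0.01048              -- gathered in the past 15 minutes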

In the results of the query, shown in Figure 8, you can see that we have uncovered a long-running job. In this case, the query has revealed the job name itself, “Backup – Log [Full]”, derived by parsing Program_Name to get at the job_id. The job_id is returned only in SQL Server 2000 when a SQL Agent job is running. Notice that we joined to the “Jobs” table on the job_id we extracted using the SUBSTRING function. The Jobs table is also included in the full repository solution available at: http://www.sqlmag.com/Articles/ArticleID/97840/97840.html

Figure 8: Results of the long-running jobs query.

If you do not have the Jobs table, the following query will deliver the same results without the job name.

Listing 6: Long-Running Jobs Query Minus Job Name
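Something along these lines; again, a sketch built on the same assumptions about the SQL_Conn column names:

-- Long-running job connections, without the job-name lookup.
SELECT  [Server],
        Program_Name,
        Login_Time,
        Last_Batch,
        Run_Date
FROM    dbo.SQL_Conn
WHERE  (Program_Name LIKE '%job%' OR Program_Name LIKE '%agent%')
AND     DATEDIFF(hh, Login_Time, Run_Date) >= 5
AND     Run_Date >= GETDATE() - 0.01048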

The next query is one that has proved most useful when analyzing a SQL Server that has recently been suffering from poor performance. A simple count of connections at various points of the day can indicate whether there is an increase in resource contention due to excessive workload. The following query shows a daily connection count for the past 10 days.

Listing 7: Connection Counts by Day
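A sketch of the query; the server name is a placeholder for one of your own:

-- Daily connection counts for one server over the past 10 days.
SELECT  DATENAME(dw, Run_Date) AS Day_Name,
        DATEPART(dd, Run_Date) AS Month_Day,
        COUNT(*)               AS Connection_Count
FROM    dbo.SQL_Conn
WHERE   [Server] = 'SERVER_AAA'            -- placeholder server name
AND     Run_Date >= GETDATE() - 10
GROUP BY DATENAME(dw, Run_Date), DATEPART(dd, Run_Date)
ORDER BY MAX(Run_Date) DESC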

The results are from an actual server that has been experiencing slowdowns at several points of the day. The month is May, and you can see where the connections fall in line with expectations, say on the weekend or on a holiday (May 26th, Memorial Day). There is a sharp rise in connections Monday through Thursday, with a slight tapering off on Friday. Most of the problems have been reported around noon each weekday. Knowing this, we can further break down the results to several points during the day.

Day_Name     Month_Day   Connection_Count
Saturday     31          1716
Friday       30          3697
Thursday     29          4719
Wednesday    28          4494
Tuesday      27          4415
Monday       26          1546
Sunday       25          1581
Saturday     24          1635
Friday       23          3387
Thursday     22          4617
Wednesday    21          4897
Tuesday      20          4709
Monday       19          4750
Sunday       18          1259
Saturday     17          457

So now let’s look at a single day and the connections in each hour. Only a slight modification to the query is required: we use the Run_Date field value directly, rather than the DATEPART() or DATENAME() functions, and add a criterion to select a literal date, “05/30/2008”.
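Sketched out, the modification looks like this. Because each hourly collection stamps a given server’s rows with a single Run_Date value, grouping on the raw column yields one row per hourly run:

-- Hourly connection counts for one server on a single day.
SELECT  Run_Date,
        COUNT(*) AS Connection_Count
FROM    dbo.SQL_Conn
WHERE   [Server] = 'SERVER_AAA'                          -- placeholder server name
AND     CONVERT(varchar(10), Run_Date, 101) = '05/30/2008'
GROUP BY Run_Date
ORDER BY Run_Date DESC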

The results are certainly revealing. Of the 3697 connections that were counted between 6:00 AM and 11:00 PM, we can see that the load peaked between 11:00 AM (403 connections) and 1:00 PM (409 connections), before falling away to 209 connections by 3:00 PM. So what else might have been happening around that time of day?

Run_Date    Connection_Count
11:00 PM    107
10:00 PM    105
9:00 PM     109
8:00 PM     114
7:00 PM     112
6:00 PM     117
5:00 PM     115
4:00 PM     135
3:00 PM     209
2:00 PM     362
1:00 PM     409
12:00 PM    395
11:00 AM    403
10:00 AM    392
9:00 AM     366
7:00 AM     139
6:00 AM     108

The final query enlightens us as to what was going on, by including the “blocked” field from the “SQL_Conn” table.
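A sketch of that query follows. In sysprocesses the blocked column actually holds the SPID of the blocker, so I collapse it here to a simple 0/1 flag; whether the original stores the raw SPID or a flag is an assumption on my part:

-- Hourly counts split by whether the connection was blocked.
-- sysprocesses.blocked holds the blocking SPID, so collapse it to a
-- 0/1 flag (an assumption about how the original presents it).
SELECT  Run_Date,
        CASE WHEN blocked > 0 THEN 1 ELSE 0 END AS blocked,
        COUNT(*) AS Connection_Count
FROM    dbo.SQL_Conn
WHERE   [Server] = 'SERVER_AAA'                          -- placeholder server name
AND     CONVERT(varchar(10), Run_Date, 101) = '05/30/2008'
GROUP BY Run_Date, CASE WHEN blocked > 0 THEN 1 ELSE 0 END
ORDER BY Run_Date DESC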

We see that at 3:00 PM there was blocking occurring.

Run_Date    blocked   Connection_Count
11:00 PM    0         107
10:00 PM    0         105
9:00 PM     0         109
8:00 PM     0         114
7:00 PM     0         112
6:00 PM     0         117
5:00 PM     0         115
4:00 PM     0         135
3:00 PM     0         208
3:00 PM     1         1
2:00 PM     0         362
1:00 PM     0         409
12:00 PM    0         395
11:00 AM    0         403
10:00 AM    0         392
9:00 AM     0         366
7:00 AM     0         139
6:00 AM     0         108

At this point we do not know what was blocking or what was being blocked. Further investigation of the “SQL_Conn” table revealed that blocking was typically happening at 3:00 PM, which led us to look at the SQL Agent job schedules. This forensic work uncovered a job that was scheduled to defrag all fragmented indexes at 3:00 PM every day. Normally, this ran in under 3 minutes and had not previously caused a problem. It was not until more and more users began using the application that it created contention for resources and caused the subsequent blocking.

Conclusion

There are many ways to isolate performance issues with SQL Server, and most good DBAs have a grab-bag of tools to isolate and remedy those problem servers. However, this discovery process is often manual and, worse, done at the request of irate users who have phoned the help desk to say they have been kicked out of, or locked up in, their applications; all they recall are the three letters that they alone pronounce correctly, “ES – Queue – EL”. It helps the DBA to be able to interrogate a fully populated table of connections, gathered every hour of the day for the previous month; the DBA can then make important and costly decisions knowing the history. However, it is even more important to home in on a problem before the users even pick up the phone. I hope this solution gives readers a way to begin to do both.

 

More Information

http://msdn.microsoft.com/en-us/library/ms139805.aspx