Monitoring Transactional Replication in SQL Server

If you are using replication in SQL Server, you can monitor it in SSMS, but it makes sense to monitor distribution jobs automatically, especially if you can set up alerts or even set up first-line remedial action when a problem is detected. Francis shows how to do it in TSQL as an agent job.

There are several variables that we can examine in order to determine the health of our transactional replication system. Whilst the values  of these various counters may not, by themselves, indicate a problem, they can serve to give us clues as to what we can do to get replication back into a healthy status.

Normally we can monitor replication with the Replication Monitor Tool that is part of SQL Server Management Studio.  Since we cannot watch this 24 hours a day, it is useful to supplement this with a programmatic way to monitor replication.  Even if you opt to use a 3rd party tool like Red Gate’s SQL Monitor you may need to create a custom alert. Even if your tool of choice allows ‘out-of-the-box’ monitoring of replication, you might still need to set up a specific response to the counters if they exceed a threshold value.  More on that later.

One of the more useful tools supplied by Microsoft is the system stored procedure called dbo.sp_replmonitorsubscriptionpendingcmds which is stored in the distribution database.

This procedure returns both the number of commands that are pending for the subscription and an estimate of the number of seconds required to deliver all of the pending commands to the Subscriber.  I have a job that runs every 10 minutes and saves the data in a table as an historical record of the status.  I keep this data for 14 days.

This table is created in a DBA database installed on the subscriber server.  The code to create the table is:

Script 1

The data in this table is populated by the monitoring procedures and provides an historical context for examining issues.  But to monitor what is happening right now more is required.

There are three things that help to determine the health of replication.

  1. The status of the replication related jobs
  2. The latency (especially the distribution latency) as measured by the counter Dist:Delivery Latency  
  3. The number of outstanding commands that are pending for the subscription

I have focused on the distribution latency because past experience has shown this to be more of an issue than log reader latency. Much of the time, the distribution latency is due to an increase in  transaction volume. For example doing an index rebuild on a large table in the publisher database can cause a large increase in the volume of the transaction log which results in a higher than normal volume of data to be replicated. For further discussion of this idea see Linchi Shea’s blog from 2009 Distribution latency in transactional replication: Is a volume surge the culprit?

If there are a large  number of outstanding commands waiting to be distributed, then sometimes this is due to a distribution agent job that is not running.  On the other hand, sometimes this job is running but not keeping up. By stopping and starting the agent the job starts to process the outstanding commands.

   To start off, we need to get information about replication, such as the name of the publisher and the subscriber, the names of the distribution agent jobs etc.  Microsoft has supplied several procedures in the Distributor database to help to gather information.  My Distributor database is on the same server as my Subscriber database so my script is simpler than if they were on separate servers. Firstly executing sp_replmonitorhelppublisher returns monitoring information for all Publishers using this Distributor. Executing sp_replmonitorhelppublication returns monitoring information for all publications using this Distributor. Lastly, executing sp_replmonitorhelpsubscription returns monitoring information for all subscriptions.  This information includes some latency numbers so I after executing this procedure I already have some key information.

This is the code I use to gather this information:

Script 2

Once I know some of this basic information about my publishers and my subscribers I can then check the status of the distribution jobs.  They should all be running.  If any are not running then I need to start them.  If I need to restart a job I set a flag to force an email alert to be sent.  I don’t want to actually send the email yet since I need to check the status of all my subscriptions. Again, if this number exceeds my threshold then I also set the flag to trigger my email alert. I used a cursor to go through my subscriptions because this was the easiest way I could gather the information and then use this information as parameters to other stored procedures to determine if the distribution agent was running and potentially stop and restart the agent:

Script 3

I run the Microsoft-supplied procedure sp_replmonitorsubscriptionpendingcmds to gather information on outstanding commands and the expected number of seconds to catch up.

This is the information I want to store in my history table so I can get an idea of how replication is performing.

We need to decide on an acceptable threshold for latency.  I use sixty minutes, meaning that I want to be notified if the replicated database is more than sixty minutes behind the publisher database.  The other decision is the maximum number of undistributed commands.  If this number trends upwards as opposed to fluctuating, then there may be a problem. It is your choice how high you want to let this number get before taking action.  I choose to let the system get to 80000 undistributed commands. 

I picked these numbers by letting my Replication Queue checker job run for a couple of weeks, ensuring all maintenance jobs like index rebuild jobs had run.  Then I looked at the periods with the greatest number of undistributed commands and the greatest latency and made sure my settings were higher.  I don’t want to get called at night if a Rebuild Index job causes the system to backup temporarily but is capable of easily recovering by itself.

The code below requires the Ad Hoc Distributed Queries server configuration option be enabled. Here I create the email to be sent assuming the previous Script 3 found an issue.

Script 4

 The last process is to periodically delete rows from the replication status table so the data does not get stale.

Script 5

If any of the thresholds you set up in this script are met, then the distribution agent corresponding to the subscription with the offending counter will be stopped and restarted unless it is already stopped in which case the job will just be started. An email will be sent to let you know that some action was taken.  In many cases, restarting the distribution agent fixed the issue and replication started to work again.  If this did not fix the issue then when this job next runs the same action will be taken and a second email sent.  At this point, if you aren’t already examining the situation then you need to start.

If you have a 3rd party alerting system you could just use a modified form of the 3rd script supplied to stop and restart the distribution agent jobs if any of the thresholds are met.  Otherwise, run script 1 to create the required table.  Create a new job and put scripts 2-4 in the first step, then put script 5 as the second step of the job and schedule this job for whatever frequency makes sense to you.  I am currently running this every 10 minutes.

This process should help with the period stoppages of transactional replication but automating the recovery process.  It is still important to periodically monitor the process using the Replication Monitor tool.  This process is only meant to help prevent the off-hours calls that can be easily fixed by kick starting a distribution agent job.