Optimizing Batch Process in SQL Server

SQL Server batch processes are usually run from SQL Agent in the background. They can take significant time and resources, especially if they are ETL tasks. Quite often, the responsibility for creating these tasks belongs entirely to the developer. Dennes demonstrates that DBAs can advise and assist with this type of batch job by bringing their expertise to bear on the problem of reducing its impact on the working system to a minimum.

You often need to run batch SQL processes on a database as a SQL Agent task. ETL tasks, for example, are common on commercial databases; data warehouses also use batch processes to import and transform data; OLTP systems have batches to process information.

I have worked in development teams that have given developers complete autonomy to create and run batch processes on the database servers. Because of the many optimizations that can, and should, be made so that a batch process runs faster and more efficiently, a better result is often achieved when developers work together with DBAs. A slow-running batch process can have a knock-on effect, draining CPU and IO resources, blocking other processes and locking shared database assets. Teamwork between developers and DBAs can avoid this, making for much better batch processes.

I will show an example of optimizing a batch process and explain what you need to analyze while creating these processes. I hope to prove to you that this should be more than a developer task, because the DBA can make a positive contribution.

Test Environment

Development and testing are generally done on an isolated development server with a single user. The focus is on the creation of the batch process before the batch is deployed to production.

Our test environment has the AdventureWorks2012 database, but we will create a new database during the tests and import information from the AdventureWorks2012 database.

The script to create the test environment is this:
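
Something along these lines would do it; the table and column names come from later sections of this article, while the index name, its key column and the lack of explicit file settings are assumptions for the sketch:

CREATE DATABASE TestBatch;
GO

USE TestBatch;
GO

CREATE TABLE dbo.Produtos
(
    ID                   INT IDENTITY(1,1) NOT NULL,
    ProductID            INT NOT NULL,
    ProductSubCategoryID INT NULL,
    UnitPrice            MONEY NOT NULL,
    CONSTRAINT PK_Produtos PRIMARY KEY CLUSTERED (ID)
);
GO

-- The non-clustered index discussed later in the article;
-- the name and key column here are assumptions.
CREATE NONCLUSTERED INDEX ind_Produtos
    ON dbo.Produtos (ProductSubCategoryID);
GO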

However, because we will change the database environment during the test, we will drop the database and create the entire environment again after each test.

The example batch that we will use for the test will be the following:
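
A sketch of such a batch is shown below. The 1,000-row chunks and the CROSS APPLY with spt_values come from the description that follows; the source query against Production.Product, the use of ListPrice and the number of passes are illustrative assumptions:

USE TestBatch;
GO

DECLARE @Pass INT = 1, @Offset INT, @Inserted INT;

WHILE @Pass <= 3                        -- a few passes over the source data (assumed)
BEGIN
    SET @Offset = 0;
    SET @Inserted = 1;

    WHILE @Inserted > 0                 -- insert each pass in chunks of 1,000 rows
    BEGIN
        INSERT INTO dbo.Produtos (ProductID, ProductSubCategoryID, UnitPrice)
        SELECT ProductID, ProductSubcategoryID, ListPrice
        FROM (
              SELECT p.ProductID, p.ProductSubcategoryID, p.ListPrice,
                     ROW_NUMBER() OVER (ORDER BY p.ProductID, v.number) AS rn
              FROM AdventureWorks2012.Production.Product AS p
              CROSS APPLY master.dbo.spt_values AS v
              WHERE v.type = 'P'         -- the numbers table used for data simulation
             ) AS src
        WHERE src.rn > @Offset AND src.rn <= @Offset + 1000;

        SET @Inserted = @@ROWCOUNT;
        SET @Offset += 1000;
    END;

    SET @Pass += 1;
END;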

Our batch will insert millions of rows into our table Produtos, to simulate an ETL data-import operation that inserts many blocks of 1,000 records. It works by not only importing the rows from AdventureWorks2012, but also doing a CROSS APPLY with the spt_values table in the master database. This table contains a sequence of numbers (0 through 2047 for the ‘P’ type) that we can use for data simulation, exactly as we are doing.

We will also change this batch during the tests, to improve both the performance and efficiency of the batch process.

Metrics for the Test Results

For each test run, we will gather some counters so that we can compare the results.

Execution Time

The execution time can vary a lot depending on your system, but the relative differences in time between the executions are likely to be similar.

Wait Stats

We will check wait stats, using the following queries:
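
A query along these lines does the job (which wait types you filter out as noise is up to you):

SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE waiting_tasks_count > 0
ORDER BY wait_time_ms DESC;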

We will also clear the wait stats between the tests. This isn’t the kind of thing you should do in production, but of course you will be running this in a test environment.

The instructions to clear the wait stats:
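
DBCC SQLPERF (N'sys.dm_os_wait_stats', CLEAR);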

Performance Counter

We will also use Performance Monitor to check the Lock Requests/sec counter. We will see how important this counter is.

You can see in the image below how to select this counter in performance monitor.

[Figure: selecting the Lock Requests/sec counter in Performance Monitor]

Log Flushes

Last but not least, we will check the number of log flushes that happen during the batch. A log flush is the name given to the event in which SQL Server writes the log cache to the log file; it happens after every transaction commit or when the log buffer is full.

This is a more difficult metric. We will need to create an Extended Events session to check this information.

Extended Events, introduced in SQL Server 2008, provide a less intrusive event system that allows us to capture many more kinds of events than SQL Profiler, which is now deprecated.

We will create the XE session with the following instruction:
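
A session along the lines below captures what we need; the log_flush_start event and the histogram settings are a reasonable fit for what is described, though the exact definition may differ in your environment:

CREATE EVENT SESSION logFlush ON SERVER
ADD EVENT sqlserver.log_flush_start
ADD TARGET package0.histogram
    (SET filtering_event_name = N'sqlserver.log_flush_start',
         source               = N'database_id',
         source_type          = 0)      -- 0 = the source is an event column
WITH (STARTUP_STATE = OFF);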

The histogram target in the XE session will give us totals grouped by the database_id field. We will retrieve this information using the following query:
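
A query like the following reads the histogram; the session and target names match the logFlush session defined above:

SELECT
    slots.n.value('(value)[1]', 'int')     AS database_id,
    slots.n.value('(@count)[1]', 'bigint') AS log_flushes
FROM sys.dm_xe_sessions AS s
JOIN sys.dm_xe_session_targets AS t
    ON t.event_session_address = s.address
CROSS APPLY (SELECT CAST(t.target_data AS xml) AS target_data) AS x
CROSS APPLY x.target_data.nodes('HistogramTarget/Slot') AS slots(n)
WHERE s.name = N'logFlush'
  AND t.target_name = N'histogram';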

As you will have noticed, the results of the histogram are stored in sys.dm_xe_session_targets in an XML field.

Cleanup Sequence

We will also need a cleanup sequence to ensure the independence of each test execution. This will be the following sequence:

  1. Drop the database if it already exists
  2. Run the script to create the environment
  3. Clear the wait stats
  4. Clear the performance monitor window
  5. Clear the SQL Server cache
  6. Stop/start the logFlush XE session
    In the first execution, let’s only start the session:
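
ALTER EVENT SESSION logFlush ON SERVER STATE = START;

On the following executions, stopping and starting the session resets the histogram:

ALTER EVENT SESSION logFlush ON SERVER STATE = STOP;
ALTER EVENT SESSION logFlush ON SERVER STATE = START;

For step 5, clearing the SQL Server cache usually means something like CHECKPOINT followed by DBCC DROPCLEANBUFFERS and DBCC FREEPROCCACHE; again, only do this in a test environment.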

First Execution

We will use the time taken by the first execution as a reference to see how much we can improve the batch, but of course it will depend considerably on your test environment. Let’s do the cleanup steps I’ve just shown, and then execute the batch.

The first execution took 08:04 (mm:ss).

The first problem: Auto-Growth

The first thing to notice is that the sizes of both the data file and the log file have increased after the execution of the batch, the data file much more than the log file. The database has auto-growth on, and the amount of growth was determined by the database auto-growth configuration.

You can use the sp_helpdb procedure to get the result in the image below.
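
Assuming the database name we are using in these tests:

EXEC sp_helpdb N'TestBatch';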

[Figure: sp_helpdb output showing the increased data and log file sizes]

Auto-growth is there to prevent disasters, and so is a great backstop, but we should try to avoid it happening in the middle of batch executions, because it makes our batch a lot slower. We need the auto-growth configured to prevent a disaster because it would be much worse to have our batch aborted by lack of space, but we will try to reduce the number of auto-growth events occurring during the batch. We can do this by doing the following:

  1. We will change the database and log file size to accommodate all the new data before the execution of the batch. The DBA can schedule this change to happen just before the execution.
  2. We will change the auto-growth configuration so that the files will grow in bigger increments. If we fail to predict the size needed for the files, we will still minimize the number of auto-growths during the batch execution.

We can check all the auto-growth events that happened in the server with the following script:
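
A sketch of such a script, using the well-known pattern of reading the default trace (the log.trc base name plus its rollover files):

DECLARE @path NVARCHAR(260);

-- find the folder of the default trace and point to the log.trc base file
SELECT @path = REVERSE(SUBSTRING(REVERSE(path), CHARINDEX(N'\', REVERSE(path)), 260)) + N'log.trc'
FROM sys.traces
WHERE is_default = 1;

SELECT te.name AS event_name,
       t.DatabaseName,
       t.FileName,
       t.StartTime,
       t.IntegerData * 8 / 1024 AS growth_mb
FROM sys.fn_trace_gettable(@path, DEFAULT) AS t
JOIN sys.trace_events AS te
    ON te.trace_event_id = t.EventClass
WHERE te.name IN (N'Data File Auto Grow', N'Log File Auto Grow')
ORDER BY t.StartTime;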

This script uses the default server trace, which is enabled by default and contains the auto-growth information. The trace records its data in the log.trc file (and its rollover files) and we read this data using the sys.fn_trace_gettable function.

Let’s change this query a bit to get the total number of auto-growth events only for our database and only during the last execution of our batch:
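
Building on the previous query, a filtered count could look like this (the database name and the fifteen-minute window are the ones used in these tests):

DECLARE @path NVARCHAR(260);

SELECT @path = REVERSE(SUBSTRING(REVERSE(path), CHARINDEX(N'\', REVERSE(path)), 260)) + N'log.trc'
FROM sys.traces
WHERE is_default = 1;

SELECT te.name AS event_name,
       COUNT(*) AS total_events
FROM sys.fn_trace_gettable(@path, DEFAULT) AS t
JOIN sys.trace_events AS te
    ON te.trace_event_id = t.EventClass
WHERE te.name IN (N'Data File Auto Grow', N'Log File Auto Grow')
  AND t.DatabaseName = N'TestBatch'
  AND t.StartTime > DATEADD(MINUTE, -15, GETDATE())
GROUP BY te.name;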

Notice that we are getting the auto-growth events from only the last fifteen minutes, making it easier to retrieve the events triggered by our batch.

In our example, there were 84 auto-growth events for the data file and 13 for the log file. These auto-growth events are making our batch slower.

[Figure: auto-growth events for the TestBatch data and log files]

The solution is to make the data and log files big enough before you run the batch.

Here is the first proof that you would benefit from having the advice of a DBA when you plan the batch process: The DBA will help to predict the impact of the batch import of data on the size of your data and log files and, if necessary, correct the growth settings.

Planning the size of the data and log files

The new CREATE DATABASE instruction for the test system will be like this:
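
A sketch with placeholder sizes, paths and logical file names (all of them assumptions to be adjusted to your own environment and estimates):

CREATE DATABASE TestBatch
ON PRIMARY
(
    NAME = N'TestBatch',
    FILENAME = N'C:\SQLData\TestBatch.mdf',      -- path is a placeholder
    SIZE = 200MB,                                -- big enough for the imported data
    FILEGROWTH = 100MB                           -- bigger increments as a backstop
)
LOG ON
(
    NAME = N'TestBatch_log',
    FILENAME = N'C:\SQLData\TestBatch_log.ldf',
    SIZE = 100MB,
    FILEGROWTH = 100MB
);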

Notice that we are not only creating the database with bigger files, but we are also creating the files with a bigger filegrowth value. If for some reason our calculations are wrong and auto-growth still happens during the execution of the batch, it will happen fewer times.

In our example we are just changing the CREATE DATABASE instruction, but for a production environment the DBA needs to plan the change of the size of the existing files. The instructions would be like this:
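
For example, assuming the logical file names used in the sketch above:

ALTER DATABASE TestBatch
    MODIFY FILE (NAME = N'TestBatch', SIZE = 200MB, FILEGROWTH = 100MB);

ALTER DATABASE TestBatch
    MODIFY FILE (NAME = N'TestBatch_log', SIZE = 100MB, FILEGROWTH = 100MB);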

In our test environment, it’s easy to predict the size: We already executed the script once, so the size needed is the final size of the data files after the first execution, with an allowance for variation.

This kind of shortcut is the exception, because in a test environment you usually don’t have the same amount of data that you have in production. The DBA will need to do some calculations to predict the amount of space that will be necessary for the data.

In our example, the INT fields have 4 bytes each (ID, ProductID, ProductSubCategoryID). The money field has 8 bytes (UnitPrice). The row has 20 bytes of data plus one more to control the null field, making 21 bytes in all.

Each page has 8KB, so we will have about 380 rows per page. This is an approximate calculation.

We are inserting a total of 3,096,576 (three million, ninety-six thousand, five hundred and seventy-six) records. 3,096,576 / 380 ≈ 8,149 pages needed for these records.

Calculating the space: 8,149 * 8KB = 65,192KB. This isn’t the final amount, because we aren’t counting the index pages and a few more details, but it already gives you an idea of how this calculation can be done.

Side Note – Why is auto-growth the first problem to check?

You are developing a batch process and checking how it performs. If you don’t check auto-growth first, each execution will have different results and it will become too difficult to analyze them.

For example, the first execution could have a result with a lot of autogrowth during the execution, whereas the second execution could have a very different result because the files already have the space needed for the data.

There are plenty of possible results, so the best approach is to check auto-growth first and avoid it during the batch execution.

Second Execution

Let’s now repeat the steps to do a new execution with the larger data file and log file sizes:

  1. Drop the database if it already exists
  2. Run the script to create the environment
  3. Clear the wait stats
  4. Clear SQL Server cache
  5. Clear the performance monitor window
  6. Stop/start the logFlush XE session

Second Execution time: 07:38

Running the query again to retrieve the total number of auto-growth events (you may need to adjust the filter from the last fifteen minutes to the last nine minutes), you will receive an empty result.

Monitoring the Auto-Growth

Let’s say something went wrong in the planning steps. How could you know that a batch already in production is suffering from auto-growth during its execution?

Here are some solutions you can use:

  1. Data Collector can create a graphical report that demonstrates the data and log file growth. You can read more about the Data Collector in my article ‘Centralize Your Database Monitoring Process’
  2. You can add the same query that we used to check the auto-growth events as a new step in your batch, adding the result to a new table that you will need to check later.
  3. You can use Performance Monitor to check the auto-growth. There is a counter named ‘Log Growths’ and another one named ‘Data File(s) Size (KB)’. You can use Performance Monitor to capture this information to a file and check the file later.

[Figure: Performance Monitor showing the database file size and growth counters]

Second Problem: Locks

By observing the results of ‘Lock Requests/sec’ in Performance Monitor, we would notice a huge number of locks. Does our batch run concurrently with other tasks? If this is an overnight batch performing an ETL task during a quiet time, we probably don’t need this number of locks, because there is less risk of inadvertently blocking another process.

[Figure: Lock Requests/sec during the batch execution]

The locks have two sources: the INSERT statement in the batch and the SELECT statement in the batch. One solution for this is to use the TABLOCKX query hint, so instead of getting a lock for each row, we get a lock for the entire table. Notice that I will not use NOLOCK for the SELECT statement. Although NOLOCK is theoretically a possible solution for SELECTs, it is likely to be dangerous because, in some circumstances, it can result in wrong data.

The new script will be like this:
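
In the sketch we have been using, only the INSERT changes; it is shown here as a single standalone chunk so that it runs on its own:

DECLARE @Offset INT = 0;

INSERT INTO dbo.Produtos WITH (TABLOCKX)       -- one lock for the whole target table
       (ProductID, ProductSubCategoryID, UnitPrice)
SELECT ProductID, ProductSubcategoryID, ListPrice
FROM (
      SELECT p.ProductID, p.ProductSubcategoryID, p.ListPrice,
             ROW_NUMBER() OVER (ORDER BY p.ProductID, v.number) AS rn
      FROM AdventureWorks2012.Production.Product AS p
      CROSS APPLY master.dbo.spt_values AS v
      WHERE v.type = 'P'
     ) AS src
WHERE src.rn > @Offset AND src.rn <= @Offset + 1000;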

Let’s do the cleanup again and repeat the execution:

  1. Drop the database if it already exists
  2. Run the script to create the environment
  3. Clear the wait stats
  4. Clear the SQL Server cache
  5. Clear the performance monitor window
  6. Clear and start the logFlush XE session.

Third Execution time: 05:41

You can observe in performance monitor a huge decrease in the number of lock requests/sec in the database, and this is reflected in the execution time.

[Figure: Lock Requests/sec after adding the TABLOCKX hint]

Monitoring Locks in Production

To monitor locks in production, you need to create a baseline. This means continuously monitoring the number of locks so that you know the regular amount of locking in your system; you will then notice anything different, especially when you deploy new batches to production.

You can use the Data Collector to monitor locks on your servers and create the baseline. The article ‘Centralize Your Database Monitoring Process’ is a good start.

Third Problem: WriteLog

Let’s take a look at the wait stats to identify another main problem in our batch.

You will notice WRITELOG with a big value, 2561 in our example. This is causing problems for our batch. The first thought would be to get a better disk and to locate the log and data files on separate disks. Both solutions are good, but we need to return to the basics: Is our batch doing more log flushes than is actually required?

[Figure: wait stats showing a high WRITELOG value]

A log flush, writing the log to the disk, will happen at every commit that our batch does. Oh, wait! Our batch doesn’t have a COMMIT instruction, does it?

In this case, we don’t have an explicit transaction, but every INSERT instruction runs in its own internal transaction, so it is as if each INSERT had its own BEGIN TRANSACTION and COMMIT TRANSACTION.

Because of this, we have many transactions in our batch and a lot of log flushes.

We can SELECT the result of the logFlush XE session, using the same histogram query we saw in the Log Flushes section, to find out how many log flushes happened:

Of course, this is a great query to transform into a table-valued function so you can reuse it easily, but this could be the subject for a future article about extended events.

[Figure: log flush totals from the logFlush XE session]

We can see that we had 8871 log flushes during our batch.

The first solution we think of is simple: make the entire batch one single transaction. Now our batch will have an explicit BEGIN TRANSACTION/COMMIT TRANSACTION. The new script will be like this:
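
Sticking with the sketch used so far, the change is to wrap the whole set of loops in one explicit transaction (the chunk size is kept in a variable here because we will tune it later):

USE TestBatch;
GO

DECLARE @ChunkSize INT = 1000;
DECLARE @Pass INT = 1, @Offset INT, @Inserted INT;

BEGIN TRANSACTION;

WHILE @Pass <= 3
BEGIN
    SET @Offset = 0;
    SET @Inserted = 1;

    WHILE @Inserted > 0
    BEGIN
        INSERT INTO dbo.Produtos WITH (TABLOCKX)
               (ProductID, ProductSubCategoryID, UnitPrice)
        SELECT ProductID, ProductSubcategoryID, ListPrice
        FROM (
              SELECT p.ProductID, p.ProductSubcategoryID, p.ListPrice,
                     ROW_NUMBER() OVER (ORDER BY p.ProductID, v.number) AS rn
              FROM AdventureWorks2012.Production.Product AS p
              CROSS APPLY master.dbo.spt_values AS v
              WHERE v.type = 'P'
             ) AS src
        WHERE src.rn > @Offset AND src.rn <= @Offset + @ChunkSize;

        SET @Inserted = @@ROWCOUNT;
        SET @Offset += @ChunkSize;
    END;

    SET @Pass += 1;
END;

COMMIT TRANSACTION;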

To try the new script, let’s repeat the cleanup steps again:

  1. Drop the database if it already exists
  2. Run the script to create the environment
  3. Clear the wait stats
  4. Clear the SQL Server cache
  5. Clear the performance monitor window
  6. Clear and start the logFlush XE session.

Fourth Execution time: 04:59

Fourth Problem: Auto-Growth again

The execution got a little better but not as much as we might have been hoping for.

Let’s check the number of log flushes again. It dropped, but not by much:

[Figure: log flush totals after the single-transaction change]

The number of lock requests dropped again, which is good:

[Figure: Lock Requests/sec after the single-transaction change]

We can check the wait stats again and we will notice that the WRITELOG waits decreased a lot, so the problem seems solved.

[Figure: wait stats after the single-transaction change]

But you might be surprised when you check the size of the database files:

[Figure: database file sizes after the fourth execution]

So, we have auto-growth happening again. Check the total number of auto-growth events to confirm this:

[Figure: auto-growth events during the fourth execution]

Why did the log file get so much bigger?

Transaction log space management is directly tied to the transactions and the recovery model of the database, which, in turn, is tied to the database backup plan.

We can check the recovery model of the database using the following query:
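
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = N'TestBatch';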

In our example, we are using the simple recovery model. This means that SQL Server will truncate the log after each checkpoint. Truncating the log means that SQL Server will remove from the log all completed transactions, while keeping the ongoing transactions in the log.

Side Note

I said we are using the simple recovery model, but that’s not ‘so simple’. We didn’t specify the recovery model in the CREATE DATABASE statement, so the recovery model of the TestBatch database is configured as Full, inherited from the model database.

Although the database is configured with the full recovery model, it’s not working that way. That’s because it’s not enough for a database to be configured with the full recovery model; the database needs a full backup before it actually starts working in the full recovery model.

This situation creates a ‘fake full’ recovery model: the database is configured with the full recovery model, but it’s working as if it were in simple.

To identify the ‘fake full’ case we can use the following query:
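
For example:

SELECT d.name,
       d.recovery_model_desc,
       dr.last_log_backup_lsn
FROM sys.databases AS d
JOIN sys.database_recovery_status AS dr
    ON dr.database_id = d.database_id
WHERE d.name = N'TestBatch';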

If the last_log_backup_lsn field is NULL, this means the database is in the simple or ‘fake full’ recovery model.

However, if you find that your database isn’t using the simple recovery model, you shouldn’t change it without the help of the DBA and a lot of planning. If the backup plan includes log backups, the recovery model needs to be ‘full recovery’. If the backup plan does not include log backups, the recovery model should be ‘simple recovery’.

Here is the second proof that you would benefit from having the advice of a DBA who understands the relation between recovery model, transactions and log space and can plan the change of transaction log file size as part of your batch.

The approach to solving the log growth needs to be different according to the recovery model. Let’s analyze the possibilities.

The batch with Full Recovery Model

When we use the full recovery model, log truncations do not happen at each checkpoint, but instead happen only when we do a log backup (notice that it’s a mistake to use the full recovery model without executing regular log backups).

To take control of the log space, the DBA would probably like to schedule a log backup right after the batch execution. The log backup triggers a log truncation, leaving the log with a lot of empty space.

Even if the exact amount of space couldn’t be predicted from the tests, the exact space needed in the log would become clearer after the first or a few more executions in production: It will then be possible to avoid all the log growth during the batch execution.

The DBA could choose whether to increase the log file size before the batch and return it to its regular size after the log backup or just keep the log file size always with enough space for the batch.

The batch with Simple Recovery model

Before we included the BEGIN/COMMIT in the script, we had many transactions in the batch, so at each checkpoint, SQL Server could delete all completed transactions from the log (the log truncation process) and the log would not grow too much.

Now that we have only one transaction, SQL Server will only remove this transaction from the log in the first truncation after the entire transaction finishes. Therefore, you will need to have enough space in the log for the entire transaction.

We need to make the log file bigger to support the entire transaction but, of course, we will not keep a huge transaction log only because of a single batch execution. The best approach is to increase the log for the batch and return the log to its normal size after the batch.

However, using the simple recovery model we do not really need a single transaction. We can afford a balance between the log size, the transaction volume and the log flushes.

At first, we had many small transactions, causing many log flushes. However, the COMMIT TRANSACTION isn’t the only cause of a log flush: if the log buffer is full, the flush will happen anyway. So, if we reduce the transactions to a single one, we will not reduce the log flushes to one, and we will greatly increase the space needed.

We can use chunks bigger than a thousand rows but smaller than the total number of records, achieving in this way a balance between log flushes, transactions and log space.

In our sample batch, there are a few solutions:

  1. Remove the BEGIN TRANSACTION/COMMIT TRANSACTION and increase the size of the chunk
  2. Change the position of the BEGIN TRANSACTION/COMMIT TRANSACTION to inside the first WHILE loop
  3. Create another WHILE loop and a counter variable to control the number of records in each transaction

Changing our Script

As we just saw, there are plenty of possibilities to balance the log size, transaction volume and log flushes. I will proceed with the tests using a single transaction and leave it to you to find the balance for your environment.

We can do a test to find the approximate amount of space needed for the log. We can insert one record into the table and check the amount of log generated. This will work better if the recovery model is simple; otherwise it will be more difficult to find the information in the log.

The script to insert a record and read the log will be this one:
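
A sketch of such a script; the inserted values are arbitrary, and only a few of the fn_dblog columns are listed:

USE TestBatch;
GO

INSERT INTO dbo.Produtos (ProductID, ProductSubCategoryID, UnitPrice)
VALUES (1, 1, 10.00);

SELECT [Current LSN],
       Operation,
       Context,
       [Log Record Length],
       AllocUnitName
FROM sys.fn_dblog(NULL, NULL)
ORDER BY [Current LSN];

This is easiest to read on a freshly created copy of the database, where the log contains very little besides our insert.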

You can see the result in the following image:

[Figure: fn_dblog output for the single-row insert]

The rows between LOP_BEGIN_XACT and LOP_COMMIT_XACT are our transaction. We need to sum the values of the Log Record Length field between these rows, multiply by the number of rows we will really have in our transaction, and add the Log Record Length values for the LOP_BEGIN_XACT and LOP_COMMIT_XACT rows.

Let’s do the calculations:
128 + 112 = 240 bytes per row
240 * 3,096,576 = 743,178,240 bytes
743,178,240 + 124 (LOP_BEGIN_XACT) + 84 (LOP_COMMIT_XACT) = 743,178,448 bytes ≈ 709 MB

We saw in our tests that the log file grew to 900 MB, but this happened because we relied on the auto-growth feature; each growth creates new VLFs (Virtual Log Files) inside the log. By adjusting the log file to the right size, we will create fewer VLFs and will make better use of the space. However, we can use this value plus 10% and be reasonably sure that the log space used will be less than that.

You can read more about VLFs in Paul Randal’s excellent article ‘Understanding Logging and Recovery in SQL Server’ and you will also see more about these calculations in Paul Randal’s benchmarking articles.

Let’s change the CREATE DATABASE instruction to make enough log space for our batch:
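
Using the same placeholder names and paths as before, and rounding the 709MB estimate up to leave some headroom:

CREATE DATABASE TestBatch
ON PRIMARY
(
    NAME = N'TestBatch',
    FILENAME = N'C:\SQLData\TestBatch.mdf',
    SIZE = 200MB,
    FILEGROWTH = 100MB
)
LOG ON
(
    NAME = N'TestBatch_log',
    FILENAME = N'C:\SQLData\TestBatch_log.ldf',
    SIZE = 800MB,          -- ~709MB estimated for the single transaction, plus headroom
    FILEGROWTH = 100MB
);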

To repeat: this change in the CREATE DATABASE instruction only works for our example. In the production environment, the DBA will need to change the size of the existing file. More than that: for the log file, the DBA would probably like not only to make the file bigger before the execution, but also to return the file to its normal size after the execution.

To return the log file to its normal size, the DBA can use a simple DBCC SHRINKFILE if the recovery model is simple. If the recovery model is full, the DBA will need to execute a log backup first and then use DBCC SHRINKFILE.
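
For example, assuming the logical log file name used in the sketches above and a 100MB ‘normal’ size:

-- simple recovery model: just shrink the log back
DBCC SHRINKFILE (N'TestBatch_log', 100);

-- full recovery model: back up the log first, then shrink
BACKUP LOG TestBatch TO DISK = N'C:\SQLBackup\TestBatch_log.trn';   -- path is a placeholder
DBCC SHRINKFILE (N'TestBatch_log', 100);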

Fifth Execution

Now we will repeat the clean up and do a new test run of the batch execution:

  1. Drop the database if it already exists
  2. Run the script to create the environment
  3. Clear the wait stats
  4. Clear the SQL Server cache
  5. Clear the performance monitor window
  6. Stop/start the logFlush XE session

Fifth Execution time: 03:18

Fifth Problem: Index and Fragmentation

The table that is receiving the data has two indexes: one clustered and one non-clustered. We created the clustered index to enforce the primary key, using an always-increasing key defined with the IDENTITY attribute.

The problem is the non-clustered index: for each row inserted, the non-clustered index needs to be updated, slowing our batch. The rows are inserted all over the index tree because the key isn’t always-increasing, so we have a lot of page splits during the batch, slowing the process even more and resulting in a very fragmented index, which will need to be defragmented after the batch.

We can use the following instruction to identify the index fragmentation:
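
A query along these lines, using sys.dm_db_index_physical_stats, shows the fragmentation of both indexes:

USE TestBatch;
GO

SELECT i.name AS index_name,
       ps.index_type_desc,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.Produtos'),
                                    NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
   AND i.index_id  = ps.index_id;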

[Figure: index fragmentation results from sys.dm_db_index_physical_stats]

As you might notice from these figures, the index is completely fragmented. In summary, our batch became slower because it had to maintain an index that ended up completely fragmented and will cost more time to defragment. Did you notice the LCX_INDEX_LEAF row when we checked our transaction in the log? That row is the update of the non-clustered index, making our log bigger.

The best solution: Drop the index before the batch and create the index again after the batch. Of course, this isn’t simple: there are a lot of points to consider.

Here is the third proof that you would benefit from having the help of a DBA, to analyze the possibility of dropping the indexes and creating them again after the batch, controlling this execution.

We will change the script that creates our environment so we won’t create the index until the bulk insertion is complete. The new script will be:
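
The relevant change, building on the environment sketch shown earlier, is to create the table without the non-clustered index and to create that index only after the batch has finished:

USE TestBatch;
GO

CREATE TABLE dbo.Produtos
(
    ID                   INT IDENTITY(1,1) NOT NULL,
    ProductID            INT NOT NULL,
    ProductSubCategoryID INT NULL,
    UnitPrice            MONEY NOT NULL,
    CONSTRAINT PK_Produtos PRIMARY KEY CLUSTERED (ID)
);
GO

-- ... run the batch ...

-- recreated only after the batch finishes (name and key column are assumptions, as before)
CREATE NONCLUSTERED INDEX ind_Produtos
    ON dbo.Produtos (ProductSubCategoryID);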

Now we will repeat the clean up and do a new execution:

  1. Drop the database if it already exists
  2. Run the script to create the environment
  3. Clear the wait stats
  4. Clear the SQL Server cache
  5. Clear the performance monitor window
  6. Stop/start the logFlush XE session

Sixth Execution Time: 03:12

Sixth Problem: Balance the amount of log flushes

We still didn’t achieve a good balance in the number of log flushes. This problem is very difficult to identify. There are a few wait stats that indicate the problem: WRITELOG was reduced but still happens; SOS_SCHEDULER_YIELD is high, even considering that our batch query plan doesn’t use parallelism; and DIRTY_PAGE_POLL and LOGMGR_QUEUE, two wait stats generated by background processes, are also high.

The key question is: How often will SQL Server flush the log?

Now that we have only a single transaction, the log flushes will happen when the log buffer becomes full. The log buffer is a structure that has a dynamic size between 512 bytes and 60KB. SQL Server decides the log buffer size according to the batch and the log record size.

We need to adjust our batch to maximize the log buffer size and have as few log flushes as possible. To achieve this we need to adjust our chunk size, currently one thousand records, to a bigger value.

There isn’t an exact calculation for this. We can do a few tests and use our Extended Events session to capture the log flushes, so we will find the best value.

First, let’s use the following insert for our tests:
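
A sketch of a suitable test insert; @ChunkSize is the value to vary between runs:

DECLARE @ChunkSize INT = 1000;   -- change to 5000, 10000, 20000, 30000 between tests

INSERT INTO dbo.Produtos WITH (TABLOCKX)
       (ProductID, ProductSubCategoryID, UnitPrice)
SELECT TOP (@ChunkSize)
       p.ProductID, p.ProductSubcategoryID, p.ListPrice
FROM AdventureWorks2012.Production.Product AS p
CROSS APPLY master.dbo.spt_values AS v
WHERE v.type = 'P';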

We can run the insert above, query our XE session to find the amount of log flushes we had, then we clear the XE session (stop/start) and execute the insert again with a different chunk size.

Here are the results I found:

Chunk Size          Log Flushes
1,000 records       3
5,000 records       11
10,000 records      22
20,000 records      44
30,000 records      66

You can notice the change in the proportion of log flushes. For one thousand records we have three, but for five thousand we have only eleven, not fifteen. After that, the proportion stays the same: eleven log flushes for every five thousand records.

This doesn’t mean five thousand is the perfect value: now it depends on our IO system. The next test is to run the batch with a few different chunk sizes and check the wait stats.

Here are the results of my tests:

Wait Stats             1,000 records    10,000 records    20,000 records    30,000 records
WRITELOG               244              193               5                 73
LOGMGR_QUEUE           10,522           5,031             5,010             4,888
DIRTY_PAGE_POLL        7,174            917               484               606
PAGELATCH_EX           119              1                 0                 51
SOS_SCHEDULER_YIELD    56,857           11,546            6,602             5,344

You can notice in the above table that the biggest improvement is between one thousand and ten thousand records, and that between twenty and thirty thousand records some wait stats increase, which isn’t good. Therefore, for my test environment, the best chunk size is twenty thousand records.

There are some calculations we can use together with our test results. We can use the log IO limit: the maximum amount of log IO that SQL Server will allow to be “in-flight”, meaning that SQL Server has requested the write operation and is still waiting for the acknowledgement. After this limit, all further IOs will wait.

The log IO limit is 3840K. We already checked our record size in the transaction log: it will be 128 bytes (the insert into the clustered index). 128 * 20,000 = 2.44MB, and it’s important to notice that the real value will be bigger than this, because other operations, such as extent allocations, also happen in the log during the transaction. Calculating with 30,000 records, 128 * 30,000 = 3.66MB, which reaches the log IO limit, so the wait stats increase.

Side Note – Why did we leave this optimization for last?

Finding the right balance of log flushes depends on the size of our transaction in the log. In turn, the size of our transaction in the log depends on whether or not we drop the indexes of our table.

Seventh Execution

Now that we found the chunk size we should use, let’s change our batch to use this value:
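
In the single-transaction sketch shown earlier, that is simply a matter of changing the chunk size variable; everything else stays the same:

DECLARE @ChunkSize INT = 20000;   -- previously 1,000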

Let’s do the cleanup again and execute our batch:

  1. Drop the database if it already exists
  2. Run the script to create the environment
  3. Clear the wait stats
  4. Clear the SQL Server cache
  5. Clear the performance monitor window
  6. Stop/start the logFlush XE session

Seventh execution time: 00:22

Further Improvements

There are a few more improvements you should test in your environment:

  • If you are sure your data is already correct, you can disable constraints and re-enable them after the batch.
  • If your table has triggers, should they really be executed during the batch? You can disable the triggers.
  • Keep track of the wait stats on your server to identify what’s slowing it down.
  • You can check the possibility of using the Bulk-Logged recovery model and minimally logged operations, like INSERT/SELECT. However, the requirements for an operation like INSERT/SELECT to be minimally logged are very restrictive, so it will help only in a few cases. You can see more about this at https://technet.microsoft.com/en-us/library/ms191244(v=sql.105).aspx
  • You can use delayed durability, a new feature of SQL Server 2014, to improve the batch performance.

Conclusion

You saw several ways that we can improve the performance and efficiency of a batch process. If you are developing a regular scheduled batch process, then get the help of a DBA, because DBAs and developers need to work together to make well-performing batches. It is also worth discussing with the DBAs who will be maintaining the database in production how best to keep a logged check of the performance of batch jobs in case circumstances change and batches start to perform poorly in production.