Painless management of a logging table in SQL Server

Tables that log a record of what happens in an application can get very large, especially if they're growing by half a billion rows a day. You'll very soon need to devise a scheduled routine to remove old records, but the DELETE statement just isn't a realistic option with that volume of data. Hugo Kornelis explains a pain-free technique for SQL Server.

Many databases have them: a table that is used to log “everything” that happens. Rows are added to these tables all the time, and we all know to prevent page splits by clustering on an ever-increasing column, like the date/time of the event, right? We just keep adding data, usually in a “write-only” way. The only time these tables are ever read is when troubleshooting a weird issue, or when the blame game has to be played.

Of course, if you keep adding data to a table, it will grow. And it will eventually grow too big. At that time (or, hopefully, before that time), you’ll have to do something to tackle the problem. Unless you are under some legal obligation to retain this logging information “forever”, you can usually just clear out old stuff. For small tables, this is easy:
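A minimal sketch of such a delete – assuming a dbo.Logging table with a LogDate column and a retention period of 14 days:

-- Remove everything older than 14 days (sketch; table and column names assumed)
DELETE FROM dbo.Logging
WHERE  LogDate < DATEADD(day, -14, CAST(CURRENT_TIMESTAMP AS date));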

But for large tables, this is not good. Even if the LogDate column is the clustered index key (which it should be – please don’t add an IDENTITY column to these tables!), you still don’t want to run this if it would delete several million rows. It will bloat the transaction log because all changes have to be logged, and until the transaction has finished (which can take a while), the transaction log will not be cleared, even after a log backup. Locks on the clustered index will not cause a problem (you are deleting rows on one side of the index and adding on the other side), but unless you disable lock escalation, you’ll soon have an exclusive table lock blocking all other access. And even if you do disable lock escalation, locks on the nonclustered indexes will soon start to block other access as well!

The usual solution is to batch the delete:
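Something along these lines – the exact batch size and loop style are a matter of taste; this is just a sketch against the same hypothetical dbo.Logging table:

DECLARE @BatchSize int  = 10000,
        @Cutoff    date = DATEADD(day, -14, CAST(CURRENT_TIMESTAMP AS date));

WHILE 1 = 1
BEGIN
    -- Delete a small batch of expired rows
    DELETE TOP (@BatchSize)
    FROM   dbo.Logging
    WHERE  LogDate < @Cutoff;

    -- Stop as soon as a batch comes up short
    IF @@ROWCOUNT < @BatchSize
        BREAK;
END;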

This prevents lock escalation, and because the transaction autocommits after each statement, it will never hold locks for a long time, and it will not prevent transaction log reuse. It will still cause a significant strain on resources, as all rows have to be deleted, nonclustered indexes have to be updated, and all changes are logged in the transaction log. However, if you schedule this as a task during low-traffic hours and back up your log often enough, this will work nicely for deleting a few million rows.

Really, really, really large!

But what if the table is more than just large? What if you have to delete hundreds of millions of rows? What if the table grows by half a billion rows per day, and by the time you notice the problem, you have to delete 15 days’ worth of data? Do the math – that’s 7.5 billion rows to be deleted from a table that probably holds somewhere between 10 and 15 billion rows. Do you think you can still use the batched delete technique? Good luck with that! The first time I encountered a logging table this size, the client found that the only way for them to free up space was to use TRUNCATE TABLE to remove all the rows. Of course, if something happened that required them to check the logs right after that, they were hosed…

My solution was to partition the logging table. That enabled them to switch out partitions with old data to a helper table (a change of metadata only, so very fast), truncate the helper table (also very fast, because complete pages are deallocated all at once instead of deleting individual rows, and only the deallocation is logged), then repeat until enough space was available. Of course, this also requires them to regularly create new partitions. And though this all worked like a charm, they really wanted this to be set up to work completely automatically. That was a bit harder – but it’s possible, as this article shows.

Getting started

If you are lucky, you can set up the table with partitioning before the system goes live and the table starts filling with data. So, how often are you that lucky? Yeah, thought so. Same for me. Hundreds of millions of rows already in the table.

If you search the internet for how to implement partitioning on an existing table, you’ll be told that you’ll have to rebuild (or even drop and recreate) the clustered index. Do you have any idea how much time and resources that takes when you do that for a table this size? All that time, the table will be locked; that alone forbids this action in a 24×7 shop. If you do have a sufficiently large maintenance window, you’ll still have to deal with a transaction log that will grow, grow, grow, and grow even more.

Before I tell you how I managed to avoid that performance hit, allow me to first set up a demo database and a fake logging table:
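The script below is not the original demo script; it is a minimal stand-in. The database name, the ProcessID and LogText columns, and the index names are my own invention – only dbo.Logging and LogDate are used in the rest of the article:

CREATE DATABASE LogDemo;
GO
USE LogDemo;
GO
-- A fake logging table: clustered on the ever-increasing LogDate column,
-- no IDENTITY, plus one nonclustered index to make the demo realistic
CREATE TABLE dbo.Logging
    (LogDate   datetime      NOT NULL,
     ProcessID int           NOT NULL,
     LogText   nvarchar(500) NOT NULL);
GO
CREATE CLUSTERED INDEX cix_Logging
    ON dbo.Logging (LogDate);
CREATE NONCLUSTERED INDEX ix_Logging_ProcessID
    ON dbo.Logging (ProcessID);
GO
-- (Fill the table with a generous amount of fake history here)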

Instead of changing the table to partitioned, or copying all data (which effectively has the same problems), I want to create a new, empty table with the correct schema and the required partitioning scheme. I can then use the technique of partition switching to switch in the data as a metadata-only change. Unfortunately, switching a table into a partition of a partitioned table requires a trusted CHECK constraint to be in place, and adding a trusted CHECK constraint requires SQL Server to read all the rows in the table. During this operation, there is a schema modification (Sch-M) lock on the table, excluding all access to the table. And though SQL Server is smart enough to avoid reading all rows from the clustered index, it will read all rows from one of the nonclustered indexes – which, on a table this size, can still take quite some time. I was unable to find a way around this. It’s better than having to rebuild the entire clustered index, obviously, but if anyone knows a way to avoid this hit as well, it would be a great benefit for this scenario.

Because I didn’t want to hardcode the date boundary, I use a query with MAX to find the current highest value for LogDate (for this query, SQL Server is smart enough to read only the last row from the clustered index – something it should also do when adding the CHECK constraint, but unfortunately doesn’t). If this is earlier than today, I increase it to today (so that additional log data coming in after the change will still go to the first partition and the second partition remains empty – I’ll explain why this is important later in the article). I then use this to calculate the boundary for the partition function, and for the CHECK constraint. Here is the code:
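A sketch of that code, under the assumptions from the demo script above. The partition scheme name psLog and the use of the PRIMARY filegroup are my own choices; the CHECK constraint is built with dynamic SQL because a constraint definition cannot reference a variable:

DECLARE @MaxDate  date,
        @Boundary datetime,
        @sql      nvarchar(1000);

-- Highest LogDate currently in the table (SQL Server reads just one row
-- from the clustered index for this)
SELECT @MaxDate = CAST(MAX(LogDate) AS date)
FROM   dbo.Logging;

-- Never use a date in the past; fall back to today
IF @MaxDate < CAST(CURRENT_TIMESTAMP AS date)
    SET @MaxDate = CAST(CURRENT_TIMESTAMP AS date);

-- The boundary is the start of the next day
SET @Boundary = DATEADD(day, 1, CAST(@MaxDate AS datetime));

-- One boundary value gives two partitions; RANGE RIGHT means the boundary
-- value itself belongs to the second (empty) partition
CREATE PARTITION FUNCTION pfLog (datetime)
    AS RANGE RIGHT FOR VALUES (@Boundary);
CREATE PARTITION SCHEME psLog
    AS PARTITION pfLog ALL TO ([PRIMARY]);

-- The trusted CHECK constraint that is needed before the existing table
-- can be switched into partition 1 (this is the expensive step)
SET @sql = N'ALTER TABLE dbo.Logging WITH CHECK
             ADD CONSTRAINT ck_Logging_LogDate
             CHECK (LogDate < ''' + CONVERT(char(8), @Boundary, 112) + N''');';
EXEC (@sql);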

I can now create the new table, making sure to use the exact same columns and also to create the exact same indexes as the original table. The only difference is that the table and all its indexes have to be allocated on the partition scheme instead of a filegroup.
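For the demo table above, that could look like this (dbo.Logging_New and its index names are placeholders of my own):

-- Same columns and indexes as dbo.Logging, but everything allocated on
-- the partition scheme
CREATE TABLE dbo.Logging_New
    (LogDate   datetime      NOT NULL,
     ProcessID int           NOT NULL,
     LogText   nvarchar(500) NOT NULL)
    ON psLog (LogDate);
GO
CREATE CLUSTERED INDEX cix_Logging_New
    ON dbo.Logging_New (LogDate)
    ON psLog (LogDate);
CREATE NONCLUSTERED INDEX ix_Logging_New_ProcessID
    ON dbo.Logging_New (ProcessID)
    ON psLog (LogDate);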

And now, finally, I can make the change. I switch the old table into the first partition of the new one; this change is implemented by changing metadata only, but it will result in all data effectively being in the new table, and the original table being empty. I then drop the old table, and rename the new table and all its indexes. (Renaming indexes can be skipped if the company has a naming policy that allows for freedom in the choice of index names). I do all this in a single transaction so that all access to the table is blocked until all changes are done, and to ensure that in case of an error, all changes are undone. Regardless of the size of the table, this transaction should not take more than a few milliseconds – though it can take some time on a busy system before it gets the locks that are required for the partition switch.
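A sketch of that transaction, continuing with the placeholder names used above:

BEGIN TRANSACTION;

-- Metadata-only change: all existing rows end up in partition 1 of the
-- new table, and dbo.Logging is left empty
ALTER TABLE dbo.Logging
    SWITCH TO dbo.Logging_New PARTITION 1;

-- Get rid of the old, now empty table and take over its name
DROP TABLE dbo.Logging;
EXEC sp_rename 'dbo.Logging_New', 'Logging';
EXEC sp_rename 'dbo.Logging.cix_Logging_New', 'cix_Logging', 'INDEX';
EXEC sp_rename 'dbo.Logging.ix_Logging_New_ProcessID', 'ix_Logging_ProcessID', 'INDEX';

COMMIT TRANSACTION;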

After executing this code, your Logging table will have two partitions. The first one contains all the existing data, and data flowing in will also go to this partition. The second partition will be empty, and stay empty until midnight. After that, new data will go to this partition. You must make sure to start regular maintenance of the partitioned logging table before that, or you’ll get new problems!

I recommend immediately invoking the MaintainLogTable stored procedure (explained below) to ensure that new partitions are made for the next few days. The table will still grow for some time, because the original contents are now all in the first partition, which will not be removed until all the data in that partition expires. So you might want to consider running with a lower retention period at first, or investigating whether you can temporarily use some extra storage, until this first, large partition is removed. After that, the dbo.Logging table will have, and keep, its normal working size – provided you ensure that regular maintenance is done at set times (preferably daily).

Regular maintenance

So now you have a logging table that is partitioned, either because you allocated an empty table that way before the database goes live, or because you used the method described above to create such a table and get all the data in one of the partitions. Either way, it is very important to ensure that the “last” partition always remains empty. So assuming that the logging date is always equal to (or, maybe, less than) the current time, you should ensure that the last partition boundary is set to a date in the future. The reason that this partition has to be empty is that it is very easy to create new partitions by simply splitting an existing partition. If the partition is empty, this is a metadata operation. But if there is data in the partition, SQL Server will have to process all the rows and put each in the correct new partition – something we will try to avoid, given the size of the table!

In order to ensure smooth operation of the partitioned logging table, we have to create new, empty partitions to receive new incoming data every day (BEFORE data starts going into the last partition!), and remove partitions that are old enough to only hold expired data. All this is done in the MaintainLogTable stored procedure. (The code of this procedure is attached to this article). So all you have to do is create that stored procedure on your system (ensure that you change the names of the table and the partition function to match those on your system!), create the helper table used for purging expired data, and create a job that invokes this procedure on a daily schedule.

The stored procedure

I’ll describe how the code works later in this article; this section only covers how to use it. The stored procedure is called with three parameters, all of which have defaults.

The first parameter is @DaysToKeep. This specifies how long old data is kept. This is often a tradeoff. Keeping data for a longer time can be useful for some troubleshooting (or blame-gaming) scenarios – but with hundreds of millions of rows coming in each day, that incurs a significant cost. The default of 14 for this parameter ensures that data is only removed when it’s at least two weeks old.

The second parameter is @DaysInFuture. This parameter determines how many new partitions will be made. It should always be at least 1, so that a new partition is made for tomorrow’s data. If you want some resilience to possibly skipping the maintenance job a day, set it higher. This will create a few extra empty partitions. Aside from a few rows in the system tables, there is no cost involved with this – that’s why I’ve set the default to 7, so that you are still okay if you forget to run the maintenance job for a week – remember that the last partition should always remain empty; creating partitions for several days in the future ensures that this condition will be met, even when the maintenance job doesn’t run the next few days. (If you plan to run the job only once a week, set it to a higher number!)

The third parameter, @CurrentDate, should never be used in normal operation. I used it to help me test the code without having to wait for the next day, and I decided to leave it in, in case you ever run into a situation where you want the procedure to behave as if it were running on a different date.

So if your logging table stores data that has to be kept for at least a month, and you think that the default of preparing 7 days in advance is okay, you can simply add this line of code to your daily maintenance plan:
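Assuming the procedure was created in the dbo schema, that line would look like this:

EXEC dbo.MaintainLogTable @DaysToKeep = 31;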

If you plan to run the procedure once a week and think the default retention period of 14 days is okay, you use:
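For example (the value 10 is my own choice: a week between runs plus a few days of slack):

EXEC dbo.MaintainLogTable @DaysInFuture = 10;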

And if you need to override both defaults, your maintenance job would include something like this:
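For instance, combining the two examples above:

EXEC dbo.MaintainLogTable @DaysToKeep = 31, @DaysInFuture = 10;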

Helper table

The procedure requires the existence of an empty helper table with the exact same schema and indexes as the logging table – except that this helper table and its indexes are not partitioned, but allocated on a normal filegroup. This has to be the same filegroup that the partition scheme uses for its partitions, otherwise the partition switch will fail.

This table will be used by the MaintainLogTable stored procedure for fast purging of expired data – expired partitions are switched out to this table (a metadata operation), and then the data is quickly purged by using TRUNCATE TABLE.
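For the demo table used in this article, such a helper table could be created like this (the column and index names mirror my earlier placeholders, and PRIMARY is assumed to be the filegroup used by the partition scheme):

CREATE TABLE dbo.Logging_Purge
    (LogDate   datetime      NOT NULL,
     ProcessID int           NOT NULL,
     LogText   nvarchar(500) NOT NULL)
    ON [PRIMARY];
GO
CREATE CLUSTERED INDEX cix_Logging_Purge
    ON dbo.Logging_Purge (LogDate)
    ON [PRIMARY];
CREATE NONCLUSTERED INDEX ix_Logging_Purge_ProcessID
    ON dbo.Logging_Purge (ProcessID)
    ON [PRIMARY];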

Manual intervention is required

Did I already mention that the last partition must always be empty? Well, if it isn’t, the stored procedure will fail. It will throw an error telling you that the last partition is not empty and that manual intervention is required.

The most probable cause of this is that you skipped maintenance for a few days, and data has started to flow in the last partition. If this is the case, the best way to solve it is to simply split the partition manually, with the new boundary such that all existing data goes to the left half of the split, and the right half is empty. SQL Server will need to scan the data to check if any rows need to move to the new partition, but since this is not the case, the operation should still finish fairly quickly – at least a lot quicker than when you use a boundary such that part of the data actually has to move!
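In its simplest form, and assuming the pfLog/psLog names used earlier, such a manual split looks like this (the boundary date is of course hypothetical – pick one just after your newest row):

DECLARE @NewBoundary datetime = '20250401';  -- hypothetical: just after the newest LogDate

ALTER PARTITION SCHEME psLog NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfLog() SPLIT RANGE (@NewBoundary);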

After this, you will have a partition that spans a few days, and none of the data in that partition will be purged until the most recent log entry it holds expires. So your table will temporarily require some extra storage space.

Another cause can be an application error that caused a row to be entered with a LogDate that is far in the future. The fix for this is to change or delete that row. You cannot use the same approach as above, because then you’d get a single partition covering maybe a year, and that would take way too much storage space!

The first query below will show you all your partitions, with the number of rows in them and their boundaries. The second query will show you the highest value actually in use. Use these (after changing the names of the table, partition scheme, and partition function to match your database) to assess the appropriate way to fix the problem.
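Sketches of both queries, written against the demo names used earlier (pfLog, dbo.Logging):

-- All partitions of dbo.Logging, with row counts and upper boundaries
-- (the last partition has no upper boundary, hence the NULL)
SELECT    p.partition_number,
          p.rows,
          prv.value AS upper_boundary
FROM      sys.partitions AS p
LEFT JOIN sys.partition_range_values AS prv
       ON prv.function_id = (SELECT function_id
                             FROM   sys.partition_functions
                             WHERE  name = N'pfLog')
      AND prv.boundary_id = p.partition_number
WHERE     p.object_id = OBJECT_ID(N'dbo.Logging')
AND       p.index_id  = 1
ORDER BY  p.partition_number;

-- The highest LogDate value actually present in the table
SELECT MAX(LogDate) AS MaxLogDate
FROM   dbo.Logging;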

If you find that there is suspect data, with dates in the far future, you’ll really have to fix this based on your specific situation; I cannot give you the code for that.

How it works

Of course, you are not the kind of DBA who would ever blindly copy code from the internet. You first want to know exactly how it works. Not only because of a healthy dose of paranoia, but also because that would help you learn, and ultimately get better at your job. Good for you!

In this section, I’ll show the code of the stored procedure and explain what it’s doing.

The first part of the procedure declares the parameters and handles their defaults. (The actual code also contains a comment block with a short explanation of how to use the code, and a copyright notice.) It is not possible to use a system function like CURRENT_TIMESTAMP in the procedure’s parameter declaration, hence the extra conditional code to set @CurrentDate to CURRENT_TIMESTAMP if no other value was specified.
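The actual procedure is attached to the article; as an illustration only, its opening could look something like this:

CREATE PROCEDURE dbo.MaintainLogTable
    @DaysToKeep   int      = 14,
    @DaysInFuture int      = 7,
    @CurrentDate  datetime = NULL
AS
BEGIN
    SET NOCOUNT ON;

    -- CURRENT_TIMESTAMP is not allowed as a parameter default,
    -- so it is substituted here when no value was passed in
    IF @CurrentDate IS NULL
        SET @CurrentDate = CURRENT_TIMESTAMP;

    -- (the real work follows; see the fragments below)
END;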

The next part is a safety check, and it is fairly obvious. The catalog view sys.partitions is used to find the number of rows in the rightmost partition, and if it’s not zero, an error is raised and execution halts. Someone will have to intervene manually before automated management of the partitioned table can resume. If your logging table has a different name or is allocated in a different schema, you’ll have to change this in the query, and in the error message.

There will be rows in sys.partitions for each of the indexes on the table; I filter on index_id 1 (the clustered index) because there will always be a clustered index, and I can be 100% sure that this is not a filtered index – so a row count of 0 really means that there are no rows in the partition.
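A standalone version of that check might look like this (the exact error text in the attached procedure will differ):

DECLARE @RowsInLastPartition bigint;

-- Row count of the rightmost partition, taken from the clustered index
SELECT TOP (1) @RowsInLastPartition = p.rows
FROM   sys.partitions AS p
WHERE  p.object_id = OBJECT_ID(N'dbo.Logging')
AND    p.index_id  = 1
ORDER BY p.partition_number DESC;

IF @RowsInLastPartition > 0
BEGIN
    RAISERROR ('The last partition of dbo.Logging is not empty; manual intervention is required.', 16, 1);
    RETURN;
END;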

To find the current highest boundary value (the threshold between the last and the second-to-last partitions), I query the catalog views sys.partition_functions (to find the internal numeric identifier for the partition function by name – please don’t forget to change the filter in the query if you have named your partition function something other than ‘pfLog’) and sys.partition_range_values (to get the actual boundary values).
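As a standalone fragment, that query could look like this:

DECLARE @HighestBoundary datetime;

-- Highest boundary value = threshold between the last two partitions
SELECT TOP (1) @HighestBoundary = CAST(prv.value AS datetime)
FROM       sys.partition_functions    AS pf
INNER JOIN sys.partition_range_values AS prv
        ON prv.function_id = pf.function_id
WHERE      pf.name = N'pfLog'
ORDER BY   prv.boundary_id DESC;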

I then start a loop that will continue executing until there are partitions up to @DaysInFuture in the future. In the loop, I compute the new boundary value, then use ALTER PARTITION SCHEME and ALTER PARTITION FUNCTION (don’t forget to change the names here as well!) to tell SQL Server where to allocate the next partition, then split the currently highest partition based on this new boundary date.

The reason I do this in a loop is resilience against skipped maintenance jobs. If you missed a day and then run this procedure the next day, the loop will run twice, creating two new partitions for two new days. With a simple IF, you’d get a single partition spanning two days, which would make it harder to manage the storage requirements of the logging table. An added benefit of this is that it’s very easy to get started with this method if you’re setting up a new system – just create the logging table using a partition function with a single date boundary that can be anywhere in the past; the maintenance procedure will immediately create partitions for all days since then, up to the specified number of days in the future. (And the second half of the procedure will then promptly remove those that are already older than the specified retention period.) Just make sure to pick a date that is less than about 40 years in the past, or you’ll run into problems with SQL Server’s 15,000-partition limit.
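Put together, and again assuming the psLog scheme on the PRIMARY filegroup, the splitting loop could look roughly like this as a standalone script:

DECLARE @HighestBoundary datetime,
        @DaysInFuture    int      = 7,
        @CurrentDate     datetime = CURRENT_TIMESTAMP;

SELECT TOP (1) @HighestBoundary = CAST(prv.value AS datetime)
FROM       sys.partition_functions    AS pf
INNER JOIN sys.partition_range_values AS prv
        ON prv.function_id = pf.function_id
WHERE      pf.name = N'pfLog'
ORDER BY   prv.boundary_id DESC;

-- Keep adding daily partitions until we are @DaysInFuture days ahead
WHILE @HighestBoundary < DATEADD(day, @DaysInFuture, CAST(CAST(@CurrentDate AS date) AS datetime))
BEGIN
    SET @HighestBoundary = DATEADD(day, 1, @HighestBoundary);

    -- Tell SQL Server where the next partition goes, then split the
    -- empty rightmost partition at the new boundary (metadata only)
    ALTER PARTITION SCHEME psLog NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION pfLog() SPLIT RANGE (@HighestBoundary);
END;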

The final part of the procedure removes expired data. You will, again, have to modify the name of the partition function (in three places), the name of the table (in one place), and the name of the helper table used for purging data (in two places). I use a query very similar to the one earlier in the procedure, but with the opposite ORDER BY direction, to find the date that defines the boundary between the two oldest partitions. If this is more than the specified retention period in the past, I can purge the oldest partition – and I again repeat this in a loop, so that I catch up automatically even after skipping maintenance for a few days.

The actual swapping uses ALTER TABLE to switch some pointers in the metadata – the pointers to the actual data for the first partition of dbo.Logging (which contains a lot of data that should all be removed) and those for the actual data of dbo.Logging_Purge (that should be empty, because it is never used) are swapped around. After this, dbo.Logging_Purge now contains the data that was in the first partition of dbo.Logging and that we want to purge, which is done by the TRUNCATE TABLE statement. The first partition of the actual logging table should now be empty, so that the ALTER PARTITION statement that merges the first two partitions can be performed as a metadata-only change.

Because this is done in a loop, and because manual intervention with the partition ranges can cause partition boundaries to be less or more than a day apart, I end the loop with a copy of the query to get what, after the merge, is the boundary between the partitions that are now the two oldest. This value is then used to determine if the loop has to be executed again.
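The purging half, as a standalone sketch with the names used throughout this article (the attached procedure may differ in details):

DECLARE @OldestBoundary datetime,
        @DaysToKeep     int      = 14,
        @CurrentDate    datetime = CURRENT_TIMESTAMP;

-- Lowest boundary value = threshold between the two oldest partitions
SELECT TOP (1) @OldestBoundary = CAST(prv.value AS datetime)
FROM       sys.partition_functions    AS pf
INNER JOIN sys.partition_range_values AS prv
        ON prv.function_id = pf.function_id
WHERE      pf.name = N'pfLog'
ORDER BY   prv.boundary_id ASC;

-- Everything in partition 1 is older than @OldestBoundary, so it can be
-- purged once that boundary itself is past the retention period
WHILE @OldestBoundary <= DATEADD(day, -@DaysToKeep, @CurrentDate)
BEGIN
    -- Swap the metadata pointers, then throw away the expired rows
    ALTER TABLE dbo.Logging SWITCH PARTITION 1 TO dbo.Logging_Purge;
    TRUNCATE TABLE dbo.Logging_Purge;

    -- Partition 1 is now empty, so merging away its boundary is
    -- a metadata-only change
    ALTER PARTITION FUNCTION pfLog() MERGE RANGE (@OldestBoundary);

    -- Boundary between what are now the two oldest partitions
    -- (reset first, so the loop ends if no boundaries are left)
    SET @OldestBoundary = NULL;
    SELECT TOP (1) @OldestBoundary = CAST(prv.value AS datetime)
    FROM       sys.partition_functions    AS pf
    INNER JOIN sys.partition_range_values AS prv
            ON prv.function_id = pf.function_id
    WHERE      pf.name = N'pfLog'
    ORDER BY   prv.boundary_id ASC;
END;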

Variations

I created this method to deal with a logging table that received hundreds of millions of rows per day. But it can be used for other purposes as well. Smaller logging tables might not really need this method, but it doesn’t hurt – so you can choose to use it for all logging tables, regardless of the daily volume of rows. Or even for all other tables that need to be partitioned by day.

And it doesn’t stop there. If you don’t need daily partitions, but would like to have partitions for each week or each month (or, with an even higher volume of logging, maybe even every hour), you can easily change the procedure to support that. Unfortunately, SQL Server does not support using a variable for the datepart parameter of the DATEADD and DATEDIFF functions, otherwise I would already have added this. But a quick search and replace can easily fix this for you.

If you want to use automatically maintained partitioning on multiple tables, you could try to change this procedure to support a table name parameter. But that would require you to use dynamic SQL, and for lots of reasons I try to avoid that as much as possible. It’s probably much easier to just make a few copies of this stored procedure, give them different names, use search and replace to change the names of the table, the helper table used for purging, the partition function, and the partition scheme, and then create all those stored procedures in SQL Server. There is no measurable cost involved in having a few extra stored procedures, so really, there is no need at all to resort to dynamic SQL!

Conclusion

Partitioning is an ideal way to manage very big tables where old rows regularly have to be purged. But it does require that new partitions are created and old partitions are actually removed at regular intervals. The stored procedure I presented in this article helps you by completely automating these tasks. The only thing you still need to do is to make sure the procedure is regularly executed, e.g. by scheduling it as part of your maintenance tasks.

Moving an existing, non-partitioned table into a partitioning scheme normally requires an enormous amount of resources and considerable downtime. Using partition switching instead of rebuilding the index reduces this by a huge amount. However, it does require that a CHECK constraint is created first, which will still take a lot of resources and lock the logging table for a prolonged period – though not for as long as rebuilding the index would take.