Friendly Deletes in SQL Server

One of the first things you learn when working with SQL Server, and other transaction-based SQL systems, is that set-based operations perform best. If you are querying data, a cursor pulling individual rows doesn’t perform as well as a single query, whether that cursor is on the client side or the server side. A set-based operation is more efficient, runs faster, locks less, and is generally better than submitting multiple queries.

This is also generally true with delete statements. This post covers the exceptions to that rule. Large delete statements impacting many rows and large amounts of data (millions of rows and many gigabytes) can actually perform worse as a single statement. In transactional systems such as SQL Server, each transaction follows the ACID standard. Part of that standard ensures that statements either complete or roll back fully – partial transactions are not allowed. For a delete statement, that means all of the rows specified by the delete are removed from the table, or none are removed and the data rolls back to its original state. The delete and rollback behavior must be predictable and consistent or the data could be left in an inconsistent, unreliable state. Very large deletes present some challenges and need to be treated differently in production systems.

Use Cases for Smaller Deletes

When a table, such as a log table or a staging table used for imports, becomes too large, you may not notice the growth at first, or you may be brought in to help after the problem appears. It is natural to try to delete all of the excess rows with a single statement. If you’ve done this before, you know it can take an extremely long time and cause problems while it runs. And if it fails, the transaction takes a long time to roll back.

There are several advantages to using smaller, friendly deletes in these situations. You have to consciously decide to use many smaller transactions, but it is an effective technique for these edge cases. Because it is an edge case, you don’t want to use this as a standard pattern for all deletes. I would only use it when you see an issue or know that the data to be deleted will be problematic. But when it is needed, the benefits outweigh the small increase in development time.

The transaction log can increase in size considerably when a single transaction is used for a large delete. Using several, or even many, small deletes can minimize log growth or eliminate it entirely. Increasing the size of the transaction log incurs a performance hit. It also increases the size of the log files unnecessarily, which can make managing the system more difficult.

Several smaller deletes can actually be faster than a delete in a single, set-based transaction. This may be the most unintuitive aspect of friendly deletes. When working with SQL Server, one of the near-unbreakable rules is to use set-based operations and avoid loops when working with your data. It is a good rule, until you need to break it. Friendly deletes are one of the patterns where you break that rule.

Breaking up the large delete also means there are multiple transactions for the delete. This makes the process restart-able. Any delete sets that complete are done and won’t need to be run again if an error happens later in the process; the next execution of the script continues at the next iteration. This is useful when the process is scheduled and happens on a regular basis, and also if there is a problem with one of the smaller delete transactions. With a friendly delete, the rollback happens quickly. When a large, single delete transaction fails, the rollback can take a very long time, even longer than the initial delete. Small transactions make this much less painful and easier to manage.

Reducing the size of the transaction also reduces the blocking potential of the delete. Each iteration of the friendly delete process completes more quickly than the single set-based delete, which allows other queries to run between iterations. The entire delete may take longer than normal, but you can tune the deletes so the blocking happens quickly enough that users don’t notice the difference. Smaller deletes also reduce the chance that the lock is escalated to a table lock. If the entire table is locked with an exclusive lock so the delete can complete, regular processes won’t be able to access that table, even if you are using row-level locking. This may make the overall process take longer, but it becomes a much more usable, friendly delete.

Between the smaller transaction size, decreased locking and blocking, reduced log file usage and growth, reduced chance of timing out and rollbacks, and overall improved user experience during the delete process, friendly deletes are a good pattern when needed. They are excessive in many cases, but very large deletes need to be managed differently than normal deletes in an application.

Methods for Friendly Deletes

I’ll present a few methods for friendly deletes, but the actual method isn’t important. There are a few rules to follow, and if those rules are met, the strategy works well.

Each delete subset should be a complete unit of work. If no additional deletes are performed, the data should still be in a consistent state. It’s hard to get this part wrong: as long as every row in the subset qualifies for deletion, the data is consistent when the iteration finishes. This is also what makes the process restart-able. If the entire process fails or times out, the next run should pick up where the previous execution stopped.

The other main rule is to delete small subsets of data. The subset should be small enough that each iteration completes quickly and doesn’t cause excessive blocking, but as large as possible so the entire process finishes quickly. You could delete each row individually, but that would take even longer with no added benefit compared to small subsets. You need to find a balance between a single row and the complete set of data. This is another rule that is hard to get too far wrong. You will want to test your subsets before you try this in a production environment, but any reduction in the size of the delete set will be an improvement. After that, it is a matter of degree. I often use 10,000 rows as a starting point and adjust the query from there based on how it performs.

Potential Issues with Friendly Deletes

There aren’t many downsides to using a friendly delete process, but I’ll mention a few things to consider. The primary consideration is that this is a more complicated pattern than a simple, set-based delete. It requires more development and testing. It is possible to create an endless loop or delete excessive rows, but these issues would both be due to bugs in the script. This is another reason for testing your process thoroughly.

The delete batch script can time out, especially with very large tables that require a lot of deletes. Your friendly delete process should be restart-able, which takes care of the issue; you just have to be prepared for that possibility in your job. The method you use to schedule the friendly delete will determine the timeout behavior and what happens when a timeout occurs. Be prepared to handle a timeout exception if this is a possibility with your execution method.

This isn’t much of a liability, but these small subsets of data commit very quickly. If you make a mistake, you likely won’t have a chance to roll back your changes. This is mitigated by testing. This type of script should not be put into production without thorough testing in your lower-level environments. That is true of all scripts going to production, but the impact is much larger if you have to perform a restore in production.

Examples

For the following examples, I created two test log tables and populated them with semi-random data. The first table has more rows and the log column is generally wider. It contains 6 million rows and uses about 26 GB of data space. The second table has a four-column natural key and is used to demonstrate methods to remove duplicates. It contains 5 million rows and uses approximately 4 GB of data space.

The starting sizes of the tables are shown below.
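The original size report isn’t reproduced here. As a minimal sketch, a call like the following returns the row count and reserved space for each test table. The table names (dbo.LargeLog and dbo.NaturalKeyLog) are placeholders used throughout these examples, not the names from the original scripts.

```sql
-- Report rows, reserved, data, and index space for the two test tables.
-- Table names are illustrative placeholders.
EXEC sys.sp_spaceused @objname = N'dbo.LargeLog';
EXEC sys.sp_spaceused @objname = N'dbo.NaturalKeyLog';
```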

I performed a shrink operation on the log file when I was done to give a better baseline for log growth. Normally, you would want the log files to be bigger in a production system. Ten percent of the data space is a reasonable starting size for transactional systems, but this will vary depending on usage patterns.

When the test bed was ready, I did a full database backup. This allowed me to quickly reset everything to an identical starting point for each test. These tests don’t rely on any other tables, so they could be run in any database, but the WideWorldImporters sample database was used in this instance.
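The backup and restore steps aren’t shown in the original text; a sketch of the reset process, assuming a local backup path, might look like this.

```sql
-- One-time full backup of the test database (backup path is illustrative).
BACKUP DATABASE WideWorldImporters
TO DISK = N'C:\Backups\WideWorldImporters_DeleteTests.bak'
WITH INIT, COMPRESSION;

-- Before each test, restore the database to the identical starting point.
USE master;
ALTER DATABASE WideWorldImporters SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
RESTORE DATABASE WideWorldImporters
FROM DISK = N'C:\Backups\WideWorldImporters_DeleteTests.bak'
WITH REPLACE;
ALTER DATABASE WideWorldImporters SET MULTI_USER;
```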

A note on log files

The log file will increase in size, but it does not automatically shrink. The assumption is that if the log file hits a certain size once, it will need to be that large again in the future. There is a performance impact to increasing the file size and also to shrinking files, so it isn’t standard to shrink the log file automatically. The actual data in the log file is emptied out as transactions complete, so when a one-time process increases the size of the log, the file becomes unnecessarily large. The other time you will see a disproportionately large log file is when the database is in the FULL recovery model and transaction log backups aren’t happening. In this scenario, data isn’t flushed from the transaction log and it continues to increase in size. This is an indication of trouble with your maintenance procedures and you will want to address it immediately. In the SIMPLE recovery model, the transaction log only grows when a transaction needs more space than the file currently provides.
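To see this behavior on your own system, a quick check like the following (not part of the original article) shows current log usage and the reason each log can’t be truncated.

```sql
-- Size and percent-used for every transaction log on the instance.
DBCC SQLPERF (LOGSPACE);

-- Recovery model and the current reason log truncation is waiting, per database.
SELECT name, recovery_model_desc, log_reuse_wait_desc
FROM sys.databases;
```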

Delete Example 1 – Single transaction

The first example is the baseline, the query you are trying to avoid. This query deletes all impacted rows in a single transaction. When a query like this runs in a production environment, you will notice the issue.
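The baseline statement looks roughly like the following. The table name, column name, and cutoff date are placeholders standing in for the test table and retention rule; the shape of the statement is what matters.

```sql
-- Baseline: remove every qualifying row in one set-based transaction.
-- dbo.LargeLog, LogDate, and the cutoff date are illustrative.
DELETE FROM dbo.LargeLog
WHERE LogDate < '2024-01-01';
```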

The single transaction reduced the data size from about 26GB to about 2GB. The log size increased from 100MB to about 115GB. The entire process took 822 seconds. That’s our starting point. Ideally, the next process would improve the time and the log growth would be limited. At a minimum, we want smaller transactions and minimal log growth even if the time doesn’t improve. Minimizing the time for each loop will at least allow other queries to run.

Delete Example 2 – Delete Loop

The next example uses the same table, but uses a WHILE loop to process 10,000 rows per delete. It is more complicated, but not overly so. The rows to be deleted are put into a temp table and that is used for the looping process. When the temp table is empty, the deletes are complete.
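A minimal sketch of this pattern is shown below. It isn’t the article’s exact script; the table and column names (dbo.LargeLog, LogId, LogDate) and the cutoff date are assumptions, but the structure matches the description: load the qualifying keys into a temp table, then delete 10,000 rows per iteration until the temp table is empty.

```sql
-- Collect the keys of every row that should be removed.
SELECT LogId
INTO #RowsToDelete
FROM dbo.LargeLog
WHERE LogDate < '2024-01-01';

CREATE CLUSTERED INDEX CIX_RowsToDelete ON #RowsToDelete (LogId);

-- Holds the keys handled by the current iteration.
CREATE TABLE #CurrentBatch (LogId int NOT NULL PRIMARY KEY);

-- Each pass through the loop is its own small transaction.
WHILE EXISTS (SELECT 1 FROM #RowsToDelete)
BEGIN
    -- Take the next 10,000 keys off the driver table.
    DELETE TOP (10000)
    FROM #RowsToDelete
    OUTPUT DELETED.LogId INTO #CurrentBatch;

    -- Remove those rows from the base table.
    DELETE l
    FROM dbo.LargeLog AS l
    INNER JOIN #CurrentBatch AS b
        ON b.LogId = l.LogId;

    TRUNCATE TABLE #CurrentBatch;
END;

DROP TABLE #CurrentBatch;
DROP TABLE #RowsToDelete;
```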

This example took 130 seconds to complete, even though it isn’t a single set-based statement. The starting and ending data sizes are identical to the previous example, which is expected. The other benefit of this example is the log growth: it went from 100MB to 1060MB, a huge improvement over the baseline. As mentioned above, in a production system the log files would likely be larger than this to begin with and wouldn’t increase in size due to this delete.

Delete Example 3 – Dedupe via a CTE

This example uses the second log table. It assumes a natural key of LogYear, LogDay, LogCategory, and LocationCode for these tests. The reason for the delete in this use case is deduplication of data, a common need in ETL packages.
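A common way to express this is a CTE with ROW_NUMBER over the natural key, deleting everything but the first row in each group. The table name is a placeholder, and the ORDER BY that picks the surviving row is arbitrary here; the original article’s tie-breaking logic isn’t shown.

```sql
-- Keep one row per natural key and delete the rest in a single transaction.
-- dbo.NaturalKeyLog is an illustrative table name.
WITH Dupes AS
(
    SELECT ROW_NUMBER() OVER (
               PARTITION BY LogYear, LogDay, LogCategory, LocationCode
               ORDER BY (SELECT NULL)) AS RowNum
    FROM dbo.NaturalKeyLog
)
DELETE FROM Dupes
WHERE RowNum > 1;
```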

This example runs fairly quickly, so the looping may not be necessary depending on your scenario. It finished in 74 seconds. Data went from about 4GB to 4MB and the log file increased from 100MB to about 22GB. The log file growth is a potential issue. If this is running in a production, transactional system, the 74 second set-based delete may also cause issues. If the table were larger, that time would obviously increase. In a warehouse scenario, this may be sufficient and a friendly delete could be skipped.

Delete Example 4 – Dedupe with a partition

This example continues with the table needing deduplication. It shows deletes happening by using a partition number in a loop. The distribution isn’t completely even, but it is an easy way to make a small loop, and it can be modified to fit your business logic. You don’t want the loops to be too big or too small; it takes some analysis of your data to find the right balance.
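The article’s exact partitioning code isn’t reproduced here. As one hedged interpretation, the sketch below numbers the (LogYear, LogDay) combinations in a temp table and dedupes one numbered slice per iteration; the slice sizes aren’t perfectly even, which matches the description above. Table and column names follow the placeholders used earlier.

```sql
-- Assign a partition number to each (LogYear, LogDay) combination.
SELECT DENSE_RANK() OVER (ORDER BY LogYear, LogDay) AS PartitionNumber,
       LogYear,
       LogDay
INTO #Partitions
FROM dbo.NaturalKeyLog
GROUP BY LogYear, LogDay;

DECLARE @PartitionNumber int = 1,
        @MaxPartition int,
        @Year int,
        @Day int;

SELECT @MaxPartition = MAX(PartitionNumber) FROM #Partitions;

WHILE @PartitionNumber <= @MaxPartition
BEGIN
    SELECT @Year = LogYear, @Day = LogDay
    FROM #Partitions
    WHERE PartitionNumber = @PartitionNumber;

    -- Dedupe only the rows that fall in the current slice.
    WITH Dupes AS
    (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY LogYear, LogDay, LogCategory, LocationCode
                   ORDER BY (SELECT NULL)) AS RowNum
        FROM dbo.NaturalKeyLog
        WHERE LogYear = @Year
          AND LogDay = @Day
    )
    DELETE FROM Dupes
    WHERE RowNum > 1;

    SET @PartitionNumber += 1;
END;

DROP TABLE #Partitions;
```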

This example doesn’t run as quickly as the single, set-based operation, but you can see that the log file doesn’t grow much. The log only went from 100MB to 164MB. Each iteration is also very fast, around 150 ms per loop in my testing. That would not create a blocking issue and helps keep the database responsive.

Delete Example 5 – Dedupe with a static sized loop

This pattern is the same as example 2 above. A static number of rows is deleted during each loop. The only difference between the two is the logic for determining which rows get deleted. The earlier example selects rows by date, while this example deduplicates based on business logic.

The following is the relevant code snippet showing the actual delete portion.
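The original snippet isn’t reproduced here; the sketch below shows the idea under the same placeholder names, with an assumed surrogate key column (LogId) used to target individual duplicate rows. The duplicates are flagged once with ROW_NUMBER, and the loop then removes a static 10,000 of them per iteration, structured the same way as example 2.

```sql
-- Flag every duplicate row, keeping the lowest LogId per natural key.
-- dbo.NaturalKeyLog and LogId are illustrative names.
WITH Ranked AS
(
    SELECT LogId,
           ROW_NUMBER() OVER (
               PARTITION BY LogYear, LogDay, LogCategory, LocationCode
               ORDER BY LogId) AS RowNum
    FROM dbo.NaturalKeyLog
)
SELECT LogId
INTO #RowsToDelete
FROM Ranked
WHERE RowNum > 1;

CREATE TABLE #DeletedBatch (LogId int NOT NULL PRIMARY KEY);

-- Remove a static 10,000 duplicates per iteration.
WHILE EXISTS (SELECT 1 FROM #RowsToDelete)
BEGIN
    DELETE TOP (10000) l
    OUTPUT DELETED.LogId INTO #DeletedBatch
    FROM dbo.NaturalKeyLog AS l
    INNER JOIN #RowsToDelete AS d
        ON d.LogId = l.LogId;

    -- Clear the handled keys so the loop can finish.
    DELETE d
    FROM #RowsToDelete AS d
    INNER JOIN #DeletedBatch AS b
        ON b.LogId = d.LogId;

    TRUNCATE TABLE #DeletedBatch;
END;
```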

This performed slightly better than the previous method, but not enough to say it is definitively faster. The log growth was the same, but each loop took a little longer, around 305 ms. This is still fast enough that the system would remain responsive.

Delete Example 6 – Dedupe using a natural increment

The final example shows deduplicating without a temp table, using part of the natural key to loop. In this instance, the year column is used. The same concept works with a date column or any other dimension suitable for the iterations.
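A sketch of this approach, using the LogYear column from the natural key as the increment and the same placeholder table name, might look like the following.

```sql
-- Dedupe one year of data per iteration; no temp table required.
DECLARE @LogYear int = (SELECT MIN(LogYear) FROM dbo.NaturalKeyLog),
        @MaxYear int = (SELECT MAX(LogYear) FROM dbo.NaturalKeyLog);

WHILE @LogYear <= @MaxYear
BEGIN
    ;WITH Dupes AS
    (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY LogYear, LogDay, LogCategory, LocationCode
                   ORDER BY (SELECT NULL)) AS RowNum
        FROM dbo.NaturalKeyLog
        WHERE LogYear = @LogYear
    )
    DELETE FROM Dupes
    WHERE RowNum > 1;

    SET @LogYear += 1;
END;
```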

This version was the fastest overall strategy for the deduplication process, but it had two downsides. The log grew more than the other friendly deletes and the time per iteration was also longer. The log file grew to almost 6 GB and the iterations averaged 4.5 seconds. If a smaller dimension or portion of the natural key were used, those two items would be mitigated. If this was used in a warehouse, the iteration time would likely be fast enough.

Summary and Comparison of Delete Strategies

| Test Name | Total Time (seconds) | Starting Data (MB) | Ending Data (MB) | Starting Log (MB) | Ending Log (MB) | Starting Row Count | Ending Row Count | Loop Duration (seconds) |
|---|---|---|---|---|---|---|---|---|
| Baseline – single transaction | 822 | 26836.75 | 2238.0 | 100 | 115556 | 6000000 | 500000 | 822 |
| 10K Loop | 130 | 26836.75 | 2238.0 | 100 | 1060 | 6000000 | 500000 | 0.202 (202 ms) |
| Natural key – single transaction | 74 | 4176.5 | 4.125 | 100 | 22564 | 5000000 | 4380 | 74 |
| Natural key – partition | 181 | 4176.5 | 4.125 | 100 | 164 | 5000000 | 4380 | 0.150 (150 ms) |
| Natural key – loop | 170 | 4176.5 | 4.125 | 100 | 164 | 5000000 | 4380 | 0.305 (305 ms) |
| Natural key – increment | 62 | 4176.5 | 4.125 | 100 | 5988 | 5000000 | 4380 | 5.16 |

The above table shows representative data from each type of delete demonstrated, both friendly and conventional. Log growth is a clear issue with the larger deletes; exclusive locks leading to blocking are the other concern. Many different methods can be used to implement friendly deletes. Each just needs to partition the data into smaller sets, allowing faster commits and reduced log growth. I have found looping based on a set number of rows to work well, but deduplication based on a natural key works too.

Test and Schedule Friendly Deletes

If this is a log table or another table that gets repopulated on a regular basis, you will need to schedule your friendly delete. The method will depend on your environment and how the table is used in your processes.

ETL

If this is part of a regular ETL process, it is easy to add the friendly delete to your existing process. It just needs to be placed in a SQL task. The correct location in your ETL process will differ, but it will usually run before your import and update process, in place of your standard delete. Be sure the timeout is sufficient. Another option is to put the delete in a separate ETL package, especially if it could fail due to a timeout; your other processes should continue if this package fails. Even if you expect the package to fail occasionally, you still need to check your error logs to ensure the failure was due to a timeout and not a different error.

SQL Agent Jobs

If you are using on-prem servers, SQL Agent can be used to schedule the job. This is useful if the table is populated by an outside process, but it also works when the table is populated by your own system. It is a good fit for independent tables that aren’t part of an existing ETL process.

Elastic Jobs

If you are in an Azure environment, Elastic Jobs can be used. Elastic Jobs are still in preview (beta), but are a viable way to schedule your job. This is similar to SQL Agent Jobs and the same use cases apply.

Deployments

In a CI/CD environment, deployment processes can be used to run SQL scripts. Most maintenance scripts can be run this way. The biggest issue I have seen with this scenario is the timeout value for a batch. Some maintenance tasks, friendly deletes included, can run past the timeout limit. This is mitigated by the restart-able nature of friendly deletes. It may take several executions for the entire process to finish, but it will eventually get done.

One-Time Process

Your script can also be run as a one-time process. The default in SSMS is no timeout, which allows administrators to run your script and let it finish, no matter how long it takes. If you use this method, the administrators can also run scripts to shrink the log file and rebuild indexes after the deletes are done.

Additional Considerations

If you need to manage a delete in this fashion, there is likely a bigger issue that needs to be addressed. Other data systems may be better optimized for unstructured or semi-structured log data, and they can be cheaper to maintain for large amounts of data. There is a lot of convenience in keeping the data, or at least a subset of the log data, in your regular SQL database, but consider putting it in another system.

If you have to create a process to manage deletes, it’s a good idea to explore why the table is so large. There may be an error in the process, or excessive logging may be happening. This is a good opportunity to look at the root cause and try to address it there. It may not always be possible, but that is a better solution than creating a friendly delete process. It’s also a good time to ask if you really need that much logging data. At a certain point, it becomes difficult to parse. For most scenarios, there is a limited window of time when the logs are useful. If you need to maintain log data for a longer period, it may make sense to process the log data and create a smaller, aggregated version. The friendly delete process can then be used to clean up rows after they are aggregated into the new structure.

In the case of an extract, the source system might be able to fix the extract or remove duplicates before you ingest the file. This is more efficient in multiple ways. It uses fewer network resources and also simplifies your ETL process. And generally, the data owner will understand the data fully and be able to shape it correctly based on your needs. This isn’t always the case, but it is worth checking.

Another consideration with tables that are growing quickly, or are larger than you expect, is the identity column. If you use an identity (auto-increment) column for the key, remember that an integer has a maximum value of just over 2 billion. This sounds like a lot, but it can get exhausted in these scenarios, and if you have to fix it after the limit is reached, it becomes a much bigger problem. It’s easier to anticipate this issue than to fix it later. You can fix it by changing the datatype, for instance moving from an integer to a bigint; a decimal column can also be used. Another tactic is to reseed the integer column in the table. This requires shuffling data and isn’t as straightforward, so it wouldn’t be my first choice.

Refer to the SQL Server documentation for the size limits of int and bigint. An integer holds values up to 2^31-1, which translates to just over 2 billion rows when used as an identity column (assuming you don’t start in the negative range). A bigint holds values up to 2^63-1. If you look that up like I had to, it is roughly 9 quintillion rows, and you likely won’t run out of values before you have other issues with the system.

The following script shows all identity columns in the database and the remaining numbers left depending on the data type definition. It’s worth looking at these limits if you find an excessively large table.
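The article’s script isn’t included here; a sketch that produces similar information from sys.identity_columns is shown below. It reports the current identity value and the remaining headroom for the common integer types (decimal-based identity columns would show NULL in this version, and any unused negative range is ignored).

```sql
-- All identity columns in the current database with the values remaining
-- before the data type's upper limit is reached.
SELECT  OBJECT_SCHEMA_NAME(ic.object_id)          AS SchemaName,
        OBJECT_NAME(ic.object_id)                 AS TableName,
        ic.name                                   AS ColumnName,
        t.name                                    AS DataType,
        CONVERT(bigint, ISNULL(ic.last_value, 0)) AS LastValue,
        CASE t.name
            WHEN 'tinyint'  THEN 255
            WHEN 'smallint' THEN 32767
            WHEN 'int'      THEN 2147483647
            WHEN 'bigint'   THEN 9223372036854775807
        END                                       AS MaxValue,
        CASE t.name
            WHEN 'tinyint'  THEN 255 - CONVERT(bigint, ISNULL(ic.last_value, 0))
            WHEN 'smallint' THEN 32767 - CONVERT(bigint, ISNULL(ic.last_value, 0))
            WHEN 'int'      THEN 2147483647 - CONVERT(bigint, ISNULL(ic.last_value, 0))
            WHEN 'bigint'   THEN 9223372036854775807 - CONVERT(bigint, ISNULL(ic.last_value, 0))
        END                                       AS ValuesRemaining
FROM sys.identity_columns AS ic
INNER JOIN sys.types AS t
    ON t.user_type_id = ic.user_type_id
ORDER BY ValuesRemaining;
```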

Summary and Analysis

Large deletes can introduce issues that aren’t found in other set-based operations. Conventional query techniques may need to be set aside to allow the delete to complete faster, or at least to minimize the impact of the delete process. Deletes on very large tables can cause log file growth, block other queries, and take an extremely long time to roll back.

Breaking up large deletes and performing multiple, smaller deletes solves these problems. It minimizes log growth, reduces blocking, reduces rollback time, is often faster than a single, large delete, and is restart-able. It also improves the user experience during the delete process and reduces the chance of timeouts. It isn’t an intuitive pattern if you are used to creating set-based queries, but the extra programming overhead is worth the effort.

Finally, if you find a table that needs to use friendly deletes, it’s a good idea to validate the design pattern used. You might be able to move logging to an external system or eliminate the duplicates that are filling up the table rather than dealing with them after they become a problem. You can always use a friendly delete for cleanup, but preventative maintenance is better.

Each system is different, and if performance is crucial, the method you use will need to be tested for any issues and to be sure it meets performance needs. In ETL scenarios, friendly deletes aren’t always needed as long as the database isn’t in use at all hours of the day, but a large delete can still cause locking or log issues. You will need to weigh the costs and benefits of creating a friendly delete. Friendly deletes take thought and at least a small amount of development time, so it isn’t a free solution, but it can be a useful strategy.

References

https://learn.microsoft.com/en-us/sql/t-sql/data-types/int-bigint-smallint-and-tinyint-transact-sql?view=sql-server-ver16

https://learn.microsoft.com/en-us/azure/azure-sql/database/elastic-jobs-overview?view=azuresql