Replacing a Column in a Large Active SQL Server Table

In a previous post I showed how to use a batching strategy to remove large amounts of data from a table while it is being used. Today I will apply the same technique to another common problem- changing the datatype of a column. A common use of this is to normalize a text column into an integer (that references another table), but could be used to transition to and from any datatype . Many of the considerations in the previous post apply, so I advise you to read it as well before using this technique.

Setting up the Demo Database

Once again I will use the StackOverflow2013 database provided by Brent Ozar for the examples. I will change the schema to create a problem that I will later fix as the example. In my version of this database, the Posts table does not contain the column OwnerUserId but instead has a new column OwnerEmail. Now this would have only a slight performance impact if we are simply reading the value from the table, but things may get bad very quickly if we start using this in a WHERE or JOIN clause. This is a common SQL ‘Code Smell’ or anti-pattern. Fixing it on a busy system without taking a downtime can be challenging. Let’s see how to do it.

To follow along, run the following SQL script to add a new column, Email, to the Users table, and OwnerEmail to the Posts table:

USE StackOverflow2013;

ALTER TABLE dbo.Users ADD Email VARCHAR(100) NULL

ALTER TABLE dbo.Posts ADD OwnerEmail VARCHAR(100) NULL

UPDATE dbo.Users

SET Email =

CAST(Id as varchar(100)) + 'fakeemail@fakesample.com'

UPDATE p

SET OwnerEmail = u.Email

FROM dbo.Posts p

JOIN dbo.Users u

ON p.OwnerUserId = u.Id

ALTER TABLE dbo.Posts DROP COLUMN OwnerUserId

ALTER TABLE dbo.Users ALTER COLUMN Email VARCHAR(100) NOT NULL

ALTER TABLE dbo.Posts ALTER COLUMN OwnerEmail VARCHAR(100) NOT NULL

CREATE INDEX IX_Users_Email ON dbo.Users (Email)

The following two stored procedures are examples of stored procedures that the application would use to do typical user processing. As we change the structure of the table and the location of data the user needs, we will need to alter this code.

Note: If you are not using stored procedures, it will make the changes a bit more complicated. However, this method would still be achievable by altering the code being executed by the user interface, or perhaps even using triggers. The important thing is that you identify and address all of the code that uses the column being removed.

CREATE OR ALTER PROC GetUsersPosts

@UserId INT

SELECT p.Id

,p.Title

,u.DisplayName

FROM dbo.Posts p

JOIN dbo.Users u

ON u.Email = p.OwnerEmail

WHERE u.Id = @UserId

CREATE OR ALTER PROC InsertPost

@Body nvarchar(max) ,

@ParentId int =NULL,

@PostTypeId int ,

@Tags nvarchar(150)=NULL ,

@Title nvarchar(250) ,

@OwnerEmail varchar(100)

INSERT INTO [dbo].[Posts] (

[Body]

,[CreationDate]

,[ParentId]

,[PostTypeId]

,[Score]

,[Tags]

,[Title]

,[ViewCount]

,[OwnerEmail]

,[LastActivityDate]

)

VALUES (

@Body

,GETUTCDATE()

,@ParentId

,@PostTypeId

,@Tags

,@Title

,@OwnerEmail

,GETUTCDATE()

)

The Brute Force Method

First let’s look at the simplest way to perform this change, along with its downsides.

Note that you can execute this code if you want to see how it executes on your system. However, later code in the article will not execute without refreshing your copy of the database.

ALTER TABLE dbo.Posts ADD OwnerUserId INT NOT NULL

DEFAULT (-100)

UPDATE p

SET OwnerUserId = u.Id

FROM dbo.Posts p

JOIN dbo.Users u ON u.Email = p.OwnerEmail

WHERE p.OwnerUserId = -100

ALTER TABLE dbo.Posts DROP COLUMN OwnerEmail

While this code is completely functional, it is impractical for many workloads. Under the default READ COMMITTED isolation level this will block all access to the table, even SELECT statements. Using the READ_COMMITTED_SNAPSHOT option will allow SELECT statements but will block all data modifications. This took about three and a half minutes on my laptop. For larger tables it could even take hours and run the risk of filling up the transaction log. If that happens (or if any other errors occur that cause the process to fail) we are looking at a lengthy rollback as well. It is much safer and more pragmatic to update the table in batches.

How to Change your Table without Downtime

At a high level, this is how we will update the database and application:

One of the advantages of this technique is that it can be paused at any moment. If we run out of time while updating the data in the new column we can stop the process, wait until the next day and pick up where we left off. We can verify and test after each step and move on at our own pace, rather than rushing through the change during a downtime.

Adding a new column

To start this process, we will create a new column with the desired datatype. In our example we want this column to be NOT NULL so we will create a default with a placeholder value, in this case -100. It’s important that the placeholder doesn’t exist in the table that we will eventually join to with this column. Creating a column with a default in this way doesn’t modify the pages of the table, only the metadata, which makes this a fast and efficient change.

1 2	ALTER TABLE dbo.Posts ADD OwnerUserId INT NOT NULL DEFAULT (-100);

Be aware that this will require a schema stability lock so it might take a few seconds on a busy system. However it should not block for any significant length of time as it requires minimal IO. Using SQL Profiler, we can see it only took 10 milliseconds on my system:

After we have created the new column, we will need to modify any processes that write to the table to INSERT (and UPDATE) the table such that both the old and the new column are kept in sync.

Next, we need to alter the procedure for inserting a post. We do not need to modify the procedure to read a post, because the OwnerEmail column is still in the table. So now we will alter our example INSERT stored procedure:

CREATE OR ALTER PROC InsertPost

--The inputs stay the same

@Body nvarchar(max) ,

@ParentId int =NULL,

@PostTypeId int ,

@Tags nvarchar(150)=NULL ,

@Title nvarchar(250) ,

@OwnerEmail varchar(100)

-- Added in this revision

DECLARE @UserId INT

SELECT @UserId = Id

FROM dbo.Users

WHERE @OwnerEmail = Email

INSERT INTO [dbo].[Posts] (

[Body]

,[CreationDate]

,[ParentId]

,[PostTypeId]

,[Score]

,[Tags]

,[Title]

,[ViewCount]

,[OwnerEmail]

,[LastActivityDate]

,[OwnerUserId] -- Added in this revision

)

VALUES (

@Body

,GETUTCDATE()

,@ParentId

,@PostTypeId

,@Tags

,@Title

,@OwnerEmail

,GETUTCDATE()

,@UserId -- Added in this revision

)

At this point, any new records will include proper data for the old column, Email, as well as our new column, OwnerUserId.

Moving the Data

Now we will use our batching technique to UPDATE our new column. As we did in the deletion example, we will first load up a temp table with the records we want to modify, filtering on our placeholder value in the OwnerUserId column. One advantage of doing this is that if the process is stopped for any reason, it can be resumed later without re-processing records that were updated the first time (by simply running the script again after fixing the error). With each batch we will join to the Users table to retrieve the OwnerUserId.

Let’s see the script:

SET NOCOUNT ON

--control the number of rows modified per iteration

DECLARE @BatchSize INT = 5000

--variable used to tell the process to stop

DECLARE @Stop INT = 0

IF (OBJECT_ID ('tempdb..#ToProcess') IS NULL)

CREATE TABLE #ToProcess (Id INT NOT NULL PRIMARY KEY CLUSTERED)

IF (OBJECT_ID ('tempdb..#Batch') IS NULL)

CREATE TABLE #Batch (Id INT NOT NULL PRIMARY KEY CLUSTERED)

-----------------Gather Ids------------------------------------

INSERT INTO #ToProcess (Id)

SELECT Id

FROM dbo.Posts

WHERE OwnerUserId = -100

-----------------Main Loop------------------------------------

WHILE (@Stop = 0)

BEGIN

--Load up our batch table while deleting from the main set

DELETE TOP (@BatchSize) #ToProcess

OUTPUT DELETED.Id INTO #Batch (Id)

--Once the rowcount is less than the batchsize,

-- we can stop (after this loop iteration)

IF @@ROWCOUNT < @BatchSize

SELECT @Stop = 1

--Perform the UPDATE

UPDATE p

SET OwnerUserId = u.Id

FROM dbo.Posts p

JOIN #Batch b ON p.Id = b.Id

JOIN dbo.Users u ON u.Email = p.OwnerEmail

--Clear out Batch table

TRUNCATE TABLE #Batch

END

When testing this technique before using in production (and you should test it!), look at the execution plan used by the UPDATE statement. You want it to look something like this:

Notice the thin bars connecting the operators and the Index Seek on both the Posts and the Users table. If you see thick bars, parallelism or a Scan operator, consider creating an index on the table you are joining too (in this case the Users table). If we had not created the IX_Users_Email index this would have been a much more expensive and time-consuming operation. Instead, it only took about five and a half minutes on my machine. While this is longer overall than the Brute force method, each iteration of the loop took under 100 milliseconds so any blocking would be minimal. You can see this by using SQL Profiler with the SQL:StmtCompleted event:

If I were doing this on an actual server rather than my laptop, I would likely increase the batch size to speed up the overall process (but not so large that the duration of each loop becomes an issue).

Finalizing the move

Once we have updated the entirety of the new column, we can change our processes once again. First, we will change any reads to reference the new column, and then we will remove any reference to the old column from any processes that write to the table:

CREATE OR ALTER PROC GetUsersPosts

@UserId INT

SELECT p.Id

,p.Title

,u.DisplayName

FROM dbo.Posts p

JOIN dbo.Users u

ON u.Id = p.OwnerUserId

WHERE u.Id = @UserId

CREATE OR ALTER PROC InsertPost

@Body nvarchar(max) ,

@ParentId int =NULL,

@PostTypeId int ,

@Tags nvarchar(150)=NULL ,

@Title nvarchar(250) ,

@OwnerEmail varchar(100)

-- Added

DECLARE @UserId INT

SELECT @UserId = Id

FROM dbo.Users

WHERE @OwnerEmail = Email

INSERT INTO [dbo].[Posts] (

[Body]

,[CreationDate]

,[ParentId]

,[PostTypeId]

,[Score]

,[Tags]

,[Title]

,[ViewCount]

--,[OwnerEmail] --Removed in this revision

,[LastActivityDate]

,[OwnerUserId]

)

VALUES (

@Body

,GETUTCDATE()

,@ParentId

,@PostTypeId

,@Tags

,@Title

--,@OwnerEmail --Removed in this revision

,GETUTCDATE()

,@UserId

)

Once this is done, and once any QA checks have been completed, we can drop the old column:

1	ALTER TABLE dbo.Posts DROP COLUMN OwnerEmail

This only took 15 milliseconds on my laptop, but again this is something you need to test in your environment, especially on very high activity servers for data that is transactional in nature. Better if your simple change doesn’t lose someone their purchase.

For very large tables on busy systems, dropping the column may be problematic. We might be better off with a variation of this technique that inserts rows into a new table entirely. In this case our UPDATE and INSERT statements have to be performed on both tables within a transaction while we are in the intermediate step.

Further Considerations

It’s always best to run through this process (likely a few times) on a non-production server with a copy of your production data. You want the update to be fast enough to be imperceptible to other processes, so understanding how long each batch takes to run and the impact it will have on your production system is very important. Selecting a batch size that allows this process to complete quickly and adding any necessary indexes are critical steps during the planning phase to minimize any impact.

It is also crucial to check your transaction log size to ensure you have enough space to hold these modifications. Consider increasing the frequency of your log backups while running the update and double check that you have alerting on the growth of the log and availability of space on the disk it is placed on.

Summary

Using a varchar column to reference another table can be a huge performance killer and stands in the way of a well-normalized database, but changing the schema of a table in use is a challenging problem. A lengthy downtime might be the quickest solution to come to mind, but this technique will allow you to do the work in smaller bites without taking a downtime, all while eliminating the risk and stress of a long rollback process in the case of an error or a full transaction log.

Register for Simple Talk

Replacing a Column in a Large Active SQL Server Table

Setting up the Demo Database

The Brute Force Method