What’s the Point of Using VARCHAR(n) Anymore?

The arrival of the (MAX) data types in SQL Server 2005 were one of the most popular feature for the database developer. At the time, there was a lot of discussion as to whether this freedom from having to specify string length came at a cost. Rob attempts to give a final answer as to any down-side.

When you are storing large strings, or any other data types that you’d want to store as VARCHAR or VARBINARY, there are factors that affect how and where that data is stored inside SQL Server. This in turn will affect the performance of inserts and updates.

Basics: Allocation Unit Types, Row and Page Size Limits

SQL Server’s unit of physical storage is a page. Each page has a fixed size of 8K bytes (ref). Where possible, data is stored in rows within these ‘pages’. A row cannot overlap a page boundary. If the space allocated to a datatype is fixed, or if the variable sized data is small enough to allow the row to fit, the base allocation unit type, “IN_ROW_DATA”, is used. However, there are two ways of storing data that would otherwise overflow a page boundary.

  1. When a single row grows too large to be stored in-row, data can be offloaded to “ROW_OVERFLOW_DATA” pages.
  2. When a single column stores more than 8,000 bytes, or if the developer chooses to force this behavior, data is stored in “LOB_DATA” pages.

This becomes much clearer through examples.

Exercising Row and Page Size Limits

Single String Column with VARCHAR(n) and VARCHAR (MAX)

8,000 Characters

First, let’s build two tables and exercise some long strings to see how SQL Server handles both normal and large strings. Here, we will be exercising row size limits.

After each insert, we will use versions of the following query to see into which type of page these records were loaded (see MSDN for details on the DMV):

The query results show that only “IN_ROW_DATA” pages were written for both tables, and the pages are 98.97% full.

TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
OneColVarcharMax IN_ROW_DATA 1 98.97 1 8011 8011 0
OneColVarcharN IN_ROW_DATA 1 98.97 1 8011 8011 0

8,001 Characters

Clearly we can’t insert 8,001 characters into an 8,000-character column, but we can in the VARCHAR(MAX) column.

The query results here show that pages were written to both “IN_ROW_DATA” and “LOB_DATA” pages.

TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
OneColVarcharMax IN_ROW_DATA 1 0.43 1 35 35 0
OneColVarcharMax LOB_DATA 1 99.02 1 8015 8015 NULL

Note that the “in row” record is very small. In this case, the only thing stored “in row” is the 24-byte pointer. The full 8,001-character string is moved to the LOB page.

So SQL Server is storing normal VARCHAR(n) columns and VARCHAR(MAX) columns “in row” by default. When VARCHAR(MAX) exceeds 8,000 characters, the pointer is stored “in row”, and the string is stored in “LOB” pages.

Two String Columns with VARCHAR (n) and VARCHAR(MAX)

Let’s try this again but with two columns. This time, we will exercise page size limits.

8,000 Characters

As expected, two 4,000-character strings fit fine in both tables.

The query results show that only “IN_ROW_DATA” pages are written for each table, and the pages are 99.00% full.

8,060 Characters

The maximum number of bytes per page is 8,060. Let’s see how this works.

It actually doesn’t require the full 8,060 characters to fill that page. 8,047 characters get written to just “in row” pages, but at 8,048 characters, it writes to both “in row” pages and either “row overflow” pages (VARCHAR(n)) or “LOB” pages (VARCHAR(MAX)).

Results with 8,047 Characters (4,023 and 4,024 characters)
TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
TwoColVarcharMax IN_ROW_DATA 1 99.58 1 8060 8060 0
TwoColVarcharN IN_ROW_DATA 1 99.58 1 8060 8060 0
Results with 8,048 Characters (4,024 and 4,024 characters)
TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
TwoColVarcharMax IN_ROW_DATA 1 50.17 1 4061 4061 0
TwoColVarcharMax LOB_DATA 1 49.89 1 4038 4038 NULL
TwoColVarcharN IN_ROW_DATA 1 50.17 1 4061 4061 0
TwoColVarcharN LOB_DATA 1 49.98 1 4038 4038 NULL

Notice that at 8,047 characters, the record size is exactly 8,060 bytes. What is in that 23 bytes? “Each data page contains a page header which stores page meta-data, such as database file identifier, current page number, previous and next page numbers, number of free bytes per page and so forth.” (ref)

200,000 Characters

For the VARCHAR(MAX) table, the results are basically the same even if the column width is pushed far past the 8,000-character limit. Here, we write two 100,000-character strings.

Results
TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
TwoColVarcharMax IN_ROW_DATA 1 0.75 1 61 61 0
TwoColVarcharMax LOB_DATA 27 91.89 28 228 8054 NULL

Here, we have a single page storing two pointers and 27 pages storing the large strings.

In-Row vs. Out-of-Row LOB Storage

For “large-value data types” like VARCHAR(MAX), SQL Server allows the data to be stored in-row (up to 8,000 bytes) or out-of-row (ref).

We will use two new tables, one in-row and one out-of-row:

To change a VARCHAR(MAX) column from the default (in row) to out-of-row, execute this statement:

1 Character

Writing a single-character string to each column in these tables creates an interesting result:

TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
OneColVarcharMaxIn IN_ROW_DATA 1 0.15 1 12 12 0
OneColVarcharMaxOut IN_ROW_DATA 1 0.33 1 27 27 0
OneColVarcharMaxOut LOB_DATA 1 1.04 1 84 84 NULL

The in-row table created just one page. The out-of-row page created one page for the pointer and one page for the out-of-row string. Notice the byte count difference: the in-row record takes 12 bytes, where the out-of-row record takes 111 bytes.

This illustrates two downsides of out-of-row storage when small records are written:

  1. Multiple pages are affected regardless of the size of the data being written.
  2. Extra storage space is required.

8,000 Characters

TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
OneColVarcharMaxIn IN_ROW_DATA 1 98.97 1 8011 8011 0
OneColVarcharMaxOut IN_ROW_DATA 1 0.33 1 27 27 0
OneColVarcharMaxOut LOB_DATA 2 50.02 2 84 8014 NULL

Notice that the out-of-row table writes two pages instead of just one (“PgCt” column). It actually takes only 65 characters in the out-of-row table to cause it to create the second page.

8,001 Characters

TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
OneColVarcharMaxIn IN_ROW_DATA 1 0.43 1 35 35 0
OneColVarcharMaxIn LOB_DATA 1 99.02 126 8015 8015 NULL
OneColVarcharMaxOut IN_ROW_DATA 1 0.33 1 27 27 0
OneColVarcharMaxOut LOB_DATA 2 98.43 127 84 8015 NULL

Once the byte count goes over 8,000, the in-row table writes to a LOB page.

1,000,000 Characters

TableName AllocUnitTp PgCt AvgPgSpcUsed RcdCt MinRcdSz MaxRcdSz FwdRcdCt
OneColVarcharMaxIn IN_ROW_DATA 1 0.43 1 35 35 0
OneColVarcharMaxIn LOB_DATA 126 98.42 126 2020 8054 NULL
OneColVarcharMaxOut IN_ROW_DATA 1 0.33 1 27 27 0
OneColVarcharMaxOut LOB_DATA 126 98.43 127 84 8054 NULL

With such large strings, the results are very similar between default (in-row) and out-of-row settings.

Performance

So how do these various choices affect performance? Let’s code some inserts and record results with different settings and values.

INSERT – 0 to 8,000 Characters

First, we’ll test the performance of a process that inserts strings which fit in VARCHAR(n).

The basic idea here is to insert 10,000 rows of data and record the timing. I ran the tests on my desktop. This is not a server-class machine but is a very high-powered desktop. All tests were executed on SQL Server 2008 R2 and Windows 7 Enterprise.

The results here show microseconds per insert for record lengths between 0 and 8,000 characters. The complete collection of test scripts is linked in the References section at the end of the article.

1432-image001small.png

See detailed results in Table 1.

The results here show that VARCHAR(n) is slightly faster than VARCHAR(MAX) (in-row) only for 0-character and 1,000-character strings. At 2,000 characters or above, VARCHAR(MAX) (in-row) is fastest. This result was not expected.

Intuitively, we expect that VARCHAR(n) should be faster than VARCHAR(MAX) (in-row), and in some cases it is. What I think this shows is that, if there is a performance difference, it is small enough that we can’t measure it consistently.

See detailed results in Table 1.

INSERT – Greater Than 8,000 Characters

These tests are limited to VARCHAR(MAX) but compare in-row to out-of-row.

The results here show microseconds per insert for records lengths between 9,000 and 30,000 characters.

1432-image002small.png

See detailed results in Table 2.

UPDATE – Column Limit

When a VARCHAR(MAX) in-row column goes over the 8,000-character limit, the data is moved to a LOB_DATA page. The linked script Allocation Units – UPDATE.sql tests this 8,000-, then 8,001-, then 8,000-character data shift. This test illustrates the move of data from an IN_ROW_DATA page to a LOB_DATA page and then back for the in-row table.

These operations can be expensive, but how expensive? Make a quick mental prediction before reading on.

The linked script PerfTest – UPDATE.sql tests both the 8,000-to-8,001 scenario and the 8,001-to-8,000 scenario. The results were actually quite surprising. I kept checking the code to see if I had done something wrong because the results didn’t match my expectations at all.

UPDATE Test Results

In/Out

From

To

 Elapsed Mcs

Penalty vs

Out-of-Row

In-Row

8,000

8,001

34,238,839

82%

Out-of-row

8,000

8,001

18,804,768

In-Row

8,001

8,000

9,969,989

-47%

Out-of-row

8,001

8,000

18,804,768

As expected, the 8,000-to-8,001 update takes longer for in-row than for out-of-row. The surprise is that the 8,001-to-8,000 update is faster for in-row. Based on these results, there must be a shortcut employed inside SQL Server that optimizes this move of data from a LOB_DATA page to an IN_ROW_DATA page.

UNICODE

All of these tests were conducted using non-UNICODE strings. Without compression, all the byte counts have to be doubled for UNICODE strings.

Recommendations

Don’t Use VARCHAR(MAX) Everywhere

One design-simplification technique that has been suggested is to use VARCHAR(MAX) for every string column. Though the tests here show that using VARCHAR(MAX) (in-row) instead of VARCHAR(n) is not a performance problem for strings under 8,000 characters, there are a number of reasons to limit the length of strings in your database.

UI Issues

Anyone with experience creating an application UI knows that handling long strings is difficult. When laying out the UI for an application, it is important to know the expected and maximum length of strings that will be displayed. Leaving this open to whatever someone writes to the database makes designing, building, and testing an application very difficult.

Performance – Writing or Updating Long Strings

From the performance results, you can see that writing long strings definitely affects write delays. While writing a record of any size (even zero characters) takes time, writing longer strings takes more time. As an example, using VARCHAR(n), writing 1,000 characters takes an average of 217 microseconds while writing 8,000 characters takes an average of 448 microseconds.

Compression

What I believe is the most compelling reason to avoid over-8,000-character strings is compression. LOB data never gets compressed.

“When considering data compression one very key thing to remember is that out of row data (LOB data) isn’t compressed. If the LOB data is stored in row, then it will be compressed, but only when you compress the table with PAGE level compression. If however you use ROW level compression then the LOB data will never be compressed, no matter if it is stored in row or if it is stored out of row.” (ref)

Conclusions

  • Where appropriate, use VARCHAR(n) over VARCHAR(MAX)
    • for reasons of good design if not performance benefits, and
    • because VARCHAR(MAX) data does not compress
  • Storing large strings takes longer than storing small strings.
  • Updating an in-row VARCHAR(MAX) value from below 8,000 to over 8,000 will be relatively slow, but the difference for a single transaction will likely not be measurable.
  • Updating an in-row VARCHAR(MAX) value from over 8,000 to below 8,000 will be faster than if the table is set to store data out-of-row.
  • Using the out-of-row option for VARCHAR(MAX) will cause slower writes until the strings are very long.

All the scripts used here are attached, so you are encouraged to retest the results.

References

Appendix

Table 1

TableName

MaxRecords

StrLen

ElapsedMcs

VarcharMaxIn

10,000

0

1,341,626

VarcharMaxOut

10,000

0

1,950,038

VarcharN

10,000

0

1,216,823

VarcharMaxIn

10,000

1,000

2,246,443

VarcharMaxOut

10,000

1,000

2,511,649

VarcharN

10,000

1,000

2,168,441

VarcharMaxIn

10,000

2,000

2,308,844

VarcharMaxOut

10,000

2,000

3,104,460

VarcharN

10,000

2,000

2,823,654

VarcharMaxIn

10,000

3,000

2,667,651

VarcharMaxOut

10,000

3,000

3,057,659

VarcharN

10,000

3,000

2,698,852

VarcharMaxIn

10,000

4,000

2,839,255

VarcharMaxOut

10,000

4,000

3,307,264

VarcharN

10,000

4,000

3,010,858

VarcharMaxIn

10,000

5,000

3,229,262

VarcharMaxOut

10,000

5,000

3,525,668

VarcharN

10,000

5,000

3,603,669

VarcharMaxIn

10,000

6,000

3,416,466

VarcharMaxOut

10,000

6,000

4,056,078

VarcharN

10,000

6,000

3,712,871

VarcharMaxIn

10,000

7,000

3,728,472

VarcharMaxOut

10,000

7,000

4,602,088

VarcharN

10,000

7,000

4,056,078

VarcharMaxIn

10,000

8,000

4,321,284

VarcharMaxOut

10,000

8,000

4,945,295

VarcharN

10,000

8,000

4,477,286

Table 2

TableName

MaxRecords

StrLen

ElapsedMcs

VarcharMaxIn

10,000

9,000

5,670,605

VarcharMaxOut

10,000

9,000

6,326,708

VarcharMaxIn

10,000

10,000

5,295,689

VarcharMaxOut

10,000

10,000

5,280,067

VarcharMaxIn

10,000

11,000

5,717,469

VarcharMaxOut

10,000

11,000

5,123,852

VarcharMaxIn

10,000

12,000

6,014,277

VarcharMaxOut

10,000

12,000

5,623,740

VarcharMaxIn

10,000

13,000

6,514,166

VarcharMaxOut

10,000

13,000

6,686,002

VarcharMaxIn

10,000

14,000

6,576,651

VarcharMaxOut

10,000

14,000

6,873,460

VarcharMaxIn

10,000

15,000

6,764,110

VarcharMaxOut

10,000

15,000

6,748,488

VarcharMaxIn

10,000

16,000

6,639,137

VarcharMaxOut

10,000

16,000

7,154,647

VarcharMaxIn

10,000

17,000

7,279,619

VarcharMaxOut

10,000

17,000

7,420,213

VarcharMaxIn

10,000

18,000

7,435,834

VarcharMaxOut

10,000

18,000

7,089,147

VarcharMaxIn

10,000

19,000

7,804,550

VarcharMaxOut

10,000

19,000

7,070,923

VarcharMaxIn

10,000

20,000

7,336,277

VarcharMaxOut

10,000

20,000

8,023,077

VarcharMaxIn

10,000

21,000

8,538,178

VarcharMaxOut

10,000

21,000

7,976,250

VarcharMaxIn

10,000

22,000

7,679,677

VarcharMaxOut

10,000

22,000

8,007,468

VarcharMaxIn

10,000

23,000

7,617,241

VarcharMaxOut

10,000

23,000

7,788,941

VarcharMaxIn

10,000

24,000

7,835,768

VarcharMaxOut

10,000

24,000

8,194,778

VarcharMaxIn

10,000

25,000

8,475,741

VarcharMaxOut

10,000

25,000

8,506,960

VarcharMaxIn

10,000

26,000

8,819,141

VarcharMaxOut

10,000

26,000

8,382,087

VarcharMaxIn

10,000

27,000

9,084,496

VarcharMaxOut

10,000

27,000

8,881,578

VarcharMaxIn

10,000

28,000

8,975,233

VarcharMaxOut

10,000

28,000

9,053,278

VarcharMaxIn

10,000

29,000

9,505,942

VarcharMaxOut

10,000

29,000

8,944,014

VarcharMaxIn

10,000

30,000

9,100,105

VarcharMaxOut

10,000

30,000

9,022,060