Lipoaspiration in your SQL Server Database

Once, when disk space was at a premium, DBAs fought hard to keep the size of their database down. Now there seems less motivation to 'fight the flab' of a database. Fabiano Amorim was watching television recently when the subject matter, cosmetic surgery, gave him the theme and inspiration for this guide to keeping your database fit and trim.

Introduction

Don’t panic! No, I’m not ceasing my epic series about ‘ShowPlan’ operators and the Query Optimizer. I just felt that it was, perhaps, time to take a short break in order to write about something different. Some time ago, I presented a webcast on the topic of reducing the size of a database just by slimming down the parts that weren’t necessary; so I hoped that it might interest you if I wrote about it.

My fellow countryman Dr. Robert Rey, well-known in the States for his television show Dr. 90210, is my inspiration for writing this article. He knows how to tune the human body by using plastic surgery, and I want to show you how to tune your database to reduce its unnecessary bulk!

The range of SQL Server’s functionalities grows with each release, and it’s increasingly difficult for DBAs and Developers to keep up with all these changes, so it is not surprising to see DBAs and Developers following bad practices that make their database unnecessarily slow and inefficient.


I’ll show you some techniques and features with the goal of quickly reducing the size of your database to a new, slimmer, figure. Some of these techniques can be used safely, whilst others should be used only with caution. These techniques don’t aim to release free space by archiving older data, regularly rebuilding indexes, or keeping log files small by backing them up and truncating them frequently. We won’t even cover techniques such as database design normalization. We’re talking about quick reductions by cutting the “fat”.

You may be wondering “Why do I have to do this?” The answer is that a smaller database uses memory more efficiently, takes fewer locks, generates less network traffic, performs fewer I/Os and uses fewer CPU cycles: all of this translates to one point, “Faster Queries”.

Following is a list of items you should consider.

  • Clustered Indexes
    • Not being Unique
    • Badly chosen
    • Too wide
    • Variable width
    • Nullable
  • Non-Clustered Indexes
    • Being duplicated
    • Not being used
    • Including columns in leaf levels (INCLUDE clause)
  • Using Filtered Indexes
  • FillFactor and Pad_Index
  • Using Sparse Columns
  • Choosing appropriate and compact data types, for example
    • DateTime vs SmallDateTime vs Date or Time
    • BigInt vs Int vs SmallInt vs TinyInt
    • NChar vs NVarChar vs VarChar
  • Using VarDecimal storage format
  • Using Page and Row Compression

All of these items are very important; some of them are available only starting with SQL Server 2008 and some are available only in Enterprise Edition. So let’s take a look at them, one by one, to understand their benefits and how to use them properly.

Clustered Index

By default, when you define the primary key of your table, a clustered index is automatically created. Clustered indexes maintain the data at the leaf level of the index, in sorted order based on their key values.

By default your primary key is created as a unique clustered index, so choose carefully which columns will belong to your clustered key (the primary key columns), because the values of these columns are replicated into every non-clustered index on the table. This is necessary because, when SQL Server reads data from a non-clustered index, it may need to look up the clustered index to read the columns that are not in the non-clustered index; the clustered key columns are what it uses to perform that lookup. That means that if the primary key is composed of four columns, all four columns are replicated and maintained in every non-clustered index.

When you create a clustered index that is not unique, its key values can be duplicated, so the engine automatically adds a 4-byte uniqueifier column to the table. These extra 4 bytes don’t just waste space; they also have to be computed. When a new row with a duplicate key is inserted, SQL Server has to scan the existing duplicates to work out what the next uniqueifier value will be.

You should avoid that, because it means extra work for SQL Server and extra space to store the data. Again, extra work and extra space mean slow applications.
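To make this concrete, here is a minimal sketch (my own illustration, with invented table and column names) contrasting a wide, non-unique clustered key with a slim, unique one:

-- Wide, non-unique clustered key: every non-clustered index on this table
-- must carry all four key columns, plus a 4-byte uniqueifier because the
-- key is not unique.
CREATE TABLE dbo.OrderItems_Wide
(
    CustomerName varchar(100) NOT NULL,
    OrderDate    datetime     NOT NULL,
    ProductCode  varchar(50)  NOT NULL,
    Quantity     int          NOT NULL
);
CREATE CLUSTERED INDEX ix_OrderItems_Wide
    ON dbo.OrderItems_Wide (CustomerName, OrderDate, ProductCode, Quantity);

-- Slim, unique clustered key: non-clustered indexes only need to carry
-- this single 4-byte column to perform their lookups.
CREATE TABLE dbo.OrderItems_Slim
(
    OrderItemId  int IDENTITY(1,1) NOT NULL,
    CustomerName varchar(100) NOT NULL,
    OrderDate    datetime     NOT NULL,
    ProductCode  varchar(50)  NOT NULL,
    Quantity     int          NOT NULL,
    CONSTRAINT PK_OrderItems_Slim PRIMARY KEY CLUSTERED (OrderItemId)
);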

Non-Clustered Indexes

One of the most common sources of wasted space in databases is duplicate or unused non-clustered indexes.

DBAs! Don’t allow developers to create tons of indexes just because they heard that indexes are good for performance.

Developers! Listen to me, because I’m a developer too. I’ll tell you this with love: you know the query and you know the business logic, so don’t create an index just for one query without first checking whether it will be used by any other query, and then making sure that an equivalent index does not exist already.

Another mistake is to allow data modeling software to create all indexes based on what it thinks is a good indexing strategy. Look at the scripts generated by the software before running them. Some of the indexes may not be useful after all; by not creating them, you are more likely to keep your database slim.
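The article doesn’t include a script for this, but as a rough sketch you can join sys.indexes to sys.dm_db_index_usage_stats to spot non-clustered indexes that have never been used for seeks, scans or lookups since the instance last started (so make sure the server has been up long enough for the numbers to be representative):

SELECT  OBJECT_NAME(i.object_id) AS table_name,
        i.name                   AS index_name,
        us.user_updates          AS updates_paid_to_maintain_it
FROM    sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
       ON us.object_id   = i.object_id
      AND us.index_id    = i.index_id
      AND us.database_id = DB_ID()
WHERE   i.type_desc = 'NONCLUSTERED'
  AND   OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
  AND   ISNULL(us.user_seeks, 0) + ISNULL(us.user_scans, 0)
      + ISNULL(us.user_lookups, 0) = 0
ORDER BY us.user_updates DESC;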

Since SQL Server 2005, we have had a feature to include columns at the leaf level of the index b-tree. If you created an index with additional columns in the key just to cover a query, you should check whether all of those columns are really needed as part of the key. If not, you can re-create the index specifying the non-key columns in the INCLUDE clause. The index will continue to cover the query, but it will be slimmer, because the values of the non-key columns will not be stored in the upper levels of the b-tree.
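As an illustration (the index, table and column names here are invented, not from the article), the second statement rebuilds a “covering” index so that only the columns used for seeking stay in the key, while the remaining columns move to the leaf level:

-- Original covering index: every referenced column is part of the key,
-- so the wide key is repeated in the intermediate levels of the b-tree.
CREATE NONCLUSTERED INDEX ix_Orders_Covering
    ON dbo.Orders (CustomerId, OrderDate, TotalDue, Status);
GO
-- Slimmer version: the query is still covered, but TotalDue and Status
-- are stored only at the leaf level.
CREATE NONCLUSTERED INDEX ix_Orders_Covering
    ON dbo.Orders (CustomerId, OrderDate)
    INCLUDE (TotalDue, Status)
    WITH (DROP_EXISTING = ON);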

Filtered Index

A filtered index is an index for which you specify a predicate that filters the rows of the table before adding values to the index itself. Well-designed filtered indexes can reduce the storage cost compared to indexes that contain all the values of the table they refer to.

It is quite an easy feature to use and, if it is used together with sparse columns, it can significantly reduce the storage space for NULL “values”. The following script shows how to create an index on a table called Orders that includes only the orders made after the year 2009.

Script 1:
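(The original script is not reproduced here; the following is a minimal sketch assuming an Orders table with an OrderDate column.)

-- Filtered index containing only the orders made after the year 2009.
CREATE NONCLUSTERED INDEX ix_Orders_After2009
    ON dbo.Orders (OrderDate)
    WHERE OrderDate >= '20100101';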

Not only will you get benefits in terms of space, you may also get better distribution statistics. Filtered indexes (along with filtered statistics) can in some cases capture the distribution of values better, potentially leading to more efficient execution plans.

FillFactor, Pad_Index

Don’t just assume that the fill-factor can be left at its default setting. The fill-factor value determines the percentage of space on each leaf-level page to be filled with data, reserving the remainder on each page as free space for future growth.

Don’t confuse fill-factor with free space. If you specify 30, it doesn’t mean you will have 30% of free space to store future data; it means the system will try to keep 70% of each page empty and only 30% filled with data. When you enable the PAD_INDEX option, the percentage of space reserved in the leaf pages is also applied to the intermediate-level pages.

In an ideal world, you would have time to look, table by table and index by index, to estimate the best fill-factor value for each case. The following command is used to rebuild an index with a specific fill-factor.

Script 2:
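(Again, the original script isn’t reproduced here; this is a sketch with an assumed index and table name.)

-- Rebuild the index leaving leaf pages 80% full; PAD_INDEX applies the
-- same fill-factor to the intermediate-level pages as well.
ALTER INDEX ix_Orders_After2009
    ON dbo.Orders
    REBUILD WITH (FILLFACTOR = 80, PAD_INDEX = ON);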

If I’m forced to give you a general recommendation, it would be never to trust a general recommendation. That’s why I won’t recommend a fixed value for a frequently updated table. What I can tell you is that, if you have a table that is never updated, you should leave the fill-factor at 0 or 100 (remember, 0 and 100 mean the same thing, and they don’t leave any space for future updates).

You should analyze your system to find out when page splits are occurring and set an appropriate fill-factor for each index. A very nice approach is to use Extended Events to monitor the page splits. My friend and MVP Eladio Rincon blogged about it, and the next SQL Server version has some very useful changes in this area; you can see it here.
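As a rough sketch (this is my illustration, not the script from the blog post mentioned above), an Extended Events session on SQL Server 2008 or later can capture the page_split event; note that this event fires for every split, including the harmless end-of-page splits caused by ordinary sequential inserts, so treat the counts only as a starting point:

CREATE EVENT SESSION [TrackPageSplits] ON SERVER
ADD EVENT sqlserver.page_split
(
    ACTION (sqlserver.database_id, sqlserver.sql_text)
)
ADD TARGET package0.ring_buffer;
GO
ALTER EVENT SESSION [TrackPageSplits] ON SERVER STATE = START;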

One thing you can’t forget is that leaving the fill-factor uncontrolled means one of two problems: many page splits, or wasted storage.

Sparse Columns

Sparse columns were introduced in SQL Server 2008 and are available in all editions. Sparse columns were designed to optimize the storage of NULL values: when a column contains a lot of NULL “values”, SQL Server still reserves space for that column.

If you have a table with a large percentage of NULL occurrences, you should consider using sparse columns to save space.

For instance, suppose we have a table called customer with a column date_death. Hopefully, the proportion of non-null values in this column will be low, which makes it a good candidate to be a sparse column.

Be careful with your choice, because inserting a non-null value into a sparse column requires more space than it would in a non-sparse column. Look in Books Online for a complete list of the space usage for each data type. Here are the commands to add or drop the SPARSE attribute on a column:

Script 3:
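(A sketch using the customer/date_death example from above.)

-- Mark an existing column as sparse:
ALTER TABLE dbo.customer
    ALTER COLUMN date_death ADD SPARSE;

-- Remove the sparse attribute again:
ALTER TABLE dbo.customer
    ALTER COLUMN date_death DROP SPARSE;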

Note: Sparse columns are a storage optimization, not a new data type.

Data Types

This is another common mistake made by developers and DBAs. The most-used data types are Integer and DateTime, but quite often the data type chosen isn’t the best fit for the values actually being stored.

For instance, a date/time column that stores only the date part, not the time, could be a Date column; or, if you don’t have SQL Server 2008 and so can’t use the Date data type, you could use SmallDateTime. The same rule applies to BigInt columns that could be Int or SmallInt. A common case in point is a ‘month-of-the-year’ column using the Int data type, which only has twelve legitimate values and so could be a TinyInt.
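For example (the table and column names here are invented), shrinking such a month column is a one-line change:

-- A month-of-the-year column only ever holds values 1 to 12, so tinyint
-- (1 byte) is enough; as an int it was using 4 bytes per row.
ALTER TABLE dbo.Sales
    ALTER COLUMN sale_month tinyint NOT NULL;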

Following is a table of some common data types and how many bytes they use.

Data Type        Storage Size (bytes)
DateTime         8
SmallDateTime    4
Date             3
Time             5
BigInt           8
Integer          4
SmallInt         2
TinyInt          1
NChar(40)        80
Char(40)         40

VarDecimal

VarDecimal was a feature with a short life in the history of SQL Server, so I’m only mentioning it as a curiosity. In SQL Server 2005 Service Pack 2 (SP2) and later builds of 2005, decimal and numeric data types could be stored as variable-length columns by using the VarDecimal storage format.

In SQL Server 2008 it is already deprecated, so you should use it only at your own risk; if you are on SQL Server 2008, use page and row compression instead.

If used properly, it can make a huge difference to the space required for the Decimal and Numeric data types. You can use the procedure …

… to estimate the storage reduction for a table using VarDecimal.

Following is the code to estimate the reduction of a table:

Script 4:
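(The original script isn’t reproduced here; this sketch uses the documented VarDecimal procedures with an assumed database and table name.)

-- Estimate how much space the table would save with vardecimal storage.
EXEC sys.sp_estimated_rowsize_reduction_for_vardecimal 'dbo.Sales';

-- If the estimate looks worthwhile, enable the format for the database
-- and then for the table.
EXEC sp_db_vardecimal_storage_format 'MyDatabase', 'ON';
EXEC sp_tableoption 'dbo.Sales', 'vardecimal storage format', 1;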

In SQL Server 2008 it was replaced by page and row compression.

Page and Row Compression

This feature can make a huge difference to storage space in your database, but you can’t assume that it would work well for every database: it is one of those features you have to experiment with in order to check whether it works for your particular database and data.

There are two types of data compression in SQL Server 2008, page compression and row compression:

  • Row Compression: Uses a variable-length storage format for numeric types (for example integer, decimal and float) and for types that are based on numeric (for example datetime and money).
  • Page Compression: Uses row compression plus prefix and dictionary compression to compress the data on a SQL Server page.

Unfortunately this feature is Enterprise Edition only, but it has been used by many companies with great success. Here is a simple script showing the power of compression.

Script 5:
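(The original script isn’t included here, so this is a sketch of what it might have looked like; the column definitions are my assumptions, and the sizes you see will differ from the figures quoted below.)

CREATE TABLE dbo.T1
(
    Id       int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Col1     char(200) NOT NULL,
    Col2     char(200) NOT NULL,
    SomeDate datetime  NOT NULL
);
GO
-- Insert 100,000 rows of repetitive, highly compressible data.
INSERT INTO dbo.T1 (Col1, Col2, SomeDate)
SELECT TOP (100000)
       REPLICATE('a', 200), REPLICATE('b', 200), GETDATE()
FROM   sys.all_columns AS a
CROSS JOIN sys.all_columns AS b;
GO
EXEC sp_spaceused 'dbo.T1';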

Script 5 creates a table called T1 and inserts 100,000 rows into it; the table uses 49 MB of storage. Now let’s see what the storage size will be if we enable page compression on the table.

Script 6:
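(A sketch; this is the standard SQL Server 2008 data-compression syntax, applied to the T1 table created above.)

-- Optionally, estimate the savings first.
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'dbo',
     @object_name      = 'T1',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = 'PAGE';
GO
-- Rebuild the table with page compression and check the new size.
ALTER TABLE dbo.T1 REBUILD WITH (DATA_COMPRESSION = PAGE);
EXEC sp_spaceused 'dbo.T1';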

After you run Script 6, the storage size of the table drops from 49 MB to 12 MB.

Conclusion and Script

I realize that I’ve mentioned a lot of things without fully explaining how each one works. The main point is that you should consider using most of these items, but before you apply any of them in a production environment you should, of course, test them and learn about the advantages and disadvantages of each. It is important to repeat the warning: databases, servers and installations vary a great deal, and the only way of telling what works for your database is to try it out. Any compression comes at the cost of extra CPU, which could complicate the performance gain in your setup.

Finally, I’ve created a script so you can experiment and see how all of this put together can shrink a table from 1.5 GB to just 129 MB, using only native SQL Server features.