An entire post on choosing the correct data types either seems like overkill or much needed and overdue. The perspective might vary based on the databases you’ve worked with recently. I decided to write this after seeing some code with data type decisions that I would classify as questionable. There are many decisions in technology that can be ambiguous, but the correct data type should be based on business rules and a set of technical guidelines. I am going to share my thought process for deciphering the correct type here.
Selecting a data type is an important part of database and table design. Each column represents an actual business attribute, supports data integrity, or serves a performance purpose. Care should be taken when selecting the definition for each column. Choosing the wrong type impacts each of these areas, makes the system difficult to work with, and makes integrations harder than necessary.
This post is targeted at data types natively available in SQL Server, but the basic concepts apply to all data engines. The physical data model should be representative of the underlying attribute it is storing. The names and exact data types will differ for each engine, but the same general rules apply to each.
These rules also apply to existing databases, but the threshold for change is much higher. An existing system presumably has applications or reports built on that data, which makes it much more complicated than choosing the correct data type during the initial design phase. If you are refactoring code, it is useful to look at the data types. This is especially true if the current data types are obviously wrong and causing problems. In these cases, it is worth the effort to change the data types.
Metrics for choosing a data type
- Accurately represents the attribute
- Sized correctly
- Performance
- Consistent
- Clear intent of purpose
The first metric of a good data type is when the data type chosen is the best representation of the attribute. In an RDBMS, the tables shouldn’t look like an Excel spreadsheet or a sloppily imported CSV. A numeric column should have a numeric data type. A date should be in a date column. If you have a good understanding of the data, the decision process should be clear. If not, you’ll need to profile the data.
The second metric is that the data type allows for future growth, but is not too large. It is sized correctly. This is a metric for most data types. To understand if the column is sized correctly, you need an understanding of the business attribute. Sometimes this is easy and sometimes you have to use your best guess. But the analysis should happen and care should be taken when choosing column sizes. If an attribute will never be larger than 255, such as number of bathrooms in a house, it should be a tinyint instead of int or even a smallint.
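As a quick sketch of what correct sizing might look like (the table and column names here are hypothetical), a count that can never exceed 255 fits comfortably in a tinyint:

```sql
-- Hypothetical example: a bathroom count will never exceed 255,
-- so tinyint (1 byte) is a better fit than int (4 bytes).
CREATE TABLE dbo.House
(
    HouseId       int     NOT NULL IDENTITY(1,1) CONSTRAINT PK_House PRIMARY KEY,
    BathroomCount tinyint NOT NULL,   -- 0-255 covers any realistic value
    BedroomCount  tinyint NOT NULL
);
```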
The third metric is performance. This will generally follow from choosing the correct data type, but it may also influence your data type decisions. Smaller data storage results in faster I/O. Data types with a fixed storage space (int, decimal, date, etc.) don’t have flexibility in their data footprint. To reduce the space used, you need to choose a different type. It’s not wrong to choose a bigint if that potential storage space is needed. But if you only need space for an int, it’s wasted space, which means wasted performance.
The fourth metric to consider is consistency. Data types should match between tables for the same attribute (i.e., the UserName column should be defined identically in each table in which it resides). This is obvious, but it isn’t always implemented. When the same attribute is represented in multiple tables and columns, it is very confusing when consistency is lacking. This is enforced if you have foreign keys linking the tables. If they are denormalized report tables or not modeled ideally, these constraints won’t be in place and mismatched data types become possible. In addition to the confusion, there is a performance penalty if you join these tables via a query and the data types don’t match. You will see this in query plans as implicit conversions.
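A minimal sketch of the problem (table and column names are hypothetical): the same attribute defined as nvarchar in one table and varchar in another forces an implicit conversion when the tables are joined.

```sql
-- UserName is nvarchar in one table and varchar in the other.
CREATE TABLE dbo.UserProfile (UserName nvarchar(50) NOT NULL PRIMARY KEY);
CREATE TABLE dbo.UserLogin   (UserName varchar(50)  NOT NULL PRIMARY KEY);

-- The varchar side is implicitly widened to nvarchar; the query plan shows
-- CONVERT_IMPLICIT and the index on dbo.UserLogin may not be seekable.
SELECT p.UserName
FROM dbo.UserProfile AS p
JOIN dbo.UserLogin   AS l ON l.UserName = p.UserName;
```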
All of these metrics lead to intent of purpose. One of the primary reasons for carefully selecting your data type is to indicate the purpose of the column to anyone looking at that attribute. Everything you can do in your design to make interpreting objects (entities, attributes, etc.) easier is a good thing and almost always worth any extra effort put into the decision. Usually, correctly choosing your data type makes everything else easier too. I only say “usually” easier because I’m sure there are some edge cases, but they are the exceptions. Clear intent of purpose is a good indicator of correct column design. It isn’t a guarantee, but it certainly adds weight to your decision.
Rules for data type selection
Based on the metrics above, and good design practices, the following are the rules I use and recommend for selecting data types. There is room for some debate, and these reflect my personal preferences, but they are a good starting point. The main objective is to carefully consider your design decisions and understand why you are selecting a particular type. Consistency is also a primary objective. Sometimes an imperfect type is the better choice, if it is consistent with the rest of the database.
Approximate numerics
Only used for data that does not fit into decimal or other precise types
Float / real columns should only be used for numeric data that doesn’t fit into the decimal data type. As the data category in the Microsoft documentation states directly, these are approximate data types. They should not be used for exact numerics. The examples I like to use for this are things like number of meters to the sun, or the mass of earth measured in kilos – anything that would normally be represented using scientific notation. These are generally large, imprecise values. You wouldn’t use approximate numerics for something you can easily measure, such as the weight of cargo on a truck or the number of hectares in a corn field; these would be better represented using a decimal or integer.
Modern implementations of the floating-point standard make the output from the CPU consistent, but different data products handle the output and rounding of approximate numbers differently. This makes scenarios such as testing, comparing data between systems, and reporting more complicated. The imprecision can also be challenging when synchronizing data between systems. The potential for rounding differences and rounding errors makes implementation more complicated and is another reason I rarely use these types.
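A small illustration of the rounding behavior: adding 0.1 ten times with float does not land exactly on 1, while the same arithmetic with decimal does.

```sql
DECLARE @f float = 0, @d decimal(10, 2) = 0, @i int = 1;
WHILE @i <= 10
BEGIN
    SET @f = @f + 0.1;   -- accumulates tiny binary rounding errors
    SET @d = @d + 0.1;   -- exact
    SET @i = @i + 1;
END;
SELECT @f AS FloatTotal,    -- approximately 0.9999999999999999
       @d AS DecimalTotal,  -- 1.00
       IIF(@f = 1.0, 'equal', 'not equal') AS FloatComparison;  -- 'not equal'
```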
The main consideration I have seen for using approximate numerics is faster arithmetic operations. If fast arithmetic within the SQL Server engine is extremely important for your application, the tradeoffs may be worth it. Usually high-priority calculations will be done in an API layer, so in practice, I haven’t seen this as sufficient reason to use floating point data types except when dictated by the underlying attribute.
Character strings
Should be sized correctly and avoid fixed length strings if not needed
The primary consideration with character strings is determining the correct length. If I see a table with many varchar columns of the same length, I suspect a modeling problem or poor data profiling. There isn’t a direct size penalty for having a varchar field defined larger than needed, but it is suspicious. If the table looks like an Excel file (with all varchar(255) or similar columns), I really suspect that no analysis was done. When I see tables like this, I also expect to find other design issues.
The other big consideration for character strings is variable length strings stored in char columns. This is wasted space (and wasted performance), and can lead to space padded columns and subsequent cleanup for reports and system integrations. If the varchar column is sized correctly and anticipates all possible values, it is correct. The easiest way to know if it is sized correctly is if the size matches the source of truth, the data governance model, or the rules implemented in the GUI and API layers. These items should all match and be reflected in the data model.
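A quick way to see the padding: char always stores its declared length, while varchar stores only the characters supplied.

```sql
SELECT DATALENGTH(CAST('Ohio' AS char(20)))    AS CharBytes,     -- 20 (space padded)
       DATALENGTH(CAST('Ohio' AS varchar(20))) AS VarcharBytes;  -- 4
```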
Unicode character strings
Only used when necessary, driven by business need
Unicode data types (nchar / nvarchar) follow the same rules as the standard character strings. They should be sized correctly. These types allow expanded character sets to be stored correctly. As with character strings, if there are multiple columns with the same Unicode data type and length, it is a code smell.
The difference between Unicode strings and the ASCII character strings, other than the obvious ability to store Unicode strings, is the storage space required for each. Unicode strings are also called wide data types. As that name implies, they are bigger than the standard character strings, requiring double the space. That translates directly to performance (via rows per page). It can have an even bigger impact when used as part of the cluster.
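The doubling is easy to demonstrate with DATALENGTH:

```sql
SELECT DATALENGTH(CAST('Springfield'  AS varchar(20)))  AS VarcharBytes,   -- 11
       DATALENGTH(CAST(N'Springfield' AS nvarchar(20))) AS NvarcharBytes;  -- 22
```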
Unicode character strings should only be used when necessary. SQL Server performance depends heavily on limiting the amount of I/O required to return the data needed for a particular query. Each data fetch operation retrieves a page (8K) of data at a time. The more rows of data that can be packed into a page, the better that table will perform (at a very generalized, simple level). Reducing the size of your columns, and in turn increasing the number of rows on a page of data, is a free performance gain. Conversely, decreasing the number of rows per page for no reason simply gives that performance away.
I recently diagnosed this issue. The column design was a clear indicator that no thought had gone into the data modeling for a table. An import used nvarchar(255) for an attribute that would never be bigger than 20 characters wide and would never need Unicode support. A varchar(20) would be a reasonable choice for this column. Even worse, this attribute was represented in several other tables in the database using the correct type. This kind of sloppiness causes performance issues and confusion when doing system integrations. I was getting ready to suggest an RLS access predicate for this column that used varchar(20), which could have caused more problems.
My primary rule with Unicode data is that it should only be used when required. If your software’s audience and data only use a Latin character set, your default should be char / varchar. If Unicode is needed, that’s fine. But it shouldn’t be the default unless it is driven by an actual business need.
Exact numerics
Use the correct size and avoid for strings that look like numbers
Exact numerics are for exactly that purpose, exact numerical values. Acres in a field, grams of a product, salary, units sold, or anything else that is an exact number and will fit into one of the exact numeric data types. Refer to the documentation, and the size boundaries repeated below, for those limits. Most exact numeric values will fit into one of these categories. As with strings, exact numerics should be sized correctly. They should be the smallest size possible while allowing for future growth and be consistent within the enterprise.
There are many attributes that contain only numeric values, but should not be stored as a number. These character strings that only look like numbers should be stored as character strings. And since they only contain 0-9 (and maybe some formatting), there is no need for Unicode character strings. I generally consider it a code smell if these values are stored as exact numerics. Attributes of this type include:
- Zip codes
- Social security number / tax id
- FIPS code
- Phone numbers
- House number / Street number
Why is this a code smell and why aren’t they exact numerics? No math is performed on these attributes. If it doesn’t make sense to aggregate these numbers or perform math on them, it’s an indication they may not fit into an exact numeric. The second consideration is if the value could have a leading zero. If stored as an exact numeric, that leading zero would be lost. It requires conversion to a string and some calculations to put those zeroes back. Yes, it is easy to do those calculations using LEN and REPLICATE, but avoid this need. If the value requires character formatting (such as a Social Security Number or a 9-digit Zip code), store it as a character string. The last item is probably obvious, but I’ll mention it for the sake of completeness. If future values might need non-numeric values intermingled with the numeric values, it is definitely not an exact numeric.
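A sketch of the leading-zero problem (the ZIP code value is just an example): stored as an int, the leading zero is gone and has to be rebuilt with string functions; stored as a varchar, it is preserved.

```sql
DECLARE @ZipAsInt int = 02134;            -- the stored value becomes 2134
SELECT @ZipAsInt AS StoredAsInt,
       REPLICATE('0', 5 - LEN(@ZipAsInt)) + CAST(@ZipAsInt AS varchar(5)) AS Repaired,  -- '02134'
       CAST('02134' AS varchar(10)) AS StoredAsVarchar;                                 -- '02134'
```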
Bit
Using bit is a judgement call – stay consistent
Bit columns take a full byte to store, with the caveat that 8 bit columns will share that byte. That means if you have more than 1 bit column in a table, space is saved. If you have a single bit column, there is no space savings. Given the amount of space I see wasted in table design, this potential savings won’t sway my decision process. The bigger consideration for bit columns is that you can’t use some aggregation operations on the bit data type. If you want to perform a SUM, you need to first convert the column to tinyint, then perform the aggregation. I find myself doing these types of aggregations frequently for data profiling operations, such as null analysis. That alone is generally enough for me to choose tinyint over bit. The advantage of bit is that the intention of purpose is clear and allowed values are limited. Stay consistent in your design and understand the reason for choosing, or not choosing, bit as a data type.
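For example (the table and columns here are hypothetical), SUM rejects a bit column directly, so a conversion or CASE expression is needed for profiling queries:

```sql
-- dbo.Customer, IsActive, and Email are hypothetical.
SELECT SUM(CAST(IsActive AS tinyint))                 AS ActiveCount,       -- bit must be converted first
       SUM(CASE WHEN Email IS NULL THEN 1 ELSE 0 END) AS MissingEmailCount  -- typical null analysis
FROM dbo.Customer;
```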
Date and time
Use the correct size
Date and time data typing follows the same principles as the other types. Use the smallest type that fits the data. If the data doesn’t include the time, use a date column. If you only need a time column, use that rather than a datetime column. If your date and time don’t include seconds, use a smalldatetime. Datetime2 storage space differs based on the precision specified. Datetimeoffset should only be used when you will actually set and use the offset.
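A short sketch of right-sized date and time columns (the table and columns are hypothetical):

```sql
CREATE TABLE dbo.Appointment
(
    AppointmentId   int          NOT NULL IDENTITY(1,1) CONSTRAINT PK_Appointment PRIMARY KEY,
    AppointmentDate date         NOT NULL,  -- 3 bytes, no time component needed
    CheckInTime     time(0)      NOT NULL,  -- whole seconds are enough
    CreatedAt       datetime2(0) NOT NULL   -- 6 bytes, no fractional seconds required
        CONSTRAINT DF_Appointment_CreatedAt DEFAULT (SYSUTCDATETIME())
);
```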
Other data types
Text, ntext, and image
Avoid these types and use the newer replacements
Text, ntext, and image are older data types that require special functions for manipulating the data in them or are simply less efficient than their newer counterparts. These data types are deprecated and will be removed in a future version of SQL Server. Use varchar(max), nvarchar(max), and varbinary(max) to replace them as needed.
Uniqueidentifier (GUID)
Avoid using for the clustered index
I understand the argument for using a GUID as your primary key. It can work for mobile applications, give more control in the application layer, be part of a partitioning strategy, and is used for some replication types. Given those caveats, I recommend against using a GUID as the primary key and definitely stay away from it for clusters. Yes, there is a sequential uniqueidentifier (newsequentialid()) that can be used to reduce fragmentation. But that doesn’t reduce the other issues when using this as your cluster. It is still larger than I like to use, which can cause performance problems, and since the cluster key is stored with each non-clustered index for lookups, it increases the size of all indexes. It generally doesn’t directly represent anything in the business. It can make the attribute unique, but it doesn’t mean anything to a business user. You will still want to define a unique index on the table to enforce business rules. Use GUIDs if they are needed. Avoid them for clusters.
As a data type, uniqueidentifier stores GUIDs more efficiently than the equivalent string representation. Storing as a uniqueidentifier also clarifies the intent of purpose for the attribute and makes it easier to find when searching metadata. It can also be used for replication, front-end applications, and geo partitioning. Use it when necessary, but don’t use it as your cluster.
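One common compromise, shown only as a sketch with hypothetical names: keep a narrow int identity as the clustered primary key, store the GUID with a sequential default, and enforce uniqueness with a nonclustered index.

```sql
CREATE TABLE dbo.Account
(
    AccountId   int              NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_Account PRIMARY KEY CLUSTERED,
    AccountGuid uniqueidentifier NOT NULL
        CONSTRAINT DF_Account_Guid DEFAULT (NEWSEQUENTIALID()),  -- sequential to reduce fragmentation
    CONSTRAINT UQ_Account_Guid UNIQUE NONCLUSTERED (AccountGuid)
);
```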
JSON and XML
Use caution with these types
SQL Server has had native XML support for many versions and the JSON datatype has recently been added as a native type. JSON could be stored in a varchar(max) / nvarchar(max) column before this datatype was introduced, but native support includes JSON indexes in a similar fashion to XML indexes.
These datatypes may be useful for edge cases, but don’t use them to be clever and skip data modeling. It makes everything harder later. Your queries, exports, imports and any other data integrations will all need to shred or parse these datatypes to be useful. This is extra logic and extra processing that can easily be avoided.
The native XML and the new native JSON data types can be tempting for application developers. If you use these data types, the individual columns don’t need to be defined in the database and they can be changed dynamically in the application. The native XML and JSON indexes can be used to improve performance. Don’t get pulled into this design trap. It makes the data model hard to traverse. Leave the XML and JSON in the application layer. I have used it for load tables, but it is actually easier to process outside of SQL Server. Ask why you are using these types. Question whether they should be in the database, whether they should be stored as objects in a storage system, or better yet, whether they should be modeled correctly in the database.
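As a small illustration of the extra shredding work (the document and names are hypothetical), every consumer of a JSON column ends up writing something like this OPENJSON query (SQL Server 2016 and later) before the values are usable:

```sql
DECLARE @doc nvarchar(max) = N'{"orderId": 42, "status": "shipped"}';

SELECT j.OrderId, j.Status
FROM OPENJSON(@doc)
     WITH (OrderId int         '$.orderId',
           Status  varchar(20) '$.status') AS j;
```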
SQL_VARIANT
Use caution with this data type
Sql_variant used as a column type is a code smell. It’s not an absolute “do not use” data type, but you really need to consider the use case and decide if it is appropriate. And if you are making a new table, you need to take a hard look at these attributes before you assign them to a column. If you find yourself considering a clever EAV (entity attribute value) table or pattern and see that you can leverage sql_variant for it, reconsider. Much has already been written warning about using EAV, but this is a reminder to model things correctly and not use this pattern. SQL Server, or any RDBMS, is probably not the right choice for your design if this is the pattern you want to use.
Specialized data types
Specialized types to be used as appropriate
Additional datatypes are available, and it is generally evident when they should be used. These include geography, geometry, hierarchyid, cursors and table types.
Geography and geometry are spatial datatypes, implemented as .NET CLR data types. They can store something as simple as a single geospatial point or an extremely complex geospatial shape. These datatypes can get quite large and have some limitations, such as incompatibility with querying those data types via linked servers. Even with the size of columns and limitations, they are very useful when you need to store spatial data.
The hierarchyid is a very compact datatype useful for storing, as the name suggests, hierarchy information. Hierarchyid values are not assigned automatically and they take some work in the application to apply correctly. They are very efficient, but I have seen a lot of confusion about this type and not many developers are familiar with it. Just because it isn’t a common data type doesn’t mean that you shouldn’t use it, but be aware that you may need to perform additional training and include some guardrails if you use it. Use it as needed, but ensure that the team understands it.
The next two data types can’t be used in column definitions, but are still categorized as data types. You won’t find them in tables, but you will likely see them in queries or other SQL scripts.
As the name suggests, the cursor data type is used to define server-side cursors. The standard caveats apply: cursors should be limited in their usage and avoided completely for normal data retrieval. The only place I find a good use for cursors is admin scripts. If you are using them for other purposes, reconsider; it is likely that there is a better method to access the data. The table data type is used for temporary storage. It is required for multi-statement table-valued functions. For other queries you need to decide between a temporary table and a table variable. The decision process for this can be complicated and often requires testing of the specific scenario to choose the best method. Generally, smaller data sets work well with table variables, while temp tables are useful for larger data sets or when additional indexes are needed. Be aware of the main differences and be open to changing the method based on your testing.
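A minimal sketch of the two temporary-storage options (names are hypothetical):

```sql
-- Table variable: fine for small sets; only declared constraints, no extra indexes.
DECLARE @SmallList TABLE (CustomerId int NOT NULL PRIMARY KEY);

-- Temp table: better for larger sets or when additional indexes help.
CREATE TABLE #Staging (CustomerId int NOT NULL, Region varchar(10) NOT NULL);
CREATE INDEX IX_Staging_Region ON #Staging (Region);
```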
Size Boundaries
Be aware of size boundaries with various data types so you choose the best data type. Select the smallest type that will still allow for all foreseeable growth. Refer to the documentation for size limits. Remember that this isn’t just a storage issue. It has a direct impact on performance.
- Exact Numeric
- Bit (1 byte, shared by up to 8 bit columns, 0-1)
- Tinyint (1 byte, 0-255)
- Smallint (2 bytes, -2^15 (-32,768) to 2^15-1 (32,767))
- Int (4 bytes, -2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647))
- Bigint (8 bytes, -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807))
- Decimal
- Precision 1-9 (5 bytes)
- Precision 10-19 (9 bytes)
- Precision 20-28 (13 bytes)
- Precision 29-38 (17 bytes)
- Date / Datetime
- Date (3 bytes)
- Time (3-5 bytes, depending on precision)
- SmallDateTime (4 bytes)
- Datetime (8 bytes)
- Datetime2 (6-8 bytes, depending on precision)
- Datetimeoffset (8-10 bytes, depending on precision)
- String
- Varchar (8,000 bytes max)
- Nvarchar (4,000 characters max)
- Varchar(max) / nvarchar(max) (up to 2 GB)
- Uniqueidentifier (16 bytes)
- Avoid text / ntext / image / blob
Developers should be able to defend any decision for a column definition. That includes the type, the exact size, and the nullability.
Considerations for NULL columns
Before I get too deep into considerations for null values, the concept of null deserves a quick definition. Null isn’t a specific value; it is an unknown value. It doesn’t indicate a blank string, 0, or a specific date. It is unknown.
Null columns can be a contentious topic. I tend to take a pragmatic view. Null values are inevitable with data imports and even with clean, new databases. With a new application where you have complete design control, it can be hard to get the business to agree to your database rules. They want to implement their business rules. It makes sense, since they are the customer in this scenario. Business rules can often be messy and require some flexibility. Null values help implement this flexibility.
Even if you assiduously avoid null values in tables, you will likely have to deal with them at some level. Take that into consideration with your queries and don’t plan on being able to keep null values completely out of your data.
- Adding a NOT NULL column after a table has been created and populated with data is a more complicated operation. Refer to the documentation for adding columns; a common pattern is sketched after this list.
- If the value is not nullable in the database, then the value should not be nullable anywhere in the code (API, GUI, ETL).
- Ask your business analyst / resource analyst if there is a difference between zero and null. That is, does ‘zero’ mean something in the context of the attribute or column?
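For the first item above, a common approach is to supply a default so existing rows get a value (a sketch with hypothetical names):

```sql
ALTER TABLE dbo.Customer
    ADD IsArchived bit NOT NULL
        CONSTRAINT DF_Customer_IsArchived DEFAULT (0);  -- existing rows are filled with 0
```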
Magic values
One method for eliminating null is via magic values – a value used in place of null. The system might replace a null date column with ‘1/1/1900’ or a null integer with -9999. I’ve never found magic values to be any less troublesome than null. In fact, I find them to be more cumbersome. The value must be known and tested for in the same way as a null value, but it requires business knowledge to know that the magic value is there. You can usually find these magic values by performing some data profiling, but it still often requires business confirmation. I recommend just using null in these cases.
Default values
Default values are another method for eliminating null columns. The problem with defaults is the assumption that the default value is correct when you don’t know the actual value of the column. This works for some attributes, but not all of them. The created and modified date columns are usually safe to assign a default, but other columns may not be. Like magic values, these defaults will be used in aggregates and calculations. Reports, and other queries using these values, will be impacted by their usage. I like to assign default values when possible, but they should come from business requirements.
Modeling to remove NULL
Modeling key tables using vertical partitioning can be used to eliminate null values. This can become excessive and it is a more complicated pattern. Even with vertical partitioning, it is hard to eliminate all nullable columns. Use this pattern as appropriate and as the skill of your teams allow.
Domain lists
Forcing the table to have a value is the obvious solution to null issues – just don’t allow it. This can be easier said than done. I list domain lists / lookup tables since they are the common way to provide users with a list of valid values for these columns.
Knowing all the valid values for a column, as provided in a domain list, isn’t necessarily enough to remove the null definition. It might make it easier for users, but it can’t make unknown values known. This is the difference between knowing what is valid versus not knowing the value at all.
Tri-valued logic
The main concern with null values is that developers need to understand the tri-valued logic required to properly evaluate nullable columns. Any particular null value isn’t equal to another null value. It is unknown, so you have to test for it separately with IS NULL or use a function such as ISNULL or COALESCE to replace the NULL values with a default. This can introduce unintended errors in the code. You can’t realistically remove all nulls, so make sure your teams understand this, know how to evaluate null, and include it in your testing.
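A quick demonstration of the logic (with the default ANSI_NULLS ON setting; the variable is just a stand-in for a nullable column):

```sql
DECLARE @MiddleName varchar(50) = NULL;

SELECT CASE WHEN @MiddleName = NULL  THEN 'matched' ELSE 'no match'  END AS EqualsTest,   -- 'no match'
       CASE WHEN @MiddleName IS NULL THEN 'is null' ELSE 'has value' END AS IsNullTest,   -- 'is null'
       COALESCE(@MiddleName, '(unknown)')                                AS Reported;     -- '(unknown)'
```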
Code smells
Many of these items are covered in previous sections, but each can indicate a problem with the table design. Sometimes they are only a problem when seen together or when a certain threshold has been met.
When I see a table with many varchar(255) columns (or varchar(100), nvarchar(200), etc.), it’s an indication that care wasn’t taken during the design of the table. These columns are an indication that the table was based on a spreadsheet or CSV import and the default values were used.
If the total potential column length for the row exceeds 8000, this is another code smell. It’s not wrong every time, but it is an indicator of a potential issue. Remember that short rows usually perform better – more rows fit on a page and each I/O operation brings back more rows. The exception to this is if the table is defined with sparse columns. That shows intent of purpose and it is probably modeled correctly.
There are some additional patterns I consider a code smell. They should be investigated if you see these on a table. If you are doing a code review and see them on a new table, definitely question the reasoning for the following items.
- Many columns with varchar(100) / varchar(255), etc.
- Total column length over 8000
- XML
- JSON
- Sql_variant
- Uniqueidentifier on the cluster / PK
- Attribute with different data types between entities
- Only Unicode columns used
Notes for conversions
Data isn’t perfect, especially when you aren’t the source of truth for that data. This makes conversions between types inevitable. Conversions should happen as close to the source as is reasonable. I like to import data to load tables unmodified. This makes testing against the source system much easier. But this raw data is never exposed to end users and conversions happen as data leaves these staging tables.
It’s standard to refine data types during import processes or ETL. The usual reasons are that the import format is text only (like a CSV file), that enterprise standards must be met, or that the data must align with existing database columns. These conversions should happen as close to the source as possible in your ETL. If you use a load table, I wouldn’t do the transformation during that load, but I would convert the data type as you extract from there. If you import directly to the final table, of course you must do the transformation during the import process.
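A sketch of what that extraction step might look like, assuming a raw text staging table and a typed destination (all object names are hypothetical); TRY_CONVERT returns NULL instead of failing when a value doesn’t convert cleanly:

```sql
INSERT INTO dbo.Sale (SaleDate, Amount)
SELECT TRY_CONVERT(date, s.SaleDate, 112),        -- yyyymmdd text from the file
       TRY_CONVERT(decimal(10, 2), s.Amount)
FROM staging.SaleLoad AS s;
```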
Making the conversion as close to the source as possible makes it easier to keep the attribute definition consistent in the database. This helps with foreign keys and joins. Data types should match between tables for the same columns. This is forced with foreign keys. Lookups that aren’t explicit FKs won’t have this protection – one more reason to declare these constraints.
Exports and system integrations become easier when the correct data types are used. If you do the analysis on the source of truth, it alleviates that work in connected systems. Data can be fixed by placing a view layer over the poorly modeled tables in your system, but I strongly believe the actual data type in the table should be correct. If you extract the data with a view or stored procedure, the conversion should happen there. Fix the data type as close to the source as possible.
Consistency
It is often better to be consistent than correct. If the data governance team is “wrong” or all other systems are not modeled correctly, you only create more problems by fixing the issue in your system. It becomes harder to match and confuses both users and developers. That inconsistency can make problems even worse than the initial bad modeling. Work with your architects at an enterprise level before you become the outlier.
Additional thoughts
Row and page compression remove some of the performance concerns with improper modeling. Even if using row or page compression, data should still be typed correctly. Intent of purpose is a sufficient reason to spend the time designing things properly.
I’d argue that a good design is reason enough to type your attributes correctly. If you get satisfaction from seeing a “correct” system, you understand this. If not, good luck to your team. I hope the performance reasons and ease-of-use reasons are convincing enough. If you’ve read this far, we probably agree.
Collations
Collations can be set at the column level, but the database-level collation (or the server default) should be used most of the time. Setting column-level collations should be the exception, reserved for specific use cases. Set the collation at the highest level possible. Differing collations can cause issues with comparisons between columns and make working with the data harder.
Summary
Choosing the correct data type can be tedious, but it’s much more tedious to fix it later. It’s also more time consuming to adjust your queries, ETL, reports and data objects to work around a bad decision. It’s much easier to take your time and choose correctly at the start of a data project. This post outlines my decision process and basic rules, but the most important thing is being consistent and representing the data correctly. Your rules and guidelines may differ, but you should have logical rules that you follow for your environment. Go forth and do the right thing.