Eliminating Annoying Space Characters Via Check Constraints

Comments 0

Share to social media

Along with designing databases, I do a lot of ETL from, well, let’s call them “third party systems that I did not design or implement.” That may seem arrogant, but hey, I write books about database design so I should at least seem like I am able to do things right. In either case, when writing ETL packages to transform data from one system to another, data integrity issues will drive you nuts.

One of the biggest offenders turns out to be the simplest to stop in your OLTP systems. Leading and trailing spaces, and columns that are 100% space characters. Now, I am not at all saying that such data ought to be made invalid by the people who implement SQL Server, nor would I consider this something to beg the SQL standards committees to handle. As with the design of any piece of software, it is imperative to allow for the most lenient data that is acceptable. And all three situations we will cover tonight are acceptable in some small edge cases, but for most data, really not at all.

One part of the problem is that (at least in SQL Server,) comparisons with spaces quite conveniently handle trailing spaces. Multiple trailing space characters are treated as equivalent to one another in a comparison. For example:

SELECT CASE WHEN 'Fred' = 'Fred ' THEN 'Same' ELSE 'Different' END AS Comparison;

Intuitively, you probably expect this to return ‘Different’, but SQL behaves different than you expect (something I find I write in my blog and say a lot more than you might expect). It returns same.

Comparison
----------
Same

And even if you do the following, with LOTS of trailing space characters, or an empty string:

SELECT CASE WHEN 'Fred' + REPLICATE(45,' ') = 'Fred' + REPLICATE(10000,' ')
              
THEN 'Same' ELSE 'Different' END AS Comparison;
SELECT CASE WHEN '' = REPLICATE(10000,' ')
               THEN 'Same' ELSE 'Different' END AS Comparison;

Well, same thing, these two queries returns Same. Trailing space characters aren’t considered in the comparison. An empty string is equal in a comparison to any number of spaces. Leading spaces, on the other hand are compared as different, thankfully:

SELECT CASE WHEN ' Fred' = 'Fred'
               THEN 'Same' ELSE 'Different' END AS Comparison;

This returns:

Comparison
----------
Different

So at least that behaves as expected. The problem is, a user enters: LastName = ‘Fred ‘, FirstName = ‘Jones ‘. In a SQL query, WHERE LastName = ‘Fred’ will match this row. But if you are ever using another tool to fetch the data, for example SSIS, generally speaking ‘Fred’ will not be equivalent to ‘Fred ‘. This will probably lighten the hair color on your head to start with, as you realize that this means that EVERY place you do a comparison, you will need to do a TRIM on the data (and tools like SSIS are case sensitive too, though this actually will be in your favor.) Code generation can help, SQL based, BIML based, whatever. But what really helps is if the source data is not contaminated with undesired space characters to start with.

If you own the source system (or feel it worthwhile to add constraints to a system you do not own,) one thing you can do is make sure that the user interface and object code trims the data, and if the column is empty or 100% filled with spaces, change it to NULL. Then, use a CHECK constraint to make 100% sure it happens.

Something like the following, starting by creating a simple table:

CREATE SCHEMA Demo;
GO
CREATE TABLE Demo.Person
(
    PersonId int IDENTITY NOT NULL CONSTRAINT PKPerson PRIMARY KEY,
    FirstName nvarchar(50) NOT NULL,
    MiddleName nvarchar(50) NULL,
    LastName nvarchar(50) NOT NULL
)

Then, let’s say the requirements are as follows:

1. No values that are either empty or only spaces
2. No leading spaces
3. No trailing spaces
4. Allow NULL if column allows NULL

Let’s look at how we could implement all of these independently, as there certainly are cases where you may wish to allow any or all of the situations in a column.

For #1, we can use a couple of things. Either: CHECK (Column <> ”) or any number of spaces, as we have discussed. Or you can use: LEN(Column) <> 0 as LEN(‘     ‘) will result in 0 (Use DATA_LENGTH if you want to see a size of the value, but remember that this deals with physical sizes, so Unicode data will have 2 (or VERY rarely, 3) bytes per character.

For #2, we can fail if LEN(LTRIM(Column)) <> LEN(Column). Blank will pass this check, but not ‘ Fred’.

For #3, it may seem like we could swap out the LTRIM for RTRIM, but LEN(‘Fred’) will return the same value as LEN(‘Fred ‘). So one thing you can do is fail if LEN(‘*’ + RTRIM(Column) + ‘*’) <> LEN(‘*’ + Column + ‘*’), because now the expression will never have an issue with leading or trailing spaces.

For #4, it turns out to be baked into the definition of a nullable column. If the column allows NULL, and the CHECK constraint’s expression returns NULL, then the operation succeeds. If the column does not allow NULL values, then the operation succeeds only if the expression returns TRUE.

So for our table, let’s implement 3 CHECK constraints. First checking that the LEN <> 0, then that the length trimmed is the same as the untrimmed length. Reduce the checks if you just want to eliminate one of the cases.:

ALTER TABLE Demo.Person --USING TRIM, that is new to 2017,
                        --in earlier use RTRIM(LTRIM(

   ADD CONSTRAINT CHK_Person$FirstName CHECK(LEN(FirstName) <> 0
      AND LEN('*' + FirstName + '*') = LEN('*' + TRIM(FirstName) + '*'));

ALTER TABLE Demo.Person 
   ADD CONSTRAINT CHK_Person$MiddleName CHECK(LEN(MiddleName) <> 0
      AND LEN('*' + MiddleName + '*') = LEN('*' + TRIM(MiddleName) + '*'));

ALTER TABLE Demo.Person 
   ADD CONSTRAINT CHK_Person$LastName CHECK(LEN(LastName) <> 0
      AND LEN('*' + LastName + '*') = LEN('*' + TRIM(LastName) + '*'));
GO

Now let’s test. First, rows that should work:

INSERT INTO Demo.Person(FirstName, MiddleName, LastName)
VALUES(N'Lenny', N'H', N'McGoonie');

INSERT INTO Demo.Person( FirstName, MiddleName, LastName)
VALUES( N'Ron', NULL, N'Boo');

And a few rows that should not

INSERT INTO Demo.Person( FirstName, MiddleName, LastName)
VALUES(N'', NULL, N'Boo'); --Empty FirstName

INSERT INTO Demo.Person(FirstName, MiddleName, LastName)
VALUES(N' Fred', NULL, N'Boo'); --Leading space

INSERT INTO Demo.Person(FirstName, MiddleName, LastName)
VALUES( N'Fred ', NULL, N'Boo'); --Trailing space

Now, all three of these will fail with the same error message, showing it worked.:

Msg 547, Level 16, State 0, Line 92
The INSERT statement conflicted with the CHECK constraint "CHK_Person$FirstName". The conflict occurred in database "tempdb", table "Demo.Person", column 'FirstName'.

Lastly though, you have to wonder, with these three ugly looking constraints, will it have an effect on performance. So let’s insert some data from the WideWorldImporters database and see. I will just use some name-like data from the Application.People table (which ironically has data in the PreferredName column that has trailing spaces, probably for demonstrating exactly what I am.)

SELECT '*' + PreferredName + '*'
FROM   WideWorldImporters.Application.People
WHERE  NOT(LEN(PreferredName) <> 0
       AND LEN('*' + PreferredName + '*')
                   = LEN('*' + TRIM(PreferredName) + '*'))

This returns the following data (which we will not want to insert or it will fail!):
*Isabell *
*Beau *
*Bernadette *
*Danielle *
*Kristophe *
*Madelaine *
*Alexandre *
*Annette *

I will, for this case, do two tests, one by inserting over a million rows in a single statement, and then over 360K rows, row by row, with and without constraints. This will give you a rough feeling for the difference in performance. This will be done on a Surface Pro 4, dual core i7, 256GB SSD.

For the single statement test, I will use the following, and used the elapsed time from STATISTICS TIME:

TRUNCATE TABLE Demo.Person;

SET STATISTICS TIME ON;
INSERT INTO Demo.Person(FirstName, MiddleName, LastName)
SELECT TRIM(People.FullName), 'H', TRIM(People.PreferredName)
       -- TRIM in because some data failed the test

FROM WideWorldImporters.Application.People
        CROSS JOIN sys.types --34 rows to give us 37774 rows
        CROSS JOIN sys.types AS types2--34 rows to give us 1284316 rows
SET STATISTICS TIME OFF;

With the constraints on, this took the following amount of time. Run it several times, we don’t want any caching to get in the way, and ideally all of the data will be cached so the primary time spend is in the creating data and checking constraints. It took around 3.3 seconds.

SQL Server Execution Times:
CPU time = 3328 ms, elapsed time = 3382 ms.

Then I dropped the constraints (though you could also disable them.)

ALTER TABLE Demo.Person
    DROP CONSTRAINT CHK_Person$FirstName; 
ALTER TABLE Demo.Person
    DROP CONSTRAINT CHK_Person$MiddleName; 
ALTER TABLE Demo.Person
    DROP CONSTRAINT CHK_Person$LastName;

And reexecute the TRUNCATE and INSERT, and I get the following time

SQL Server Execution Times:
CPU time = 1782 ms, elapsed time = 2328 ms.

So did it TAKE longer? Of course, it had to do more work. To insert over 1.2 million rows, this took 1 more second to insert and check the data. Worth it? Generally speaking, if it keeps you from having to recheck it EVERY STINKING TIME YOU USE THE DATA, then yes, this one hit is probably worth it. But your usages will differ, and it may be too much overhead in your usage. 

I also did a quick test in a loop, to see what the difference was doing row by row inserts. I did this with 1 million loops (there are only 1111 rows in the Person table, with a max of 3261 for the key), and after 340K rows, without constraints, 24 seconds, with 29. So it is overhead, it just depends on how much data you have to load, and how much cleaning you have to do to get things done.

The code for those tests follows, along with detailed results. 

DECLARE @I int = 0, @startTime datetime2(7) = SYSDATETIME();
TRUNCATE TABLE Demo.Person;
SET NOCOUNT ON

WHILE @I < 1000000
BEGIN
    INSERT INTO Demo.Person(FirstName, MiddleName, LastName)
    SELECT TRIM(People.FullName), 'H', TRIM(People.PreferredName) 
    FROM WideWorldImporters.Application.People
    WHERE PersonID = @I % 3261 + 1; --from minimum 1 to max personId in table 3261

    SET @I = @I + 1;
END;

SELECT COUNT(*) AS RowsInserted, DATEDIFF(MILLISECOND, @startTime, SYSDATETIME()) / 1000.0 AS ElapsedSeconds
FROM Demo.Person;
GO

RowsInserted ElapsedSeconds (with constraints)
------------ ---------------------------------------
340548       29.291000

RowsInserted ElapsedSeconds (without constraints)
------------ ---------------------------------------
340548       25.492000