Normalize strings to optimize space and searches

Comments 0

Share to social media

While this article is specifically geared to SQL Server, the concepts apply to any relational database platform.

The Stack Exchange network logs a lot of web traffic – even compressed, we average well over a terabyte per month. And that is just a summarized cross-section of our overall raw log data, which we load into a database for downstream security and analytical purposes. Every month has its own table, allowing for partitioning-like sliding windows and selective indexes without the additional restrictions and management overhead. (Taryn Pratt talks about these tables in great detail in her post, Migrating a 40TB SQL Server Database.)

It’s no surprise that our log data is massive, but could it be smaller? Let’s take a look at a few typical rows. While these are not all of the columns or the exact column names, they should give an idea why 50 million visitors a month on Stack Overflow alone can add up quickly and punish our storage:

Now, imagine analysts searching against that data – let’s say, looking for patterns on January 1st for Mathematics and Mathematics Meta. They will most likely write a WHERE clause like this:

These quite expectedly lead to awful execution plans with a residual predicate for the string match on every row in the date range – even if only three rows match. We can demonstrate this with a table and some data:

Now, we run the following query (SELECT * for brevity):

And sure enough, we get a plan that looks okay, especially in Management Studio – it has an index seek and everything! But that is misleading, because that index seek is doing a lot more work than it should. And in the real world, those reads and other costs are much higher, both because there are more rows to ignore, and every row is a lot wider:

Plan and seek details for LIKE against original index

So, then, what if we create an index on CreationDate, HostName?

This doesn’t help. We still get the same plan, because whether or not the HostName shows up in the key, the bulk of the work is filtering the rows to the date range, and then checking the string is still just residual work against all the rows:

Plan and seek details for LIKE against different index

If we could coerce users to use more index-friendly queries without leading wildcards, like this, we may have more options:

In this case, you can see how an index (at least leading) on HostName might help:

But this will only be effective if we can coerce users to change their habits. If they keep using LIKE, the query will still choose the older index (and results aren’t any better even if you force the new index). But with an index leading on HostName and a query that caters to that index, the results are drastically better:

Plan and seek details for IN against better index

This is not overly surprising, but I do empathize that it can be hard to persuade users to change their query habits and, even when you can, it can also be hard to justify creating additional indexes on already very large tables. This is especially true of log tables, which are write-heavy, append-only, and read-seldom.

When I can, I would rather focus on making those tables smaller in the first place.

Compression can help to an extent, and columnstore indexes, too (though they bring additional considerations). You may even be able to halve the size of this column by convincing your team that your host names don’t need to support Unicode.

For the adventurous, I recommend better normalization!

Think about it: If we log a million times today, does it make sense to store 44 bytes, a million times? I would rather store that value in a lookup table once, the first time we see it, and then use a surrogate identifier that can be much smaller. In our case, since we know we won’t ever have over 32,000 Q&A sites, we can use smallint (2 bytes) – saving 22 bytes per row (or more, for longer host names). Picture this slightly different schema:

Then the app maps the host name to its identifier (or inserts a row if it can’t find one), and inserts the id into the log table instead:

First, let’s compare the sizes:

Space details for more normalized design

That’s a 36% reduction in space. When we’re talking about terabytes of data, this can drastically reduce storage costs and significantly extend the utility and lifetime of existing hardware.

What about our queries?

The queries are slightly more complicated, because now you have to perform a join:

In this case, using LIKE isn’t a disadvantage unless the host names table becomes massive. The plans are almost identical, except that the LIKE variation can’t perform a seek. (Both are a little more horrible at estimates, but I don’t see how being any worse off the mark here will ever pick any other plan.)

Plans against more normalized design

This is one scenario where the difference between a seek and a scan is such a small part of the plan it is unlikely to be significant.

This strategy still requires analysts and other end users to change their query habits a little, since they need to add the join compared to when all of the information was in a single table. But you can make this change relatively transparent to them by offering views that simplify the joins away:

Never use SELECT * in a view. 🙂

Querying the view, like this, still leads to the same efficient plans, and lets your users use almost exactly the same query they’re used to:

Existing infrastructure vs. greenfield

Even scheduling the maintenance to change the way the app writes to the log table – never mind actually making the change – can be a challenge. This is especially difficult if logs are merely bulk inserted or come from multiple sources. It may be sensible to use a new table (perhaps with partitioning) and start adding new data going forward in a slightly different format, and leaving the old data as is, using a view to bridge them.

That could be complex, and I don’t mean to trivialize this change. I wanted to post this more as a “lessons learned” kind of thing – if you are building a system where you will be logging a lot of redundant string information, think about a more normalized design from the start. It’s much easier to build it this way now than to shoehorn it later… when it may well be too big and too late to change. Just be careful to not treat all potentially repeating strings as repeating enough to make normalization pay off. For host names at Stack Overflow, we have a predictably finite number, but I wouldn’t consider this for user agent strings or referrer URLs, for example, since that data is much more volatile and the number of unique values will be a large percentage of the entire set.