{"id":97577,"date":"2023-10-02T14:54:18","date_gmt":"2023-10-02T14:54:18","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=97577"},"modified":"2023-09-19T18:39:09","modified_gmt":"2023-09-19T18:39:09","slug":"normalize-strings-to-optimize-space-and-searches","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/performance-sql-server\/normalize-strings-to-optimize-space-and-searches\/","title":{"rendered":"Normalize strings to optimize space and searches"},"content":{"rendered":"<p><em>While this article is specifically geared to SQL Server, the concepts apply to any relational database platform.<\/em><\/p>\n<p>The Stack Exchange network logs <strong>a lot<\/strong> of web traffic &#8211; even compressed, we average well over a terabyte per month. And that is just a <em>summarized<\/em> cross-section of our overall raw log data, which we load into a database for downstream security and analytical purposes. Every month has its own table, allowing for partitioning-like sliding windows and selective indexes without the additional restrictions and management overhead. (Taryn Pratt talks about these tables in great detail in her post, <a href=\"https:\/\/www.tarynpivots.com\/post\/migrating-40tb-sql-server-database\/\" target=\"_blank\" rel=\"noopener\">Migrating a 40TB SQL Server Database<\/a>.)<\/p>\n<p>It&#8217;s no surprise that our log data is massive, but could it be smaller? Let&#8217;s take a look at a few typical rows. While these are not all of the columns or the exact column names, they should give an idea why <a href=\"https:\/\/stackoverflow.co\/advertising\/audience\/\" target=\"_blank\" rel=\"noopener\">50 million visitors a month<\/a> on Stack Overflow alone can add up quickly and punish our storage:<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:12 line-height:16 decode-attributes:false tab-convert:true lang:none decode:true whitespace-before:1 whitespace-after:1\">  HostName                     Uri               QueryString     CreationDate    ...\r\n  nvarchar(250)                nvarchar(2048)    nvarchar(2048)  datetime\r\n  ---------------------------  ----------------  --------------  --------------\r\n  math.stackexchange.com       \/questions\/show\/                  2023-01-01 ...\r\n  math.stackexchange.com       \/users\/56789\/     ?tab=active     2023-01-01 ...\r\n  meta.math.stackexchange.com  \/q\/98765\/         ?x=124.54       2023-01-01 ... \r\n<\/pre>\n<p>Now, imagine analysts searching against that data &#8211; let&#8217;s say, looking for patterns on January 1st for Mathematics and Mathematics Meta. They will most likely write a <code style=\"font-family: consolas; background: #e4e4e4; padding: 3px 4px 1px 4px; border-radius: 3px;\">WHERE<\/code> clause like this:<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\">  WHERE CreationDate &gt;= '20230101' \r\n    AND CreationDate &lt;  '20230102'\r\n    AND HostName LIKE N'%math.stackexchange.com';<\/pre>\n<p>These quite expectedly lead to awful execution plans with a residual predicate for the string match on every row in the date range &#8211; even if only three rows match. We can demonstrate this with a table and some data:<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\"> CREATE TABLE dbo.GinormaLog_OldWay\r\n (\r\n   Id           int IDENTITY(1,1),\r\n   HostName     nvarchar(250),\r\n   Uri          nvarchar(2048),\r\n   QueryString  nvarchar(2048),\r\n   CreationDate datetime,\r\n   \/* other columns *\/\r\n   CONSTRAINT PK_GinormaLog_OldWay PRIMARY KEY(Id)\r\n );\r\n\r\n \/* populate a few rows to search for (the needles) *\/\r\n\r\n DECLARE @math nvarchar(250) = N'math.stackexchange.com';\r\n DECLARE @meta nvarchar(250) = N'meta.' + @math;\r\n\r\n INSERT dbo.GinormaLog_OldWay(HostName, Uri, QueryString, CreationDate)\r\n VALUES\r\n   (@math, N'\/questions\/show\/', N'',            '20230101 01:24'),\r\n   (@math, N'\/users\/56789\/',    N'?tab=active', '20230101 01:25'),\r\n   (@meta, N'\/q\/98765\/',        N'?x=124.54',   '20230101 01:26'); \r\n GO\r\n\r\n \/* put some realistic other traffic (the haystack) *\/\r\n\r\n INSERT dbo.GinormaLog_OldWay(HostName, Uri, QueryString, CreationDate)\r\n SELECT TOP (1000) N'stackoverflow.com', CONCAT(N'\/q\/', ABS(object_id % 10000)),\r\n   CONCAT(N'?x=', name), DATEADD(minute, column_id, '20230101')\r\n   FROM sys.all_columns;\r\n GO 10\r\n\r\n INSERT dbo.GinormaLog_OldWay(HostName, Uri, QueryString, CreationDate)\r\n SELECT TOP (1000) N'stackoverflow.com', CONCAT(N'\/q\/', ABS(object_id % 10000)),\r\n   CONCAT(N'?x=', name), DATEADD(minute, -column_id, '20230125')\r\n   FROM sys.all_columns;\r\n GO\r\n\r\n \/* create an index to cater to many searches *\/\r\n\r\n CREATE INDEX DateRange\r\n   ON dbo.GinormaLog_OldWay(CreationDate) \r\n   INCLUDE(HostName, Uri, QueryString);\r\n<\/pre>\n<p>Now, we run the following query (<code style=\"font-family: consolas; background: #e4e4e4; padding: 3px 4px 1px 4px; border-radius: 3px;\">SELECT *<\/code> for brevity):<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\"> SELECT * \r\n   FROM dbo.GinormaLog_OldWay\r\n   WHERE CreationDate &gt;= '20230101' \r\n     AND CreationDate &lt;  '20230102'\r\n     AND HostName LIKE N'%math.stackexchange.com';<\/pre>\n<p>And sure enough, we get a plan that <em>looks<\/em> okay, especially in Management Studio &#8211; it has an index seek and everything! But that is misleading, because that index seek is doing a lot more work than it should. And in the real world, those reads and other costs are much higher, both because there are more rows to ignore, <em>and<\/em> every row is a lot wider:<\/p>\n<p><a href=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/nrm_plan_1.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" style=\"border: 1px solid #aaa;\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/nrm_plan_1.png\" alt=\"Plan and seek details for LIKE against original index\" width=\"75%\" \/><\/a><\/p>\n<p>So, then, what if we create an index on <code style=\"font-family: consolas; background: #e4e4e4; padding: 3px 4px 1px 4px; border-radius: 3px;\">CreationDate, HostName<\/code>?<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\"> CREATE INDEX Date_HostName\r\n   ON dbo.GinormaLog_OldWay(CreationDate, HostName) \r\n   INCLUDE(Uri, QueryString);<\/pre>\n<p>This doesn&#8217;t help. We still get the same plan, because whether or not the <code style=\"font-family: consolas; background: #e4e4e4; padding: 3px 4px 1px 4px; border-radius: 3px;\">HostName<\/code> shows up in the key, the bulk of the work is filtering the rows to the date range, and then checking the string is still just residual work against all the rows:<\/p>\n<p><a href=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/nrm_plan_2.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" style=\"border: 1px solid #aaa;\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/nrm_plan_2.png\" alt=\"Plan and seek details for LIKE against different index\" width=\"75%\" \/><\/a><\/p>\n<p>If we could coerce users to use more index-friendly queries <em>without<\/em> leading wildcards, like this, we may have more options:<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\"> WHERE CreationDate &gt;= '20230101' \r\n   AND CreationDate &lt;  '20230102'\r\n   AND HostName IN \r\n   (\r\n      N'math.stackexchange.com', \r\n      N'meta.math.stackexchange.com'\r\n   );<\/pre>\n<p>In this case, you can see how an index (at least leading) on <code style=\"font-family: consolas; background: #e4e4e4; padding: 3px 4px 1px 4px; border-radius: 3px;\">HostName<\/code> <em>might<\/em> help:<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\"> CREATE INDEX HostName\r\n   ON dbo.GinormaLog_OldWay(HostName, CreationDate) \r\n   INCLUDE(Uri, QueryString);<\/pre>\n<p>But this will only be effective if we can coerce users to change their habits. If they keep using <code style=\"font-family: consolas; background: #e4e4e4; padding: 3px 4px 1px 4px; border-radius: 3px;\">LIKE<\/code>, the query will still choose the older index (and results aren&#8217;t any better even if you force the new index). But with an index leading on <code style=\"font-family: consolas; background: #e4e4e4; padding: 3px 4px 1px 4px; border-radius: 3px;\">HostName<\/code> <em>and<\/em> a query that caters to that index, the results are drastically better:<\/p>\n<p><a href=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/nrm_plan_3.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" style=\"border: 1px solid #aaa;\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/nrm_plan_3.png\" alt=\"Plan and seek details for IN against better index\" width=\"75%\" \/><\/a><\/p>\n<p>This is not overly surprising, but I do empathize that it can be hard to persuade users to change their query habits and, even when you can, it can also be hard to justify creating additional indexes on <em>already very large tables<\/em>. This is especially true of log tables, which are write-heavy, append-only, and read-seldom.<\/p>\n<p>When I can, <b>I would rather focus on making those tables smaller in the first place.<\/b><\/p>\n<p>Compression can help to an extent, and columnstore indexes, too (though they bring additional considerations). You may even be able to halve the size of this column by convincing your team that <em>your<\/em> host names don&#8217;t need to support Unicode.<\/p>\n<p><b>For the adventurous, I recommend better normalization!<\/b><\/p>\n<p>Think about it: If we log <code style=\"font-family: consolas; background: #e4e4e4; padding: 3px 4px 1px 4px; border-radius: 3px;\">math.stackexchange.com<\/code> a million times today, does it make sense to store 44 bytes, <em>a million times<\/em>? I would rather store that value in a lookup table <em>once<\/em>, the first time we see it, and then use a surrogate identifier that can be much smaller. In our case, since we know we won&#8217;t ever have over 32,000 Q&amp;A sites, we can use <code style=\"font-family: consolas; background: #e4e4e4; padding: 3px 4px 1px 4px; border-radius: 3px;\">smallint<\/code> (2 bytes) &#8211; saving 22 bytes per row (or more, for longer host names). Picture this slightly different schema:<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\"> \r\n CREATE TABLE dbo.NormalizedHostNames\r\n (\r\n   HostNameId smallint identity(1,1),\r\n   HostName nvarchar(250) NOT NULL,\r\n   CONSTRAINT PK_NormalizedHostNames PRIMARY KEY(HostNameId),\r\n   CONSTRAINT UQ_NormalizedHostNames UNIQUE(HostName)\r\n );\r\n\r\n INSERT dbo.NormalizedHostNames(HostName)\r\n VALUES(N'math.stackexchange.com'),\r\n       (N'meta.math.stackexchange.com'),\r\n       (N'stackoverflow.com');\r\n\r\n CREATE TABLE dbo.GinormaLog_NewWay\r\n (\r\n   Id           int identity(1,1) NOT NULL,\r\n   HostNameId   smallint NOT NULL FOREIGN KEY \r\n                REFERENCES dbo.NormalizedHostNames(HostNameId),\r\n   Uri          nvarchar(2048),\r\n   QueryString  nvarchar(2048),\r\n   CreationDate datetime,\r\n   \/* other columns *\/\r\n   CONSTRAINT PK_GinormaLog_NewWay PRIMARY KEY(Id)\r\n );\r\n\r\n CREATE INDEX HostName\r\n   ON dbo.GinormaLog_NewWay(HostNameId, CreationDate) \r\n   INCLUDE(Uri, QueryString);\r\n<\/pre>\n<p>Then the app maps the host name to its identifier (or inserts a row if it can&#8217;t find one), and inserts the id into the log table instead:<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\"> \/* populate a few rows to search for (the needles) *\/\r\n\r\n INSERT dbo.GinormaLog_NewWay(HostNameId, Uri, QueryString, CreationDate)\r\n VALUES\r\n   (1, N'\/questions\/show\/', N'',            '20230101 01:24'),\r\n   (1, N'\/users\/56789\/',    N'?tab=active', '20230101 01:25'),\r\n   (2, N'\/q\/98765\/',        N'?x=124.54',   '20230101 01:26'); \r\n GO\r\n\r\n \/* put some realistic other traffic (the haystack) *\/\r\n\r\n INSERT dbo.GinormaLog_NewWay(HostNameId, Uri, QueryString, CreationDate)\r\n SELECT TOP (1000) 3, CONCAT(N'\/q\/', ABS(object_id % 10000)),\r\n   CONCAT(N'?x=', name), DATEADD(minute, column_id, '20230101')\r\n   FROM sys.all_columns;\r\n GO 10\r\n\r\n INSERT dbo.GinormaLog_NewWay(HostNameId, Uri, QueryString, CreationDate)\r\n SELECT TOP (1000) 3, CONCAT(N'\/q\/', ABS(object_id % 10000)),\r\n   CONCAT(N'?x=', name), DATEADD(minute, -column_id, '20230125')\r\n   FROM sys.all_columns;<\/pre>\n<p><b>First, let&#8217;s compare the sizes:<\/b><\/p>\n<p><a href=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/nrm_spaceused.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" style=\"border: 1px solid #aaa;\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/nrm_spaceused.png\" alt=\"Space details for more normalized design\" width=\"50%\" \/><\/a><\/p>\n<p>That&#8217;s a <b>36% reduction<\/b> in space. When we&#8217;re talking about terabytes of data, this can drastically reduce storage costs and significantly extend the utility and lifetime of existing hardware.<\/p>\n<p><b>What about our queries?<\/b><\/p>\n<p>The queries are slightly more complicated, because now you have to perform a join:<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\"> SELECT * \r\n FROM dbo.GinormaLog_NewWay AS nw\r\n INNER JOIN dbo.NormalizedHostNames AS hn\r\n   ON nw.HostNameId = hn.HostNameId\r\n WHERE nw.CreationDate &gt;= '20230101' \r\n   AND nw.CreationDate &lt;  '20230102'\r\n   AND hn.HostName LIKE N'%math.stackexchange.com';\r\n   \/* -- or \r\n   AND (hn.HostName IN (N'math.stackexchange.com',\r\n        N'meta.math.stackexchange.com')); *\/<\/pre>\n<p>In this case, using <code style=\"font-family: consolas; background: #e4e4e4; padding: 3px 4px 1px 4px; border-radius: 3px;\">LIKE<\/code> isn&#8217;t a disadvantage unless the host names table becomes massive. The plans are almost identical, except that the <code style=\"font-family: consolas; background: #e4e4e4; padding: 3px 4px 1px 4px; border-radius: 3px;\">LIKE<\/code> variation can&#8217;t perform a seek. (Both are a little more horrible at estimates, but I don&#8217;t see how being any worse off the mark here will ever pick any other plan.)<\/p>\n<p><a href=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/nrm_plan4.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" style=\"border: 1px solid #aaa;\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/09\/nrm_plan4.png\" alt=\"Plans against more normalized design\" width=\"75%\" \/><\/a><\/p>\n<p>This is one scenario where the difference between a seek and a scan is such a small part of the plan it is unlikely to be significant.<\/p>\n<p>This strategy still requires analysts and other end users to change their query habits <em>a little<\/em>, since they need to add the join compared to when all of the information was in a single table. But you can make this change relatively transparent to them by offering views that simplify the joins away:<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\"> CREATE VIEW dbo.vGinormaLog\r\n AS\r\n   SELECT nw.*, hn.HostName\r\n   FROM dbo.GinormaLog_NewWay AS nw\r\n   INNER JOIN dbo.NormalizedHostNames AS hn\r\n     ON nw.HostNameId = hn.HostNameId;<\/pre>\n<p><em style=\"font-size: 0.875rem;\">Never use SELECT * in a view. \ud83d\ude42<\/em><\/p>\n<p>Querying the view, like this, still leads to the same efficient plans, and lets your users use almost exactly the same query they&#8217;re used to:<\/p>\n<pre class=\"theme:tomorrow-night font:consolas font-size:14 line-height:16 decode-attributes:false tab-convert:true lang:tsql decode:true whitespace-before:1 whitespace-after:1\"> SELECT * FROM dbo.vGinormaLog\r\n   WHERE CreationDate &gt;= '20230101' \r\n     AND CreationDate &lt;  '20230102'\r\n     AND HostName LIKE N'%math.stackexchange.com';<\/pre>\n<p><b>Existing infrastructure vs. greenfield<\/b><\/p>\n<p>Even scheduling the maintenance to change the way the app writes to the log table &#8211; never mind actually making the change &#8211; can be a challenge. This is especially difficult if logs are merely bulk inserted or come from multiple sources. It may be sensible to use a new table (perhaps with partitioning) and start adding new data going forward in a slightly different format, and leaving the old data as is, using a view to bridge them.<\/p>\n<p>That could be complex, and I don&#8217;t mean to trivialize this change. I wanted to post this more as a &#8220;lessons learned&#8221; kind of thing &#8211; if you are building a system where you will be logging a lot of redundant string information, think about a more normalized design <b>from the start<\/b>. It&#8217;s much easier to build it this way now than to shoehorn it later&#8230; when it may well be too big and too late to change. Just be careful to not treat <em>all<\/em> potentially repeating strings as <em>repeating enough to make normalization pay off<\/em>. For host names at Stack Overflow, we have a predictably finite number, but I wouldn&#8217;t consider this for user agent strings or referrer URLs, for example, since that data is much more volatile and the number of unique values will be a large percentage of the entire set.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>While this article is specifically geared to SQL Server, the concepts apply to any relational database platform. The Stack Exchange network logs a lot of web traffic &#8211; even compressed, we average well over a terabyte per month. And that is just a summarized cross-section of our overall raw log data, which we load into&#8230;&hellip;<\/p>\n","protected":false},"author":341115,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[53,143529,143531],"tags":[5661],"coauthors":[158980],"class_list":["post-97577","post","type-post","status-publish","format-standard","hentry","category-featured","category-performance-sql-server","category-t-sql-programming-sql-server","tag-normalization"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/97577","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/341115"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=97577"}],"version-history":[{"count":67,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/97577\/revisions"}],"predecessor-version":[{"id":103639,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/97577\/revisions\/103639"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=97577"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=97577"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=97577"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=97577"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}