{"id":73222,"date":"2013-10-16T08:39:25","date_gmt":"2013-10-16T08:39:25","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/uncategorized\/histograms-part-3-when\/"},"modified":"2021-07-14T13:07:37","modified_gmt":"2021-07-14T13:07:37","slug":"histograms-part-3-when","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/oracle-databases\/histograms-part-3-when\/","title":{"rendered":"Histograms Part 3 \u2013 When?"},"content":{"rendered":"<p>In <a href=\"https:\/\/allthingsoracle.com\/histograms-pt-2\/\" title=\"Histograms Part 2\">part 2<\/a> we looked at the way that Oracle collects and uses height \u2013balanced histograms, and raised the problem of how sensitive they could be to small changes in the sampling, or in the data. We ended the article with a suggestion on how you could generate data for an \u201capproximate\u201d frequency histogram \u00a0that could be \u201cgood enough\u201d to help the optimizer do the right thing.<\/p>\n<p>In part 3, then, we\u2019re going to look at a couple of cases where allowing Oracle to do its own thing could well be a sensible, and adequate, production strategy. As a general guideline, I&#8217;ve only found height-balanced histograms to be reasonably safe when you use them to \u201ctake out\u201d a very small number of high frequency values \u2013 in effect allowing the optimizer to optimize for the remaining \u201cinteresting\u201d data. 
There are, however, a couple of design errors where you can use histograms as a damage-limitation mechanism.<\/p>\n<p>I had been planning to make some comments about how to incorporate histogram collection into a complete stats collection strategy \u2013 but this article became longer than I expected, so I&#8217;m going to postpone my thoughts on the actual stats collection implementation to another day.<\/p>\n<h2>Database-agnostic applications<\/h2>\n<p>There are applications in the wild that have been written to run against \u201cany\u201d database, and they try to minimize their dependencies on any features of any given database. As a consequence, \u201cruns on any database\u201d tends to mean \u201cruns badly on any database\u201d.<\/p>\n<p>From a pure data perspective, two of the most common aspects of agnosticism are:<\/p>\n<ul>\n<li>We do not use NULL \u2013 we use a special (extreme) value.<\/li>\n<li>We do not use date datatypes.<\/li>\n<\/ul>\n<p>The side-effects on Oracle\u2019s optimizer can be catastrophic.<\/p>\n<h2>We do not use NULL<\/h2>\n<p>Here\u2019s a script to create a simple data set to make the point. The data represents 5 years of data (starting on 1<sup>st<\/sup> Jan 2008) at 100 rows per day, with every 1000<sup>th<\/sup> row set to the \u201cpseudo-null\u201d value of 31<sup>st<\/sup> Dec 4000. After creating the data I gathered statistics without generating any histograms, i.e. 
using <em>method_opt =&gt; \u2018for all columns size 1\u2019<\/em><\/p>\n<pre>\r\ncreate table t1\r\nas\r\nwith generator as (\r\n     select\u00a0\u00a0\u00a0\u00a0 --+ materialize\r\n          rownum \u00a0\u00a0\u00a0 id\r\n     from all_objects\r\n     where rownum &lt;= 2000\r\n)\r\nselect\r\n     \/*+ ordered use_nl(v2) *\/\r\n     decode(\r\n          mod(rownum - 1,1000),\r\n               0,to_date('31-Dec-4000'),\r\n                 to_date('01-Jan-2008') + trunc((rownum - 1)\/100)\r\n     )\u00a0\u00a0\u00a0 date_closed\r\nfrom\r\n     generator\u00a0 v1,\r\n     generator\u00a0 v2\r\nwhere\r\n     rownum &lt;= 1827 * 100\r\n;\r\n<\/pre>\n<p>I&#8217;m not going to worry about indexes or details of optimum access paths; all I&#8217;m interested in at present is the optimizer\u2019s cardinality calculation, and how close it gets to the right answer. Based on the SQL you can see that a query for \u201call the data for 2010\u201d should return 36,500 rows \u2013 so let\u2019s write a query that asks for that data, and check the resulting execution plan for its row prediction:<\/p>\n<pre>\r\nset autotrace traceonly explain\r\n\r\nselect\r\n     *\r\nfrom t1\r\nwhere date_closed between to_date('01-Jan-2010','dd-mon-yyyy')\r\n               and\u00a0\u00a0\u00a0\u00a0 to_date('31-Dec-2010','dd-mon-yyyy')\r\n;\r\n\r\n----------------------------------------------------------\r\n| Id\u00a0 | Operation\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | Name | Rows\u00a0 | Bytes | Cost\u00a0 |\r\n----------------------------------------------------------\r\n|\u00a0\u00a0 0 | SELECT STATEMENT\u00a0 |\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0 291 |\u00a0 2328 |\u00a0\u00a0\u00a0 52 |\r\n|*\u00a0 1 |\u00a0 TABLE ACCESS FULL| T1\u00a0\u00a0 |\u00a0\u00a0 291 |\u00a0 2328 |\u00a0\u00a0\u00a0 52 |\r\n----------------------------------------------------------\r\n\r\nPredicate Information (identified by operation 
id):\r\n---------------------------------------------------\r\n   1 - filter(\"DATE_CLOSED\"&gt;=TO_DATE(' 2010-01-01 00:00:00',\r\n              'syyyy-mm-dd hh24:mi:ss') AND \"DATE_CLOSED\"&lt;=TO_DATE(' 2010-12-31\r\n              00:00:00', 'syyyy-mm-dd hh24:mi:ss'))\r\n<\/pre>\n<p>At 291 rows predicted, the optimizer seems to be out by a factor of 125; in most circumstances an error this large is likely to produce a bad choice of execution plan.<\/p>\n<p>Why has this happened? The basic optimizer model assumes that your data is evenly spread between the low and high values, so Oracle\u2019s estimate (ignoring a couple of details round the edges) is:<\/p>\n<pre>\r\n\u201crows in table with non-null value\u201d * \u201crange you want\u201d \/ \u201chigh value \u2013 low value\u201d\r\n<\/pre>\n<p>In my case this is (approximately):<\/p>\n<pre>\r\n(182700 * 365 \/ 727928) + 200 = 291.6\r\n<\/pre>\n<p>Oracle thinks I want one year out of nearly 2,000 rather than one year out of five (with two corrections of 100 rows, which allow for the end-points on a bounded range).<\/p>\n<p>At this point I could simply write a little piece of PL\/SQL that called <strong><em>dbms_stats.set_column_stats()<\/em><\/strong> to change the high value for the column to something much closer to the \u201cproper\u201d high value. If the volume of outlying data is tiny this would be a reasonable strategy; you just have to remember that a query for the \u201cpseudo-null\u201d would then be so far outside the known range that the optimizer would probably give a cardinality estimate of 1 for that value, no matter how much data really existed.<\/p>\n<p>Rather than hacking the statistics, though, I&#8217;m going to create a histogram on the column. 
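<\/p>\n<p>A <strong><em>dbms_stats<\/em><\/strong> call along the following lines does the job \u2013 the significant detail is the bucket count supplied in <em>method_opt<\/em>:<\/p>\n<pre>\r\nbegin\r\n     dbms_stats.gather_table_stats(\r\n          user,\r\n          't1',\r\n          method_opt =&gt; 'for columns date_closed size 11'\r\n     );\r\nend;\r\n\/\r\n<\/pre>\n<p>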
Typically I would aim for 254 buckets (the maximum in 11g), but as a demonstration of how effective they can be I&#8217;m actually going to show you the effect of creating a histogram with just 11 buckets (2 per year, plus one spare) \u2013 this is how the plan changes:<\/p>\n<pre>\r\n----------------------------------------------------------\r\n| Id\u00a0 | Operation\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | Name | Rows\u00a0 | Bytes | Cost\u00a0 |\r\n----------------------------------------------------------\r\n|\u00a0\u00a0 0 | SELECT STATEMENT\u00a0 |\u00a0\u00a0\u00a0\u00a0\u00a0 | 36420 |\u00a0\u00a0 284K|\u00a0\u00a0\u00a0 52 |\r\n|*\u00a0 1 |\u00a0 TABLE ACCESS FULL| T1\u00a0\u00a0 | 36420 |\u00a0\u00a0 284K|\u00a0\u00a0\u00a0 52 |\r\n----------------------------------------------------------\r\n\r\nPredicate Information (identified by operation id):\r\n---------------------------------------------------\r\n\r\n1 - filter(\"DATE_CLOSED\"&gt;=TO_DATE(' 2010-01-01 00:00:00',\r\n           'syyyy-mm-dd hh24:mi:ss') AND \"DATE_CLOSED\"&lt;=TO_DATE(' 2010-12-31\r\n           00:00:00', 'syyyy-mm-dd hh24:mi:ss'))\r\n<\/pre>\n<p>At 36,420 the optimizer\u2019s estimate is just about perfect. When sampling the data, Oracle found that one eleventh of the data was in the range from (roughly) July 2012 to December 4000, and found 10\/11ths of the data spread evenly between Jan 2008 and July 2012, so the histogram gave it the information it needed to \u201cunderstand\u201d what the data really looked like in the date range I wanted (it wouldn&#8217;t be quite so accurate for dates after 20<sup>th<\/sup> July 2012 \u2013 which is why I\u2019d really want to use 254 buckets).<\/p>\n<p><strong>Note: most of the articles you see about histograms describe how they help the optimizer to deal with \u201clarge spikes\u201d or \u201cpopular values\u201d in the data. Histograms can help with ANY pattern in the data that is not boring, flat, unbroken and uniform. 
In this case the histogram told the optimizer about a big gap in the spread of the data.<\/strong><\/p>\n<p>Here\u2019s the content of the view user_tab_histograms for this data set (for dates, the stored value is the Julian version of the date):<\/p>\n<pre>\r\nENDPOINT_NUMBER ENDPOINT_VALUE EPV_CONVERTED\r\n--------------- -------------- -------------\r\n              0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2454467 \u00a0\u00a001-Jan-2008\r\n              1\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2454633 \u00a0\u00a015-Jun-2008\r\n              2\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2454799 \u00a0\u00a028-Nov-2008\r\n              3\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2454965 \u00a0\u00a013-May-2009\r\n              4\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2455132 \u00a0\u00a027-Oct-2009\r\n              5\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2455298 \u00a0\u00a011-Apr-2010\r\n              6\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2455464 \u00a0\u00a024-Sep-2010\r\n              7\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2455630 \u00a0\u00a009-Mar-2011\r\n              8\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2455797 \u00a0\u00a023-Aug-2011\r\n              9\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2455963 \u00a0\u00a005-Feb-2012\r\n             10\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 2456129 \u00a0\u00a020-Jul-2012\r\n             11\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 3182395 \u00a0\u00a031-Dec-4000\r\n<\/pre>\n<h2>We do not use dates<\/h2>\n<p>Some applications store dates either as 8-character strings in the format \u201cYYYYMMDD\u201d, or the numeric equivalent. This, conveniently, allows the dates to be human readable while keeping the correct sort order. 
Here\u2019s a sample data set \u2013 again running from Jan 2008 to Dec 2012, but this time only one row per day, and no \u201cpseudo-nulls\u201d:<\/p>\n<pre>\r\ncreate table t1\r\nas\r\nselect\r\n     d1,\r\n     to_number(to_char(d1,'yyyymmdd'))\u00a0\u00a0\u00a0 n1,\r\n     to_char(d1,'yyyymmdd')\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 v1\r\nfrom (\r\n     select\r\n          to_date('31-Dec-2007') + rownum d1\r\n     from all_objects\r\n     where\r\n          rownum &lt;= 1827\r\n     )\r\n;\r\n<\/pre>\n<p>I&#8217;m going to write something close to the nastiest possible query I can to make a point \u2013 and I&#8217;m going to test it first without a histogram on any of the columns, then with a histogram in place. The query is simply going to ask: \u201chow many days are there between 30<sup>th<\/sup> Dec 2011 and 5<sup>th<\/sup> Jan 2012?\u201d \u2013 the well-informed DBA will work out that the answer is 7 (or possibly 5, or 6, depending on how they interpret \u201cbetween\u201d and dates). 
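<\/p>\n<p>In SQL terms the test takes the following shape against the date column (for the <em>varchar2()<\/em> and numeric columns the predicate simply compares <em>v1<\/em> or <em>n1<\/em> with the equivalent values 20111230 and 20120105):<\/p>\n<pre>\r\nselect\r\n     *\r\nfrom t1\r\nwhere d1 between to_date('30-Dec-2011','dd-mon-yyyy')\r\n          and    to_date('05-Jan-2012','dd-mon-yyyy')\r\n;\r\n<\/pre>\n<p>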
What will the optimizer think?<\/p>\n<p>Here\u2019s the answer when I query the <strong><em>date<\/em><\/strong> column:<\/p>\n<pre>\r\n----------------------------------------------------------\r\n| Id\u00a0 | Operation\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | Name | Rows\u00a0 | Bytes | Cost\u00a0 |\r\n----------------------------------------------------------\r\n|\u00a0\u00a0 0 | SELECT STATEMENT\u00a0 |\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0\u00a0 8 |\u00a0\u00a0 184 |\u00a0\u00a0\u00a0\u00a0 3 |\r\n|*\u00a0 1 |\u00a0 TABLE ACCESS FULL| T1\u00a0\u00a0 |\u00a0\u00a0\u00a0\u00a0 8 |\u00a0\u00a0 184 |\u00a0\u00a0\u00a0\u00a0 3 |\r\n----------------------------------------------------------\r\n\r\nPredicate Information (identified by operation id):\r\n---------------------------------------------------\r\n1 - filter(\"D1\"&gt;=TO_DATE(' 2011-12-30 00:00:00', 'syyyy-mm-dd\r\n           hh24:mi:ss') AND \"D1\"&lt;=TO_DATE(' 2012-01-05 00:00:00', 'syyyy-mm-dd\r\n           hh24:mi:ss'))\r\n<\/pre>\n<p>It\u2019s pretty close \u2013 the error is due to the detail of the optimizer\u2019s handling of ranges, where it adds (1\/num_distinct) to the selectivity to allow for the end-points of bounded ranges.<\/p>\n<p>Then here\u2019s the equivalent (unless you\u2019re running a very old version of Oracle) for the <strong><em>varchar2()<\/em><\/strong> column:<\/p>\n<pre>\r\n----------------------------------------------------------\r\n| Id\u00a0 | Operation\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | Name | Rows\u00a0 | Bytes | Cost\u00a0 |\r\n----------------------------------------------------------\r\n|\u00a0\u00a0 0 | SELECT STATEMENT\u00a0 |\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0 396 |\u00a0 9108 |\u00a0\u00a0\u00a0\u00a0 3 |\r\n|*\u00a0 1 |\u00a0 TABLE ACCESS FULL| T1\u00a0\u00a0 |\u00a0\u00a0 396 |\u00a0 9108 |\u00a0\u00a0\u00a0\u00a0 3 |\r\n----------------------------------------------------------\r\n\r\nPredicate Information 
(identified by operation id):\r\n---------------------------------------------------\r\n\r\n   1 - filter(\"V1\"&gt;='20111230' AND \"V1\"&lt;='20120105')\r\n<\/pre>\n<p>The answer is out by a factor of 79 \u2013 not really acceptable, and likely to produce some bad execution plans. If you try the same query on the numeric column, by the way, you\u2019ll get the same result \u2013 I think there\u2019s a piece of \u201cspecial case\u201d optimizer code that uses the arithmetic for numeric data when handling character data that looks like numeric data (i.e. when the low, high, and predicate values all look like numbers).<\/p>\n<p>Where does the error come from? Oracle can do date arithmetic, so it can calculate the ranges we need accurately: (5<sup>th<\/sup> Jan 2012 \u2013 30<sup>th<\/sup> Dec 2011) = 6 \u2026 and that\u2019s the figure that puts the calculation into the right ballpark. On the other hand, if you look at the 8-digit numeric column: 20120105 \u2013 20111230 = 8,875, you may begin to suspect that some strange cardinalities are likely to appear.<\/p>\n<p>If you think about the data as a series of numbers, you get (roughly) 30 consecutive numbers, then a gap of about 70 as you jump to the next month, and this repeats 11 times, then there\u2019s a gap of about 9,000 as you jump to the next year. 
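<\/p>\n<p>In fact the simple linear model reproduces the optimizer\u2019s estimate of 396 almost exactly \u2013 with 1,827 rows, 1,827 distinct values, a low of 20080101, a high of 20121231, and the usual two end-point corrections, the arithmetic looks like this:<\/p>\n<pre>\r\nselect\r\n     round(\r\n          1827 * (20120105 - 20111230) \/ (20121231 - 20080101)\r\n        + 2        -- two end-point corrections of num_rows\/num_distinct = 1\r\n     ) estimated_rows\r\nfrom dual;\r\n\r\nESTIMATED_ROWS\r\n--------------\r\n           396\r\n<\/pre>\n<p>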
The optimizer\u2019s cardinality predictions are going to be all over the place if it thinks your data is spread evenly between 20080101 and 20121231.<\/p>\n<p>So create a histogram &#8211; I picked 120 buckets this time; we really need quite a lot of buckets to capture the detail around every single month (and on a production system I\u2019d use 254) \u2013 and let\u2019s see the effect:<\/p>\n<pre>\r\n----------------------------------------------------------\r\n| Id\u00a0 | Operation\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 | Name | Rows\u00a0 | Bytes | Cost\u00a0 |\r\n----------------------------------------------------------\r\n|\u00a0\u00a0 0 | SELECT STATEMENT\u00a0 |\u00a0\u00a0\u00a0\u00a0\u00a0 |\u00a0\u00a0\u00a0 15 |\u00a0\u00a0 345 |\u00a0\u00a0\u00a0\u00a0 3 |\r\n|*\u00a0 1 |\u00a0 TABLE ACCESS FULL| T1\u00a0\u00a0 |\u00a0\u00a0\u00a0 15 |\u00a0\u00a0 345 |\u00a0\u00a0\u00a0\u00a0 3 |\r\n----------------------------------------------------------\r\n\r\nPredicate Information (identified by operation id):\r\n---------------------------------------------------\r\n   1 - filter(\"V1\"&gt;='20111230' AND \"V1\"&lt;='20120105')\r\n<\/pre>\n<p>The results are a lot better \u2013 not perfect, but possibly accurate enough that you get the plan you need automatically. 
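<\/p>\n<p>If you want to see the picture the optimizer is now working from, the bucket boundaries are again visible in <em>user_tab_histograms<\/em> \u2013 though for character columns <em>endpoint_value<\/em> holds a numeric encoding of (roughly) the first six bytes of the string, so the output takes a little decoding:<\/p>\n<pre>\r\nselect\r\n     endpoint_number,\r\n     endpoint_value\r\nfrom user_tab_histograms\r\nwhere\r\n     table_name  = 'T1'\r\nand  column_name = 'V1'\r\norder by\r\n     endpoint_number\r\n;\r\n<\/pre>\n<p>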
Again, you might note, we haven\u2019t used a histogram to highlight \u201cspikes\u201d in the data \u2013 every single value appears the same number of times &#8211; we&#8217;ve used the histogram to show the optimizer the gaps in the data as you go from low to high.<\/p>\n<p>It\u2019s worth making one more point about this example \u2013 the ideal solution to this cardinality problem is to create a virtual column (perhaps through a function-based index in earlier versions of Oracle) that converts the stored data to the correct data type e.g.<\/p>\n<pre>\r\nalter table t1\r\nadd (\r\n     d1_virtual generated always as (to_date(v1,'yyyymmdd')) virtual\r\n)\r\n;\r\nbegin\r\n     dbms_stats.gather_table_stats(\r\n          user,\r\n          't1',\r\n          method_opt=&gt;'for columns d1_virtual size 1'\r\n     );\r\nend;\r\n\/\r\n<\/pre>\n<p>Provided you can modify the SQL to reference the virtual column (or match the function-based index) you&#8217;ve got a column of the right data type, with the right statistics to give the optimizer the best chance possible to calculate the correct cardinality.<\/p>\n<h2>Conclusion<\/h2>\n<p>In the absence of histograms the optimizer assumes that your data is evenly spread from the low value to the high value, with no gaps, and no spikes; if you could draw a graph of your data and see a continuous flat line then that\u2019s the image the optimizer has of your data. If a graph of your data shows big gaps, or a pattern that swings a long way from a flat line, then a histogram may help although, as we saw in part 2, they can be very sensitive to small changes and introduce instability to your execution paths.<\/p>\n<p>There are a couple of scenarios where histograms can be used to address poor application design \u2013 using extreme values instead of NULLs and storing data in the wrong data type (most significantly storing dates in character or numeric columns). 
\u00a0In these cases the ideal strategy is to correct the design flaw, but it\u2019s often not possible to change the database structure or application so a histogram may help the optimizer deal with the side effects of the design decision.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In part 2 we looked at the way that Oracle collects and uses height \u2013balanced histograms, and raised the problem of how sensitive they could be to small changes in the sampling, or in the data. We ended the article with a suggestion on how you could generate data for an \u201capproximate\u201d frequency histogram \u00a0that could be \u201cgood enough\u201d to&hellip;<\/p>\n","protected":false},"author":101205,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143533],"tags":[],"coauthors":[],"class_list":["post-73222","post","type-post","status-publish","format-standard","hentry","category-oracle-databases"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/73222","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/101205"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=73222"}],"version-history":[{"count":1,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/73222\/revisions"}],"predecessor-version":[{"id":91705,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/73222\/revisions\/91705"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=73222"}],"wp:term":[{"taxonomy":"category","embeddabl
e":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=73222"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=73222"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=73222"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}