{"id":73223,"date":"2013-10-01T16:04:21","date_gmt":"2013-10-01T16:04:21","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/uncategorized\/histograms-pt-2\/"},"modified":"2021-07-14T13:07:38","modified_gmt":"2021-07-14T13:07:38","slug":"histograms-pt-2","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/oracle-databases\/histograms-pt-2\/","title":{"rendered":"Histograms Part 2"},"content":{"rendered":"<p>In <a title=\"Histograms Part 1 \u2013 Why?\" href=\"https:\/\/allthingsoracle.com\/histograms-part-1-why\/\">part 1<\/a> of this series we discussed the reasons why we might want to create some histograms to help the optimizer deal with skewed data distribution. We used an example of a simple <strong><em>status<\/em><\/strong> column in an <strong><em>orders<\/em><\/strong> table to demonstrate principles, talked about frequency histograms in particular, and highlighted a couple of problems associated with histograms.<\/p>\n<p>In part 2 we are going to spend some time on the only other type of histogram that is available in versions of Oracle up to 11g, the height-balanced histogram. There are a few patterns of data where a height-balanced histogram can be useful, and we\u2019ll be looking at those in part 3; but we\u2019re going to start with an example where our data won\u2019t allow Oracle to create a frequency histogram and a height-balanced histogram can become more of a liability than a benefit; and we\u2019ll describe a strategy (that you will no longer need in 12c) for working around this limitation. 
Along the way we will look at a very simple example demonstrating the algorithm that Oracle uses to generate the contents of a height-balanced histogram.<\/p>\n<h2>A simple example<\/h2>\n<p>Here\u2019s a report of some important client data \u2013 camouflaged, of course \u2013 showing an uneven distribution of values for a particularly important column.<\/p>\n<pre>select\r\n         specifier, count(*)\r\nfrom\r\n         messages\r\ngroup by\r\n         specifier\r\norder by\r\n         count(*) desc\r\n;    -- ca. 10M rows in table\r\n\r\nSPECIFIER   COUNT(*)\r\nBVGFJB     1,851,177\r\nLYYVLH       719,582\r\nMTVMIE       672,823\r\nYETSDP       659,661\r\nDAJYGS       504,641\r\n...\r\nKDCFVJ        75,328\r\nJITCRI        74,104\r\nDNRYKC        70,029\r\nBEWPEQ        68,681\r\n...\r\nJXXXRE             1\r\nOHMNVU             1\r\nYGOBWQ             1\r\nUBBWQH             1\r\n\r\n352 rows selected.<\/pre>\n<p>Clearly I have a very uneven distribution of values here, so a histogram might be quite helpful. 
Unfortunately there are 352 distinct values, which is beyond the 254-value limit allowed by Oracle until 12c, so if I generate a histogram it will have to be a <strong><em>height-balanced<\/em><\/strong> histogram.<\/p>\n<p>To generate a height-balanced histogram Oracle starts by sorting the data into order, then selects the first row and every Nth row (where N = &#8220;number of non-null values in column&#8221; divided by &#8220;number of buckets requested&#8221;). With our sample data we would be selecting roughly every 40,000th row (10M \/ 254) from the sorted data.<\/p>\n<p>Each selected value is referred to as an endpoint value, and each endpoint is labelled with an endpoint number, running from 0 to &#8220;number of buckets&#8221;. Using this algorithm a value that appears sufficiently frequently in the data set will be selected many times, which allows Oracle to recognize it as a &#8220;popular&#8221; value. When storing the histogram selection, Oracle doesn&#8217;t store repetitions of endpoint values \u2013 if a value appears many times in the list it will be stored just once with its highest endpoint number.<\/p>\n<p>To demonstrate the mechanism involved, and the subsequent cardinality arithmetic, I\u2019m going to take the following (ready-sorted) 20 values and build a histogram of 5 buckets:<\/p>\n<pre>5, 6, 6, 6, 9, 11, 11, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 16, 17, 17<\/pre>\n<p>Take the first and every 4th number (20 \/ 5 = 4):<\/p>\n<pre><span style=\"font-style: italic;font-weight: bold;text-decoration: underline\">5<\/span>, 6, 6, <span style=\"font-style: italic;font-weight: bold;text-decoration: underline\">6<\/span>, 9, 11, 11, <span style=\"font-style: italic;font-weight: bold;text-decoration: underline\">12<\/span>, 12, 12, 12, <span style=\"font-style: italic;font-weight: bold;text-decoration: underline\">12<\/span>, 13, 13, 13, <span style=\"font-style: italic;font-weight: bold;text-decoration: underline\">13<\/span>, 13, 16, 17, 
<span style=\"font-style: italic;font-weight: bold;text-decoration: underline\">17<\/span><\/pre>\n<p>Store them, labeling them from zero since we&#8217;ve included the lowest original value in our list:<\/p>\n<pre>0\u00a0\u00a0\u00a0 5\r\n1\u00a0\u00a0\u00a0 6\r\n2\u00a0\u00a0\u00a0 12\r\n3\u00a0\u00a0\u00a0 12\r\n4\u00a0\u00a0\u00a0 13\r\n5\u00a0\u00a0\u00a0 17<\/pre>\n<p>Now eliminate the earlier copies of any duplicated values:<\/p>\n<pre>0\u00a0\u00a0\u00a0 5\r\n1\u00a0\u00a0\u00a0 6\r\n3\u00a0\u00a0\u00a0 12\r\n4\u00a0\u00a0\u00a0 13\r\n5\u00a0\u00a0\u00a0 17<\/pre>\n<p>This is what Oracle would store as the height-balanced histogram on the original data set. From this it would infer that 12 was a popular value (it appeared more than once); but, even though we can see that there are just as many 12s as there are 13s, Oracle would not infer that 13 was a popular value. You might like to consider the effect of changing one of the 11s to 16 \u2013 if you did this then 12 would cease to appear as a popular value and 13 would become a popular value; conversely if you changed a 16 to 11 then neither 12 nor 13 would appear as popular.<\/p>\n<p>As far as cardinality calculations go, Oracle &#8220;counts buckets&#8221; for the popular values \u2013 in our tiny example we picked every 4th row, our \u201cbucket\u201d size is 4. We know that 12 appears twice in the full histogram, so Oracle calculates the cardinality of the predicate <em>\u201cn1 = 12\u201d<\/em> as 8 (i.e. 2 buckets * 4 rows per bucket). 
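The selection, de-duplication and bucket-counting steps can be condensed into a short Python sketch (my own illustration of the mechanism described above, not Oracle's internal code):

```python
# A sketch of the height-balanced histogram mechanism just described,
# applied to the 20-value / 5-bucket example.

data = [5, 6, 6, 6, 9, 11, 11, 12, 12, 12, 12, 12,
        13, 13, 13, 13, 13, 16, 17, 17]        # already sorted
n_buckets = 5
bucket_size = len(data) // n_buckets           # 20 / 5 = 4 rows per bucket

# Endpoint 0 is the lowest value; endpoint n is the (n * bucket_size)th row.
endpoints = [(0, data[0])] + [
    (n, data[n * bucket_size - 1]) for n in range(1, n_buckets + 1)
]   # -> (0,5) (1,6) (2,12) (3,12) (4,13) (5,17)

# Keep each distinct endpoint value once, with its highest endpoint number.
stored = {}
for n, value in endpoints:
    stored[value] = n                          # a later, higher n overwrites
# stored -> {5: 0, 6: 1, 12: 3, 13: 4, 17: 5}

# A value spanning more than one bucket is "popular"; its bucket count is
# the gap between its endpoint number and the previous stored one.
popular = {}
prev_n = 0
for value, n in sorted(stored.items()):
    if n - prev_n > 1:
        popular[value] = n - prev_n
    prev_n = n
# popular -> {12: 2}: only 12 looks popular, even though 13 has just as many rows.

# Cardinality of "n1 = 12" = buckets covered * rows per bucket = 2 * 4 = 8.
cardinality_12 = popular[12] * bucket_size
```

Changing one of the 11s to 16 (or a 16 to 11) in `data` and re-running shows how easily the popular/non-popular classification flips.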
The method the optimizer uses to calculate the cardinality for the values which have not been recognized as popular has varied over time \u2013 in 11.2.0.3 one of the calculations is simply:\u00a0 (number of non-popular rows)\/(number of non-popular values) which works as follows in our example<\/p>\n<table>\n<tbody>\n<tr>\n<td>Popular values<\/td>\n<td>1<\/td>\n<td>(12)<\/td>\n<\/tr>\n<tr>\n<td>Non-popular values<\/td>\n<td>7<\/td>\n<td>(5,6,9,11,13,16,17)<\/td>\n<\/tr>\n<tr>\n<td>Popular rows<\/td>\n<td>8<\/td>\n<td>(2 buckets at 4 rows per bucket)<\/td>\n<\/tr>\n<tr>\n<td>Non-popular rows<\/td>\n<td>12<\/td>\n<td>(20 \u2013 8)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Non-popular cardinality = 12\/7 = 1.714, which rounds up to 2.<\/p>\n<p>There are various complications and variations to the calculations, of course; in particular if Oracle has used a sample to build the histogram the optimizer will need to scale up its calculations by a factor that reflects the difference between the number of relevant rows in the table and the number of rows captured in the histogram.<\/p>\n<p>With this small example in mind, let\u2019s consider the 10M rows and 352 values from my client.<\/p>\n<h2>Precision matters<\/h2>\n<p>We have 10M rows \u2013 and the first thing we need to do to create the height-balanced histogram is to sort the data, so the very first step is fairly resource intensive. If we take a sample this would reduce the volume of data sorted \u2013 but it\u2019s just possible that the sample would miss so many values that Oracle thought it could create a frequency histogram \u2013 and that might result in the optimizer making very low estimates of cardinality for values which actually have tens of thousands of rows.<\/p>\n<p>Having sorted the data, how many of our \u201cpopular\u201d values are likely to appear to be popular by the time we do the selection of every 40,000th row? 
Our most popular value has 1.8M rows, which accounts for 46 buckets; the next three values account for a further 53 buckets, leaving only 155 buckets \u2013 and since it takes two buckets to identify a popular value at all we\u2019re only going to be able to identify at most another 77 popular values \u2013 everything else will be treated as \u201caverage\u201d.<\/p>\n<p>In fact this is one of the really irritating features of the grey area between frequency and height-balanced histograms. A frequency histogram can handle 254 popular values, but if Oracle has to switch to a height-balanced histogram it can\u2019t identify more than 127 (i.e. 254\/2) popular values \u2013 if your data goes a little over the frequency limit your precision can drop dramatically.<\/p>\n<p>Stability is another problem \u2013 in the middle of my list I reported a number of values for which there were about 70,000 rows each. Are these going to be captured in the histogram? Given that a bucket represents 40,000 rows, a value that has 40,001 rows can just appear twice in our selection; on the other hand, a value that has 79,999 rows might just fail to appear twice. All you have to do is add or delete a few rows each day and values will \u201crandomly\u201d appear and disappear from the histogram each time you gather it. When I analysed the data set in detail I came to the following conclusions \u2013 for a 100% sample:<\/p>\n<ul>\n<li>There are 25 values that WILL get captured each day (more than 80,000 rows per value)<\/li>\n<li>There are 35 values that might get captured one day and disappear the next (40,000 to 80,000 rows)<\/li>\n<\/ul>\n<p>Height-balanced histograms can be very unstable \u2013 if any of those 35 values are the rows that users are really interested in then their execution plans may change dramatically every time the histogram is gathered. 
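The arithmetic behind that instability is easy to demonstrate: whether a value in the 40,000-80,000 row band is picked twice depends purely on where its run of rows happens to start in the sorted column. A tiny Python sketch (my own illustration of the selection mechanism, not Oracle code; for simplicity it ignores the extra pick of the very first row):

```python
def times_selected(run_start, run_length, bucket_size=40_000):
    """How many sampled rows (every bucket_size-th row of the sorted
    column) fall inside a run of identical values that occupies row
    positions [run_start, run_start + run_length)?"""
    return (run_start + run_length) // bucket_size - run_start // bucket_size

# A value with 40,001 rows can just appear twice in the selection ...
assert times_selected(39_999, 40_001) == 2
# ... or only once, depending on where its run starts.
assert times_selected(0, 40_001) == 1

# A value with 79,999 rows might just fail to appear twice.
assert times_selected(0, 79_999) == 1
assert times_selected(1, 79_999) == 2

# With at least 80,000 rows (two full buckets) a value is always selected
# at least twice, so it is guaranteed to be recognised as popular.
assert all(times_selected(s, 80_000) >= 2 for s in range(40_000))
```

Adding or deleting a handful of rows shifts every run's start position, which is exactly why those 35 borderline values drift in and out of the histogram from one gather to the next.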
And the randomness of the sample makes things worse if we don\u2019t use 100% of the data to gather the histogram.<\/p>\n<p>But here\u2019s another aspect of analysing the data in detail:<\/p>\n<ul>\n<li>The top 140 values account for 99% of the data.<\/li>\n<li>The top 210 values account for 99.9% of the data.<\/li>\n<li>The top 250 values account for 99.98% of the data.<\/li>\n<li>The last 102 values account for about 2,000 rows from 10M.<\/li>\n<\/ul>\n<p>(Remember, by the way, that we were only going to be able to capture at most 81 (77 + 3 + 1) popular values anyway because of the number of buckets the very popular values will cover.)<\/p>\n<p>We have 352 distinct values in the table, but 102 of them are virtually invisible. Wouldn&#8217;t it be nice if we could create a frequency histogram on the top 250 values (adding in the low and high values for the column, if necessary) and tell Oracle to assume that any other values have an average of about 20 rows each? That\u2019s what 12c does, by the way, with its new <strong><em>top-N<\/em><\/strong> histogram, and it\u2019s what we can do with a few calls to the <strong><em>dbms_stats<\/em><\/strong> package if we really feel that it\u2019s important.<\/p>\n<h2>Faking the Frequency<\/h2>\n<p>Let\u2019s go back to the data set we used in the previous article to demonstrate how to construct a simple frequency histogram when we have a good idea of the values we want to use. 
Remember that we had a status column with a data distribution that looked like this:<\/p>\n<pre>S       COUNT(*)\r\nC        529,100\r\nP            300\r\nR            300\r\nS            300\r\nX        500,000<\/pre>\n<p>Since it\u2019s a very small set of values, and we\u2019re quite happy to say that this set of figures is a \u201cgood enough\u201d model of the data whenever we need to get the optimizer to run our critical queries, it\u2019s very easy to write a procedure to recreate a frequency histogram whenever Oracle\u2019s ordinary stats gathering mechanism creates new stats on this table.<\/p>\n<p>The code depends on three key procedures in the package: get_column_stats(), prepare_column_values() and set_column_stats(), and assumes that table stats (of some description) have already been gathered, since it reads the current stats for the column into local variables, adjusts the variables and then writes them back to the data dictionary.<\/p>\n<pre>declare\r\n\r\n     m_distcnt        number;              -- num_distinct\r\n     m_density        number;              -- density\r\n     m_nullcnt        number;              -- num_nulls\r\n     m_avgclen        number;              -- avg_col_len\r\n     srec             dbms_stats.statrec;  -- stats record\r\n     c_array          
dbms_stats.chararray;       -- varchar2 array\r\n\r\nbegin\r\n\r\n     dbms_stats.get_column_stats(\r\n          ownname          =&gt; user,\r\n          tabname          =&gt; 'orders',\r\n          colname          =&gt; 'status',\r\n          distcnt          =&gt; m_distcnt,\r\n          density          =&gt; m_density,\r\n          nullcnt          =&gt; m_nullcnt,\r\n          srec             =&gt; srec,\r\n          avgclen          =&gt; m_avgclen\r\n     );\r\n\r\n     m_distcnt  := 5;                  -- set my num_distinct\r\n     m_density  := 0.00001;            -- used for \u201cmissing values\u201d\r\n     srec.epc   := 5;                  -- number of buckets in histogram\r\n\r\n     c_array    := dbms_stats.chararray(\r\n                     rpad('C',15),\r\n                     rpad('P',15),\r\n                     rpad('R',15),\r\n                     rpad('S',15),\r\n                     rpad('X',15)\r\n                 );                    -- list of values \u2013 in order\r\n     srec.bkvals     := dbms_stats.numarray(\r\n                          529100,\r\n                             300,\r\n                             300,\r\n                             300,\r\n                          500000\r\n                 
);                    -- corresponding list of frequencies\r\n\r\n     dbms_stats.prepare_column_values(srec, c_array);\r\n\r\n     dbms_stats.set_column_stats(\r\n          ownname          =&gt; user,\r\n          tabname          =&gt; 'orders',\r\n          colname          =&gt; 'status',\r\n          distcnt          =&gt; m_distcnt,\r\n          density          =&gt; m_density,\r\n          nullcnt          =&gt; m_nullcnt,\r\n          srec             =&gt; srec,\r\n          avgclen          =&gt; m_avgclen\r\n     );\r\nend;\r\n\/<\/pre>\n<p>There are a couple of key details that are worth mentioning. First, in the previous article, I said that Oracle uses \u201chalf the least popular frequency\u201d for predicates involving values that are missing from the histogram; it seems that when you use set_column_stats() to create stats, Oracle uses the density to calculate the cardinality of missing values \u2013 which is why I have set the density in this code: I\u2019ve decided that if you query for a missing value its cardinality should be 10. (If you check view <strong><em>user_tab_columns<\/em><\/strong>, the only difference between gathering and setting stats is that the column <strong><em>user_stats<\/em><\/strong> is set to YES when you set stats, and NO when you gather them.)<\/p>\n<p>Secondly, you\u2019ll notice that for my list of values I&#8217;ve right-padded each value to 15 characters with spaces. 
My table definition had <strong><em>status<\/em><\/strong> as a char() column, and Oracle treats char() and varchar2() slightly differently when converting strings to numbers \u2013 for char() columns you must rpad() to 15 characters like this; for varchar2() columns you must not pad. If you get this wrong the calculations will be wrong.<\/p>\n<p>Finally, I have used an array of type dbms_stats.chararray; there are several other array types defined for the dbms_stats package (e.g. numarray, datearray) and you need to choose the type that is appropriate for the column you are hacking.<\/p>\n<h2>Choosing your stats<\/h2>\n<p>So if you know what values to use and their frequencies you can easily set up a fake frequency histogram. If you don\u2019t know the data very well you might want to derive the histogram information from the data. My view on histograms is that you really ought to have only a very small number of them, that they\u2019re generally frequency histograms, and that you usually need to know the data, so I&#8217;ve never written a generic (hence complicated) script to deal with every option; instead I have a framework that I use to create a custom procedure for each column that needs it. 
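Whichever set of figures you settle on, it is worth sanity-checking the arithmetic the faked histogram will drive before writing it back. Here is a rough Python sketch (my own illustration, assuming the status figures and the density of 0.00001 used earlier, and simple linear scaling; the optimizer's real calculations have more special cases):

```python
# Faked frequency histogram: value -> frequency, as loaded into srec.bkvals.
hist = {'C': 529_100, 'P': 300, 'R': 300, 'S': 300, 'X': 500_000}
density = 0.00001              # the value set in m_density above

num_rows = 1_030_000           # user_tables.num_rows for the orders table
num_nulls = 0                  # user_tab_cols.num_nulls for status

# The optimizer scales the histogram to the non-null row count ...
scale = (num_rows - num_nulls) / sum(hist.values())      # 1.0 here

def estimated_cardinality(value):
    if value in hist:
        return hist[value] * scale
    # ... and uses the density for values missing from the histogram.
    return density * (num_rows - num_nulls)

# estimated_cardinality('P') -> 300
# estimated_cardinality('Q') -> about 10: the "missing value" case
```

The 0.00001 density times roughly a million non-null rows is where the "cardinality of 10 for a missing value" figure comes from.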
The core of the procedure is basically a simple SQL statement that identifies the minimum and maximum values and the top N most frequently occurring values along with their frequencies, and some high-level summary information; here\u2019s a sample query with sample output:<\/p>\n<pre>select\r\n     sample_size          sample,\r\n     sum(ct) over()       hist_size,\r\n     count(ct) over()     buckets,\r\n     min_n1,\r\n     max_n1,\r\n     n1,\r\n     ct\r\nfrom (\r\n     select\r\n          sum(ct) over () sample_size,\r\n          min(n1) over()  min_n1,\r\n          max(n1) over()  max_n1,\r\n          n1,\r\n          ct\r\n     from (\r\n          select\r\n               n1,\r\n               count(*)   ct\r\n          from\r\n               t1         -- could use a sample clause here\r\n          group by\r\n               n1\r\n          )\r\n     order by\r\n          ct desc\r\n     )\r\nwhere\r\n     rownum &lt;= 20\r\nor   min_n1 = n1\r\nor   max_n1 = n1\r\norder by\r\n     n1\r\n;\r\n\r\n    SAMPLE  HIST_SIZE    BUCKETS     MIN_N1     MAX_N1         N1         CT\r\n---------- ---------- ---------- ---------- ---------- ---------- ----------\r\n     10000       7062         22        162        239        162          1\r\n     10000       7062         22        162        239        190        233\r\n     10000       7062         22        162        239        191        259\r\n     10000       7062         22        162        239        192        268\r\n     10000       7062         22        162        239        193        300\r\n     10000       7062         22        162        239        194        317\r\n     10000       7062         22        162        239        195        312\r\n     10000       7062         22        162        239        196        394\r\n     10000       7062         22        162        239        197        415\r\n     10000       7062         22        162        239        198        389\r\n     10000       7062         22        162        239        199        406\r\n     10000       7062         22        162        239        200        777\r\n     10000       7062         22        162        239        201        400\r\n     10000       7062         22        162        239        202        405\r\n     10000       7062         22        162        239        203        393\r\n     10000       7062         22        162        239        204        358\r\n     10000       7062         22        162        239        205        315\r\n     10000       7062         22        162        239        206        306\r\n     10000       7062         22        162        239        207        284\r\n     10000       7062         22        162        239        208        276\r\n     10000       7062         22        162        239        209        253\r\n     10000       7062         22        162        239        239          1\r\n\r\n22 rows selected.<\/pre>\n<p>The innermost part of the query is the aggregate by column value that we need, and which we hope will reduce a large data set to a small result set. We then order this by column frequency, using a couple of analytic functions to bring in the values corresponding to the minimum and maximum column values. Then we select the rows corresponding to the top N frequencies and the minimum and maximum column values. Finally we use a couple of analytic functions to count the actual number of buckets we&#8217;ve selected and sum the total volume of data in the histogram, returning the results in order of column value.<\/p>\n<p>The difference between the sample value and the hist_size tells us whether this is likely to be a \u201cgood enough\u201d approximation to the whole data set. The buckets value tells us whether we&#8217;ve had to add the minimum and maximum values to the list or whether they naturally fell into our Top-N, and the min_n1, max_n1 columns tell us the actual minimum and maximum values for the columns. 
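The same selection logic is easy to prototype away from the database. This hypothetical helper (mine, not part of dbms_stats) mirrors the query: keep the N most frequent values, force in the column minimum and maximum, and report the summary figures:

```python
from collections import Counter

def top_n_histogram(values, n=20):
    """Mirror of the SQL sketch: the n most frequent values, plus the
    column minimum and maximum if they didn't make the top n."""
    freq = Counter(values)
    keep = {v for v, _ in freq.most_common(n)}
    keep |= {min(freq), max(freq)}           # force in the low/high values
    hist = {v: freq[v] for v in sorted(keep)}
    return {
        'sample':    len(values),            # SAMPLE
        'hist_size': sum(hist.values()),     # HIST_SIZE
        'buckets':   len(hist),              # BUCKETS
        'hist':      hist,                   # the N1 -> CT pairs
    }

# Tiny demonstration: two popular values plus rare extreme values.
demo = [7] * 50 + [3] * 30 + [9] * 5 + [1] + [12]
h = top_n_histogram(demo, n=2)
# h['buckets'] = 4: the top 2 values (7 and 3) plus forced-in min (1) and max (12)
```

Comparing `hist_size` with `sample` then tells you how much of the data the candidate histogram actually covers.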
The N1 and CT columns give us the details that (with a little manipulation, perhaps) we\u2019re going to put into the two dbms_stats arrays.<\/p>\n<p>Once we have a query like this, we need do little more than change our previous version of the procedure to get rid of the array assignments, replacing them with a cursor for loop:<\/p>\n<pre>for r in ({query}) loop\r\n     n_array.extend;\r\n     srec.bkvals.extend;\r\n     ct := ct + 1;\r\n     n_array(ct)     := r.n1;\r\n     srec.bkvals(ct) := r.ct;\r\nend loop;<\/pre>\n<p>There are a number of little details needed around the edges to make this a complete solution \u2013 initialising variables, deciding on a suitable density, deciding how to handle the case where the minimum and maximum values didn&#8217;t appear in the Top-N, deciding what to do if the volume of data captured is too small compared to the sample size, and so on. Further refinements are left as an exercise for the reader.<\/p>\n<h2>Conclusion<\/h2>\n<p>Sometimes you have to do some fairly subtle things to get the best out of Oracle, and when you\u2019re dealing with histograms you\u2019re in an area where you\u2019re most likely to need some custom mechanisms to get things right \u2013 but probably only for a few special cases.<\/p>\n<p>One special case appears when you have a column that holds more distinct values than Oracle currently allows for a frequency histogram (254) and you find that a height-balanced histogram doesn&#8217;t capture as much information as you would like and, quite possibly, gives you changes in execution plans whenever you gather statistics. 
In a case like this you may decide to construct a \u201cfake\u201d frequency histogram that holds details of the most popular values in the table and also tells the optimizer how to work out a representative cardinality for the rest of the values.<\/p>\n<p>If you know your data well you could set up a procedure that uses a fixed set of numbers for the histogram, otherwise you could query the data to generate a suitable \u201ctop N\u201d set of stats. Whichever you choose to do, you need to remember that the optimizer will scale up the size of the histogram you&#8217;ve generated to match the value of (<strong><em>user_tables.num_rows<\/em><\/strong> \u2013 <strong><em>user_tab_cols.num_nulls<\/em><\/strong>), so you should use this technique only when the popular values represent a very large percentage (perhaps as much as 98%) of the total volume of data.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In part 1 of this series we discussed the reasons why we might want to create some histograms to help the optimizer deal with skewed data distribution. We used an example of a simple status column in an orders table to demonstrate principles, talked about frequency histograms in particular, and highlighted a couple of problems associated with histograms. 
In part&hellip;<\/p>\n","protected":false},"author":101205,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143533],"tags":[],"coauthors":[],"class_list":["post-73223","post","type-post","status-publish","format-standard","hentry","category-oracle-databases"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/73223","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/101205"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=73223"}],"version-history":[{"count":1,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/73223\/revisions"}],"predecessor-version":[{"id":91706,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/73223\/revisions\/91706"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=73223"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=73223"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=73223"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=73223"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}