{"id":73136,"date":"2016-01-26T15:16:24","date_gmt":"2016-01-26T15:16:24","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/uncategorized\/12c-hybrid-histogram\/"},"modified":"2021-07-14T13:07:18","modified_gmt":"2021-07-14T13:07:18","slug":"12c-hybrid-histogram","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/oracle-databases\/12c-hybrid-histogram\/","title":{"rendered":"12c Hybrid histogram"},"content":{"rendered":"<p>In the previous article about the <a href=\"https:\/\/allthingsoracle.com\/12c-histogram-top-frequency\/\">Oracle 12c TOP-Frequency<\/a> histogram, we saw that Oracle carries out a preliminary determination step to see how dominant the TOP-N values are with regards to the total number of rows in\u00a0the data set. I demonstrated that a TOP-Frequency is possible provided a threshold condition is satisfied. This second article explores the Hybrid histogram type which occurs when the data set fails to qualify for a TOP-Frequency. Firstly, it shows that the method used by Oracle to compute a Hybrid histogram is strongly related to the TOP-Frequency preliminary determination step. It then shows that a column value can be <strong>popular<\/strong>, <strong>non-popular with an endpoint value<\/strong> or <strong>non-popular without an endpoint value<\/strong>. It then examines the formulas used by the optimizer to compute its cardinality estimation for each of the above three cases.<\/p>\n<h2>Natural or Adjusted?<\/h2>\n<p>In order to investigate the method used by Oracle to calculate a Hybrid histogram, I am going to use a data set example taken from <a href=\"https:\/\/jonathanlewis.wordpress.com\/2013\/10\/09\/12c-histograms-pt-3\/\">Jonathan Lewis<\/a>&#8216; blog as shown below:<\/p>\n<pre>SQL&gt; create table t1 (id number, n1 number);\r\n\r\nSQL&gt; start InsT1.sqlScript (<a href=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/oracle\/2016\/01\/InsT1.zip\" rel=\"\">see downloadable script at the end<\/a>)<\/pre>\n<p>In the following, I am going to gather Hybrid histogram in 12c (<span style=\"color: red;\">12.1.0.2.0<\/span>) by specifying fewer buckets (20) than distinct value (37) for the current data set and using default sampling:<\/p>\n<pre>SQL&gt; exec dbms_stats.set_global_prefs ('TRACE', to_char (1+16)); \r\nSQL&gt; BEGIN\r\n        dbms_stats.gather_table_stats\r\n          (user, 't1', method_opt =&gt; 'for columns n1 size <span style=\"color: red;\">20<\/span>');\r\n     END;\r\n    \/\r\nSQL&gt; exec dbms_stats.set_global_prefs('TRACE', null);<\/pre>\n<p>As in the previous article, the above call to dbms_stats package has been traced so that we can explore its content in the following snippet:<\/p>\n<pre>DBMS_STATS:  NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR  COLNAME\r\nDBMS_STATS:   Y    Y    Y    Y    Y    Y    Y    Y                   Y    N1\r\nDBMS_STATS: Approximate NDV Options\r\nDBMS_STATS: ACL,TOPN,NIL,NIL,RWID,U,U20U\r\nDBMS_STATS: start processing top n values for column \"N1\"\r\nDBMS_STATS:   &gt;&gt; <span style=\"color: red;\">frequency histograms is not feasible<\/span>\r\n                       <span style=\"color: red;\">(topn_values is null), skip!<\/span>\r\nDBMS_STATS: Iteration: 1 numhist: 1\r\nDBMS_STATS:  NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR  COLNAME\r\nDBMS_STATS:                       Y    Y    Y    Y                   Y    N1\r\nDBMS_STATS: <span style=\"color: red;\">Building Histogram for N1<\/span>\r\nDBMS_STATS:  bktnum=20, nnv=100, snnv=100, sndv=37, est_ndv=37, mnb=20\r\nDBMS_STATS:  Trying hybrid histogram<\/pre>\n<p>As expected, Oracle started by carrying out the usual TOP-Frequency preliminary determination step. It is worth emphasizing that this data set fails to immediately qualify for a TOP-Frequency histogram because the TOP-20 values account for 80 rows while the threshold is 95 rows, as shown below:<\/p>\n<pre>SQL&gt; select\r\n         sum (cnt) TopNRows\r\n    from (select\r\n            n1\r\n           ,count(*) cnt\r\n         from t1\r\n         group by n1\r\n          order by count(*) desc\r\n          )\r\n   where rownum &lt;= 20;\r\n\r\n  TOPNROWS\r\n----------\r\n        80\r\n\r\nSQL&gt; select round ((20-1)\/20 * 100) threshold from dual;\r\n\r\n THRESHOLD\r\n----------\r\n        95<\/pre>\n<p>This is what I tend to qualify as a <strong><em>&#8220;Natural Hybrid&#8221;<\/em><\/strong> histogram representing a data set which doesn&#8217;t qualify for a TOP-Frequency because the threshold is greater than the naturally non-adjusted <code>TopNRows<\/code>, as explained in the previous article. However, what happens for a data set that initially satisfies the TOP-Frequency threshold but\u00a0fails to qualify for a TOP-Frequency histogram because Oracle finds at the middle of the process that the <code>Adjusted TopNRows<\/code> is greater than the threshold. A dbms_stats trace file of such a kind of data set shows a different content:<\/p>\n<pre>DBMS_STATS: start processing top n values for column \"N1\"\r\nDBMS_STATS:   <span style=\"color: red;\">&gt;&gt; frequency histograms is not feasible<\/span>\r\n                       <span style=\"color: red;\">(topn_values is null), skip!<\/span>\r\nDBMS_STATS: <span style=\"color: red;\">start processing top n values for column \"N2\"<\/span>\r\nDBMS_STATS: topn sql (len: 589):\r\nDBMS_STATS: +++ select \/*+  no_parallel(t) no_parallel_index(t) \r\n                           dbms_stats cursor_sharing_exact use_weak_name_resl\r\n                           dynamic_sampling(0) no_monitoring \r\n\t\t\t\t xmlindex_sel_idx_tbl no_substrb_pad  \r\n                        *\/ \r\n\t           substrb(dump(\"N2\",16,0,64),1,240) val\r\n                 ,rowidtochar(rowid) rwid \r\n\t          from \"XXXX\".\"T1\" t \r\n\t\t    where rowid in  \r\n              (chartorowid('AAAJmWAAEAACN1zAAA'),chartorowid('AAAJmWAAEAACN1zAAB')\r\n\t        ,chartorowid('AAAJmWAAEAACN1zAAC'),chartorowid('AAAJmWAAEAACN1zAAF' \r\n\t        ,chartorowid('AAAJmWAAEAACN1zAAG'),chartorowid('AAAJmWAAEAACN1zAAH')\r\n\t        ,chartorowid('AAAJmWAAEAACN1zAAJ'),chartorowid('AAAJmWAAEAACN1zAAU')) \r\n\t        order by \"N2\"\t   \r\nDBMS_STATS: <span style=\"color: red;\">remove last bucket: Typ=2 Len=2: c1,2b add: Typ=2 Len=3: c2,2,1a<\/span>\r\nDBMS_STATS: removal_count: 1 total_nonnull_rows: 200 mnb:  8\r\nDBMS_STATS: <span style=\"color: red;\">Abort top-n histogram, as the addition of min\/max does not preserve the minimum coverage: .85225054 vs. .91<\/span>\r\nDBMS_STATS:  Building Histogram for N2\r\nDBMS_STATS:  Trying hybrid histogram<\/pre>\n<p>While in the first case Oracle immediately abandoned the Top-Frequency path, in the second case it went a little bit deeper before aborting the top-frequency process in favor of a Hybrid histogram. This is what I tend to qualify as the <strong><em>&#8220;Adjusted Hybrid&#8221;<\/em><\/strong> histogram representing a data set that fails to qualify for a TOP-Frequency because Oracle finds that the threshold is greater than the Adjusted TopNRows in the middle of the process.<\/p>\n<p>The method used by Oracle to compute Hybrid histogram depends on whether the data set is of a <em>&#8220;Natural Hybrid&#8221;<\/em> type or of an <em>&#8220;Adjusted Hybrid<\/em>&#8221; one. For the later type Oracle will take profit of the pre-process it has accomplished and transform the aborted top-frequency into a hybrid histogram. In the former case Oracle will use a bucket-frequency strategy.<\/p>\n<p>The data set used in this article qualifies for a <em>&#8220;Natural Hybrid&#8221;<\/em> histogram. This is why the next section explains the method used by Oracle to collect this kind of hybrid histogram. The explanation of the &#8220;<em>Adjusted Hybrid<\/em>&#8221; method will be left to a separate article.<\/p>\n<h2>The method<\/h2>\n<p>Oracle has computed a Hybrid histogram for the current data set using a bucket size of 5 (sample_size\/num_buckets = 100\/20), as shown below:<\/p>\n<pre>SQL&gt; select\r\n       column_name\r\n      ,num_distinct\r\n      ,num_buckets\r\n      ,sample_size\r\n      ,histogram\r\n    from\r\n       user_tab_col_statistics\r\n    where table_name = 'T1'\r\n    and column_name  ='N1';\r\n\r\nCOLUMN_NAME  NUM_DISTINCT NUM_BUCKETS SAMPLE_SIZE HISTOGRAM\r\n------------ ------------ ----------- ----------- -----------\r\nN1                  37          20         100     HYBRID<\/pre>\n<p>At this stage of the article, let&#8217;s label this bucket size as the <strong><em>&#8220;original&#8221;<\/em><\/strong> bucket size. The method used by Oracle to compute this kind of histogram seems to obey the following strategy:<\/p>\n<ol>\n<li>a distinct value should not be found in more than one bucket<\/li>\n<li>a bucket size is allowed to be extended in order to contain all instances of the same distinct value<\/li>\n<\/ol>\n<p>As long as the bucket size can be extended so that no distinct value will span multiple buckets, deciphering the <strong><em>new<\/em><\/strong> size of the extended buckets represents a crucial step in figuring out the method used by Oracle to compute Hybrid histogram. Below is the histogram information stored in the data dictionary:<\/p>\n<pre>SQL&gt; SELECT\r\n        (row_number() over(order by ept_nbr)-1) NumBucket\r\n        ,ept_nbr\r\n        ,ept_act_val\r\n        ,rpt_cnt\r\n        ,ept_nbr - (lag(ept_nbr,1,0) over(order by ept_nbr)) \"new bucket size\"\r\n        ,bucket_size \"original bucket_size\"\r\n    FROM\r\n        (SELECT\r\n             ah.endpoint_number            ept_nbr\r\n            ,ah.endpoint_actual_value      ept_act_val\r\n            ,lag(ah.endpoint_number,1,0) over(order by ah.endpoint_number) ept_lag\r\n            ,ah.endpoint_repeat_count rpt_cnt\r\n            ,at.sample_size\/at.num_buckets bucket_size\r\n         FROM\r\n            user_tab_histograms      ah\r\n           ,user_tab_col_statistics  at\r\n         WHERE ah.table_name  = at.table_name\r\n         AND ah.column_name = at.column_name\r\n         AND ah.table_name  = 'T1'\r\n         AND ah.column_name = 'N1'\r\n       ) ORDER BY ept_nbr;\r\n\r\n NUMBUCKET    EPT_NBR EPT_ACT_VA    RPT_CNT <span style=\"color: red;\">new<\/span> bucket size <span style=\"color: red;\">original<\/span> bucket_size\r\n---------- ---------- ---------- ---------- --------------- --------------------\r\n         0          1 8                   1               1                    5\r\n         1          6 13                  3               5                    5\r\n         2         12 18                  2               6                    5\r\n         3         20 20                  5               8                    5\r\n         4         26 23                  2               6                    5\r\n         5         32 26                  3               6                    5\r\n         6         38 27                  6               6                    5\r\n         7         44 28                  6               6                    5\r\n         8         50 29                  6               6                    5\r\n         9         58 31                  5               8                    5\r\n        10         69 33                  8              11                    5\r\n        11         79 35                  7              10                    5\r\n***     12         86 38                  5               7                    5\r\n        13         90 41                  1               4                    5\r\n        14         92 42                  2               2                    5\r\n        15         95 43                  3               3                    5\r\n        16         96 44                  1               1                    5\r\n        17         97 45                  1               1                    5\r\n        18         98 46                  1               1                    5\r\n        19        100 59                  1               2                    5 \r\n20 rows selected.<\/pre>\n<p>Excluding the boundaries, and up to the 13<sup>th<\/sup> endpoint number 86, all <strong><em>new<\/em><\/strong> bucket sizes are <strong>greater or equal<\/strong> to the <strong><em>original<\/em><\/strong> bucket size. In order to have a clue about what&#8217;s happening behind the scenes, let&#8217;s draw an ordered picture of the data set divided into 20 buckets each of size 5 as shown below:<\/p>\n<pre>8 12 12 13 13 |<span style=\"color: red;\">13<\/span> 15 16 16 17 |18 18 19 19 19|20 20 20 20 20|21 22 22 22 23|<span style=\"color: red;\">23<\/span> 24 24 25 26|\r\n\r\n<span style=\"color: red;\">26<\/span> 26 27 27 27|<span style=\"color: red;\">27<\/span> 27 27 28 28 |<span style=\"color: red;\">28<\/span> 28 28 28 29|29 29 29 29 29|30 30 30 31 31|31 31 31 32 32|\r\n\r\n<span style=\"color: red;\">32<\/span> 33 33 33 33|<span style=\"color: red;\">33<\/span> 33 33 33 34 |<span style=\"color: red;\">34<\/span> 34 35 35 35|<span style=\"color: red;\">35<\/span> 35 35 35 36|37 38 38 38 38|<span style=\"color: red;\">38<\/span> 39 39 40 41|\r\n\r\n42 42 43 43 43|44 45 46 50 59|<\/pre>\n<p>Oracle starts walking the above ordered data set until it finds that value <strong><span style=\"color: red;\">13<\/span> <\/strong>traverses two buckets. Based on the first rule above, Oracle extends bucket n\u00b01 by moving all instances of <strong><span style=\"color: red;\">13<\/span><\/strong> from bucket_id n\u00b02 into it, giving us these intermediary results:<\/p>\n<pre>8 12 12 13 13 <span style=\"color: red;\">13<\/span>| 15 16 16 17 |18 18 19 19 19|<\/pre>\n<p>Note that 13 is not only an endpoint number of bucket n\u00b01 but since it is uniquely located into a single bucket, Oracle is able to count its frequency of 3 (endpoint_repeat_count) as shown in the histogram information reproduced below:<\/p>\n<pre>NUMBUCKET    EPT_NBR EPT_ACT_VA    RPT_CNT <span style=\"color: red;\">new<\/span> bucket size <span style=\"color: red;\">original<\/span> bucket_size\r\n---------- ---------- ---------- ---------- --------------- --------------------\r\n         1          6 <span style=\"color: red;\">13<\/span>                  <span style=\"color: red;\"><strong>3<\/strong><\/span>               5                    5\r\n         2         12 18                  2               6                    5<\/pre>\n<p>However, the above histogram information is showing that bucket n\u00b02 is of size 6 and has 18 as its endpoint number, with a frequency of 2. This is not in accordance with this picture:<\/p>\n<pre>8 12 12 13 13 <span style=\"color: red;\">13<\/span>| 15 16 16 17 |18 18 19 19 19|<\/pre>\n<p>Looking at this new situation we can spot that when jumping from bucket n\u00b02 to bucket n\u00b03 there are no distinct values crossing multiple buckets. So we could keep this bucket with its new size of 4. This, however, is not the case. In fact, it appears that there is a third rule used by Oracle when generating &#8220;<em>Natural Hybrid<\/em>&#8221; histogram which is:<\/p>\n<ol start=\"3\">\n<li>do not create buckets with a size less than the original size (5) \u2013not applicable at the end of the data set\u2013<\/li>\n<\/ol>\n<p>As long as the above rule is applied, with bucket n\u00b02 being of size 4, it has to be extended to reach a size of 5. This is why the first instance from bucket n\u00b03 (18) has to be moved into bucket n\u00b02 as in the following:<\/p>\n<pre>8 12 12 13 13 13| 15 16 16 17 18| 18 19 19 19|<\/pre>\n<p>However, since there are two instances of 18 in bucket n\u00b03 they have to move together to obey the first rule in the strategy: no distinct values spanning multiple buckets. This gives us the following situation:<\/p>\n<pre>8 12 12 13\u00a013 <span style=\"color: red;\">13<\/span>| 15 16 16\u00a017 <span style=\"color: red;\">18 18<\/span>| 19 19 19<\/pre>\n<p>As a result of this, we see that we do indeed have a bucket n\u00b02 of size 6, with 18 as its endpoint number and a frequency of 2. We will see later in this article that it is this rule 3) of the may introduce instability in the Hybrid histogram.<\/p>\n<p>Oracle will continue traversing the rest of the data set applying the three point strategy above until it finds that value <span style=\"color: red;\"><strong>23<\/strong><\/span> spans multiple buckets:<\/p>\n<pre>8 12 12 13\u00a013 <span style=\"color: red;\">13<\/span>| 15 16 16\u00a017 <span style=\"color: red;\">18 18<\/span>| 19 19 19 <span style=\"color: red;\">20 20 20 20 20<\/span>|-|21 22 22 22 23|<span style=\"color: red;\">23<\/span> 24 24 25 26|<\/pre>\n<p>Applying the same rule again, all instances of values <span style=\"color: red;\"><strong>23<\/strong><\/span> will be moved from bucket _id n\u00b06 to bucket_id n\u00b05, resulting in the following situation:<\/p>\n<pre>8 12 12 13\u00a013 <span style=\"color: red;\">13<\/span>| 15 16 16\u00a017 <span style=\"color: red;\">18 18<\/span>| 19 19 19 <span style=\"color: red;\">20 20 20 20 20<\/span>|-|21 22 22 22 23 <span style=\"color: red;\">23<\/span>| 24 24 25 26|<\/pre>\n<p>And so on until it reaches the end of the data set producing such a final histogram result:<\/p>\n<pre>8 12 12 13 13 <span style=\"color: red;\">13<\/span>|15 16 16 17 <span style=\"color: red;\">18 18<\/span>|19 19 19 <span style=\"color: red;\">20 20 20 20 20<\/span>|-|21 22 22 22 23 <span style=\"color: red;\">23<\/span>|24 24 25 26 <span style=\"color: red;\">26<\/span> 26|27 27 27 <span style=\"color: red;\">27<\/span> 27 27|28 28 <span style=\"color: red;\">28<\/span> 28 28 28 |29 29 29 29 29 29|-|30 30 30 <span style=\"color: red;\">31 31 31 31 31<\/span>|32 32 <span style=\"color: red;\">32 33 33 33 33 33 33 33 33<\/span>|-|34 <span style=\"color: red;\">34<\/span> 34 35 35 35 <span style=\"color: red;\">35<\/span> 35 35 35|-|36 37 <span style=\"color: red;\">38 38 38 38 38<\/span>|-|39 39 40 41|42 42 |43 43 43|44|45|46|50 59|<\/pre>\n<p>You will notice that, starting from the thirteenth buckets at value 39, the bucket extension rule and minimum size requirement seem to be ignored. At this stage of the data set Oracle is left with 14 distinct values to spread over 8 buckets. This last phrase paved the way to what seems to be the fourth rule applied by Oracle when gathering <em>Natural Hybrid<\/em> histogram:<\/p>\n<ol start=\"4\">\n<li>the original number of buckets (20) should not be reduced<\/li>\n<\/ol>\n<p>Instead of reducing the original number of buckets (20), Oracle seems to divide the number of remaining distinct values (14) by the number of remaining buckets (8) and spread them accordingly. Unfortunately, I didn&#8217;t succeed in working out the exact strategy used by Oracle in this last step to spread the data across the remaining buckets, rather than the one which consists of keeping the initial number of buckets unchanged.<\/p>\n<h2>Popular or non-popular?<\/h2>\n<p>Oracle developers and tuning specialists are interested in having optimal execution plans. When the Optimizer is supplied with accurate statistics it will mostly come up with an optimal execution plan. However, when the optimizer uses a histogram to get its cardinality estimate, the arithmetic it employs for a popular value differs from that used for a non-popular value. This is why identifying the popularity of a column value represents a crucial step in the tuning process. This section of the article shows when a column with a hybrid histogram is considered popular or not and shows the arithmetic used by Oracle to compute the cardinality estimate for each case.<\/p>\n<p>When it comes to a hybrid histogram Oracle considers a value to be popular when its recorded endpoint_repeat_count value exceeds the <em>original<\/em> bucket size. The endpoint_repeat_count is a new 12c histogram column property in which Oracle stores the number of times a distinct value has been repeated in a bucket. Let&#8217;s put this in a SQL format:<\/p>\n<pre>SQL&gt; select\r\n         uth.endpoint_number\r\n        ,uth.endpoint_actual_value\r\n        ,uth.endpoint_repeat_count\r\n        ,ucs.sample_size\/ucs.num_buckets bucket_size      \r\n        ,(uth.endpoint_repeat_count - ucs.sample_size\/ucs.num_buckets) Popularity\r\n    from\r\n        user_tab_histograms uth\r\n       ,user_tab_col_statistics ucs\r\n   where\r\n        uth.table_name   = ucs.table_name\r\n    and uth.column_name   = ucs.column_name\r\n    and uth.table_name    = 'T1'\r\n    and uth.column_name   = 'N1'\r\n   order by uth.endpoint_number;\r\n\r\nENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULARITY\r\n--------------- ---------- --------------------- ----------- ----------\r\n              1 8                              1           5         -4\r\n              6 13                             3           5         -2\r\n             12 18                             2           5         -3\r\n             20 20                             5           5          0\r\n             26 23                             2           5         -3\r\n             32 26                             3           5         -2\r\n             38 27                             6           5          1\r\n             44 28                             6           5          1\r\n             50 29                             6           5          1\r\n             58 31                             5           5          0\r\n             69 33                             8           5          3\r\n             79 35                             7           5          2\r\n             86 38                             5           5          0\r\n             90 41                             1           5         -4\r\n             92 42                             2           5         -3\r\n             95 43                             3           5         -2\r\n             96 44                             1           5         -4\r\n             97 45                             1           5         -4\r\n             98 46                             1           5         -4\r\n            100 59                             1           5         -4\r\n20 rows selected.<\/pre>\n<p>The <strong>Popularity<\/strong> column above represents the difference between the frequency of appearance of each column and the bucket size (sample_size\/num_buckets). It can easily be inferred from this column that a value is considered popular when its <strong>Popularity<\/strong> column value is greater than zero. Refactoring the above query a little gives this:<\/p>\n<pre>SQL&gt; select\r\n         endpoint_number\r\n        ,endpoint_actual_value\r\n        ,endpoint_repeat_count\r\n        ,bucket_size\r\n        ,case when Popularity &gt; 0 then 'Pop'\r\n                   else 'Non-Pop'\r\n          end Popularity\r\n    from\r\n   (\r\n     select\r\n         uth.endpoint_number\r\n        ,uth.endpoint_actual_value\r\n        ,uth.endpoint_repeat_count\r\n        ,ucs.sample_size\/ucs.num_buckets bucket_size      \r\n        ,(uth.endpoint_repeat_count - ucs.sample_size\/ucs.num_buckets) Popularity\r\n    from\r\n        user_tab_histograms uth\r\n       ,user_tab_col_statistics ucs\r\n   where\r\n        uth.table_name   = ucs.table_name\r\n    and uth.column_name   = ucs.column_name\r\n    and uth.table_name    = 'T1'\r\n    and uth.column_name   = 'N1'\r\n    )\r\n   order by endpoint_number;\r\n\r\nENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR\r\n--------------- ---------- --------------------- ----------- -------\r\n              1 8                              1           5 Non-Pop\r\n              6 13                             3           5 Non-Pop\r\n             12 18                             2           5 Non-Pop\r\n             20 20                             5           5 Non-Pop\r\n             26 23                             2           5 Non-Pop\r\n             32 26                             3           5 Non-Pop\r\n             38 27                             6           5 Pop\r\n             44 28                             6           5 Pop\r\n             50 29                             6           5 Pop\r\n             58 31                             5           5 Non-Pop\r\n             69 33                             8           5 Pop\r\n             79 35                             7           5 Pop\r\n             86 38                             5           5 Non-Pop\r\n             90 41                             1           5 Non-Pop\r\n             92 42                             2           5 Non-Pop\r\n             95 43                             3           5 Non-Pop\r\n             96 44                             1           5 Non-Pop\r\n             97 45                             1           5 Non-Pop\r\n             98 46                             1           5 Non-Pop\r\n            100 59                             1           5 Non-Pop\r\n20 rows selected.<\/pre>\n<p>Simply put, a captured Hybrid histogram value is considered to be:<\/p>\n<ul>\n<li>popular when its endpoint_repeat_count is greater than the <strong>original<\/strong> bucket size<\/li>\n<li>non-popular with a recorded endpoint number in the histogram<\/li>\n<li>non-popular and absent from the histogram information<\/li>\n<\/ul>\n<p>Thus, in contrast to its predecessor, the Height Balance histogram, it&#8217;s not sufficient for a Hybrid value to be captured into the histogram with an endpoint_repeat_count greater than 1 in order for it to be considered popular.<\/p>\n<h2>Cardinality estimation<\/h2>\n<p>As stated above the popularity of Hybrid histogram value dictates the arithmetic used by Oracle to calculate its estimation. The preceding section illustrates three types of Hybrid histogram values:<\/p>\n<ul>\n<li>popular value<\/li>\n<li>non-popular value with an endpoint number<\/li>\n<li>non-popular value not present in the histogram table<\/li>\n<\/ul>\n<p>Let&#8217;s start by figuring out how Oracle estimates the cardinality of the first type:<\/p>\n<h3>Popular value<\/h3>\n<p>Below is a list of popular values in the current data set:<\/p>\n<pre>SQL&gt; select\r\n         endpoint_actual_value\r\n        ,endpoint_repeat_count\r\n        ,bucket_size\r\n        ,Popularity\r\nfrom\r\n(\r\n  select\r\n         endpoint_number\r\n        ,endpoint_actual_value\r\n        ,endpoint_repeat_count\r\n        ,bucket_size\r\n        ,case when Popularity &gt; 0 then 'Pop'\r\n                   else 'Non-Pop'\r\n          end Popularity\r\n    from\r\n   (\r\n     select\r\n         uth.endpoint_number\r\n        ,uth.endpoint_actual_value\r\n        ,uth.endpoint_repeat_count\r\n        ,ucs.sample_size\/ucs.num_buckets bucket_size      \r\n        ,(uth.endpoint_repeat_count - ucs.sample_size\/ucs.num_buckets) Popularity\r\n    from\r\n        user_tab_histograms uth\r\n       ,user_tab_col_statistics ucs\r\n   where\r\n        uth.table_name   = ucs.table_name\r\n    and uth.column_name   = ucs.column_name\r\n    and uth.table_name    = 'T1'\r\n    and uth.column_name   = 'N1'\r\n    )\r\n  )\r\nwhere Popularity = 'Pop';\r\n\r\nENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR\r\n---------- --------------------- ----------- -------\r\n27                             6           5 Pop\r\n28                             6           5 Pop\r\n29                             6           5 Pop\r\n<span style=\"color: red;\">33<\/span>                             8           5 Pop\r\n35                             7           5 Pop\r\n\r\n5 rows selected.<\/pre>\n<p>Let&#8217;s get the Oracle optimizer cardinality estimation using one of the above values:<\/p>\n<pre>SQL&gt; explain plan for select count(1) from t1 where n1 = <span style=\"color: red;\">33<\/span>;\r\n\r\nSQL&gt; select * from table(dbms_xplan.display);\r\n-------------------------------------------\r\n| Id  | Operation          | Name | Rows  |\r\n-------------------------------------------\r\n|   0 | SELECT STATEMENT   |      |     1 |\r\n|   1 |  SORT AGGREGATE    |      |     1 |\r\n|*  2 |   TABLE ACCESS FULL| T1   |     <span style=\"color: red;\">8<\/span> |\r\n-------------------------------------------\r\n\r\nPredicate Information (identified by operation id):\r\n---------------------------------------------------\r\n   2 - filter(\"N1\"=33)<\/pre>\n<p>In order to get the above cardinality estimation, Oracle uses the following formula:<\/p>\n<pre>E-Rows (12c) = ENDPOINT_REPEAT_COUNT * num_rows\/sample_size\r\nE-Rows (12c) = 8 * 100\/100 = <span style=\"color: red;\">8<\/span><\/pre>\n<h3>Non-Popular value with an endpoint number<\/h3>\n<p>For the data set considered in this article, the following values are non-popular values having an endpoint number:<\/p>\n<pre>SQL&gt; select\r\n         endpoint_actual_value\r\n        ,endpoint_repeat_count\r\n        ,bucket_size\r\n        ,Popularity\r\nfrom\r\n(\r\n  select\r\n         endpoint_number\r\n        ,endpoint_actual_value\r\n        ,endpoint_repeat_count\r\n        ,bucket_size\r\n        ,case when Popularity &gt; 0 then 'Pop'\r\n                   else 'Non-Pop'\r\n          end Popularity\r\n    from\r\n   (\r\n     select\r\n         uth.endpoint_number\r\n        ,uth.endpoint_actual_value\r\n        ,uth.endpoint_repeat_count\r\n        ,ucs.sample_size\/ucs.num_buckets bucket_size      \r\n        ,(uth.endpoint_repeat_count - ucs.sample_size\/ucs.num_buckets) Popularity\r\n    from\r\n        user_tab_histograms uth\r\n       ,user_tab_col_statistics ucs\r\n   where\r\n        uth.table_name   = ucs.table_name\r\n    and uth.column_name   = ucs.column_name\r\n    and uth.table_name    = 'T1'\r\n    and uth.column_name   = 'N1'\r\n    )\r\n  )\r\nwhere Popularity = 'Non-Pop';\r\n\r\nENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR\r\n---------- --------------------- ----------- -------\r\n8                              1           5 Non-Pop\r\n13                             3           5 Non-Pop\r\n18                             2           5 Non-Pop\r\n20                             5           5 Non-Pop\r\n23                             2           5 Non-Pop\r\n26                             3           5 Non-Pop\r\n31                             5           5 Non-Pop\r\n38                             5           5 Non-Pop\r\n41                             1           5 Non-Pop\r\n42                             2           5 Non-Pop\r\n<span style=\"color: red;\">43<\/span>                             3           5 Non-Pop\r\n44                             1           5 Non-Pop\r\n<span style=\"color: red;\">45<\/span>                             1           5 Non-Pop\r\n46                             1           5 Non-Pop\r\n59                             1           5 Non-Pop\r\n15 rows selected.<\/pre>\n<p>Let&#8217;s then get the Oracle optimizer cardinality estimation of one of the above non-popular values:<\/p>\n<pre>SQL&gt; explain plan for select count(1) from t1 where n1 = <span style=\"color: red;\">45<\/span>;\r\n\r\nSQL&gt; select * from table(dbms_xplan.display);\r\n--------------------------------------------\r\n| Id  | Operation          | Name | Rows  |\r\n--------------------------------------------\r\n|   0 | SELECT STATEMENT   |      |     1 |\r\n|   1 |  SORT AGGREGATE    |      |     1 |\r\n|*  2 |   TABLE ACCESS FULL| T1   |     <span style=\"color: red;\">2<\/span> |\r\n--------------------------------------------\r\nPredicate Information (identified by operation id):\r\n---------------------------------------------------\r\n   2 - filter(\"N1\"=45)<\/pre>\n<p>In contrast to what you might probably guess, the optimizer calculates the cardinality estimates for a non-popular value having endpoint number using the following different formula:<\/p>\n<pre>E-Rows (12c) = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT\/sample_size)<\/pre>\n<p>The optimizer calculates the <em>NewDensity<\/em> using an internal algorithm based on factors like the number of distinct values, the number of popular buckets, the number of popular values, etc. Nevertheless, using the following select (inspired by <a href=\"http:\/\/www.adellera.it\/blog\/2009\/10\/23\/cbo-newdensity-for-frequency-histograms11g-10204-densities-part-iv\/\">Alberto Dell&#8217; Era<\/a>) we can get a reliable value for the NewDensity:<\/p>\n<pre>SELECT\r\n         BktCnt\r\n        ,PopBktCnt\r\n        ,PopValCnt\r\n        ,NDV\r\n        ,pop_bucketSize\r\n        ,trunc(((BktCnt-PopBktCnt)\/BktCnt)\/(NDV-PopValCnt),10) <span style=\"color: red;\">NewDensity<\/span>\r\n     FROM\r\n        (SELECT\r\n           COUNT(1) PopValCnt,\r\n           SUM(endpoint_repeat_count) PopBktCnt,\r\n           ndv,\r\n           BktCnt,\r\n           pop_bucketSize\r\n         FROM\r\n          (SELECT\r\n            (sample_size - num_nulls) BktCnt,\r\n            num_distinct ndv,\r\n            num_buckets,\r\n            density OldDensity,\r\n            (sample_size-num_nulls)\/num_buckets pop_bucketSize\r\n          FROM user_tab_col_statistics\r\n          WHERE\r\n              table_name  = 'T1'\r\n          AND column_name = 'N1'\r\n          ),\r\n          user_histograms\r\n        WHERE table_name         = 'T1'\r\n        AND column_name          = 'N1'\r\n        AND endpoint_repeat_count&gt; pop_bucketSize\r\n        GROUP BY ndv,\r\n          BktCnt,\r\n          pop_bucketSize\r\n        );\r\n    BKTCNT  POPBKTCNT  POPVALCNT        NDV POP_BUCKETSIZE NEWDENSITY\r\n---------- ---------- ---------- ---------- -------------- ----------\r\n       100         33          5         37              5   <span style=\"color: red;\">.0209375<\/span><\/pre>\n<p>Using this computed <em>NewDensity<\/em> in the above cardinality estimation formula gives this:<\/p>\n<pre>E-Rows (12c) = num_rows * greatest (<span style=\"color: red;\">NewDensity<\/span>, ENDPOINT_REPEAT_COUNT\/ sample_size)\r\nE-Rows (12c) = 100 * greatest (.0209375, 1\/100) = <span style=\"color: red;\">2.09375<\/span> ~ <span style=\"color: red;\">2<\/span><\/pre>\n<p>The corresponding 10053 CBO trace backs the above arithmetic as shown in the following:<\/p>\n<pre>=====================================\r\nAccess path analysis for T1\r\n***************************************\r\nSINGLE TABLE ACCESS PATH \r\n  Single Table Cardinality Estimation for T1[T1] \r\n  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE\r\nColumn (#2): \r\n<span style=\"color: red;\">NewDensity<\/span>:<span style=\"background-color: yellow;\"> 0.020938<\/span>, OldDensity: 0.026167 <span style=\"color: red;\">BktCnt: 100<\/span>, <span style=\"color: red;\">PopBktCnt: 33<\/span>, <span style=\"color: red;\">PopValCnt: 5<\/span>, <span style=\"color: red;\">NDV: 37<\/span>\r\n\r\nColumn (#2): N1(NUMBER)\r\n    AvgLen: 3 NDV: 37 Nulls: 0 Density: 0.020938 Min: 8.000000 Max: 59.000000\r\n    Histogram: <span style=\"background-color: yellow;\">Hybrid<\/span>  #Bkts: 20  UncompBkts: 100  EndPtVals: 20  ActualVal: yes\r\n  Table: T1 Alias: T1\r\n   Card: Original: 100.  Rounded: <span style=\"color: red;\">2<\/span> Computed: <span style=\"background-color: yellow;\">2.093750<\/span> Non Adjusted: 2.093750\r\n  \r\n  Best:: AccessPath: TableScan\r\n         Cost: 3.001739 Degree: 1 Resp: 3.001739 Card: 2.093750 Bytes: 0.000000<\/pre>\n<p>Note how this trace file shows the same information (<code>NewDensity, BktCnt, PopBktCnt, PopValCnt<\/code> and <code>NDV<\/code>) we have selected above when calculating the NewDensity. The same trace file also shows that the computed cardinality is exactly equal to the one we&#8217;ve calculated using the supplied formula.<\/p>\n<p>In order to give more credit to this formula let&#8217;s apply it to get the cardinality estimation of another non-popular value having an endpoint_repeat_count of 3 which is greater than 1 as shown below:<\/p>\n<pre>SQL&gt; explain plan for select count(1) from t1 where n1 = <span style=\"color: red;\">43<\/span>;\r\n\r\nSQL&gt; select * from table(dbms_xplan.display);\r\n-------------------------------------------\r\n| Id  | Operation          | Name | Rows  |\r\n-------------------------------------------\r\n|   0 | SELECT STATEMENT   |      |     1 |\r\n|   1 |  SORT AGGREGATE    |      |     1 |\r\n|*  2 |   TABLE ACCESS FULL| T1   |     <span style=\"color: red;\">3<\/span> |\r\n-------------------------------------------\r\n\r\nPredicate Information (identified by operation id):\r\n---------------------------------------------------\r\n   2 - filter(\"N1\"=43)<\/pre>\n<pre>E-Rows (12c) = num_rows * greatest (<span style=\"color: red;\">NewDensity<\/span>, ENDPOINT_REPEAT_COUNT\/ sample_size)\r\nE-Rows (12c) = 100 * greatest (.0209375, 3\/100) = <span style=\"color: red;\">3<\/span><\/pre>\n<h3>Non-Popular value without an endpoint number<\/h3>\n<p>In order to calculate the cardinality estimates of a hybrid column value that has not been captured into the histogram, the Optimizer uses the following formula:<\/p>\n<pre>E-Rows (12c) = num_rows * NewDensity = 100 * .0209375 = 2.09375 ~ <span style=\"color: red;\">2<\/span><\/pre>\n<p>The code below illustrates this formula applied to one of the non-captured values:<\/p>\n<pre>SQL&gt; explain plan for select count(1) from t1 where n1 = <span style=\"color: red;\">17<\/span>;\r\n\r\nSQL&gt; select * from table(dbms_xplan.display);\r\n-------------------------------------------\r\n| Id  | Operation          | Name | Rows  |\r\n-------------------------------------------\r\n|   0 | SELECT STATEMENT   |      |     1 |\r\n|   1 |  SORT AGGREGATE    |      |     1 |\r\n|*  2 |   TABLE ACCESS FULL| T1   |     <span style=\"color: red;\">2<\/span> |\r\n-------------------------------------------\r\n\r\nPredicate Information (identified by operation id):\r\n---------------------------------------------------\r\n   2 - filter(\"N1\"=17)<\/pre>\n<p>As the above formula suggests, <strong>all<\/strong> non-popular hybrid histogram values that have not been captured will <strong>always<\/strong> have the <strong>same constant<\/strong> cardinality estimation whatever their real count is in the table.<\/p>\n<h2>Instability<\/h2>\n<p>We stated above that when applying rule 3) to ensure the minimum bucket size of 5 when building its Hybrid histogram, Oracle introduces a certain instability. With this actual data set of 100 rows and 37 distinct values, Oracle ended up by making the value 17 as non-popular without an endpoint and 18 as an endpoint number with a endpoint_repeat_count of 2 as shown below:<\/p>\n<pre>ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR\r\n--------------- ---------- --------------------- ----------- -------\r\n              1 8                              1           5 Non-Pop\r\n              6 13                             3           5 Non-Pop\r\n             12 18                             2           5 Non-Pop<\/pre>\n<p>If we add a new instance of 16 the reverse situation will happen:<\/p>\n<pre>SQL&gt; insert into t1 values (2, 16);\r\n\r\nSQL&gt; BEGIN\r\n        dbms_stats.gather_table_stats\r\n          (user, 't1', method_opt =&gt; 'for all columns size 20');\r\n     END;\r\n    \/\r\n\r\nSQL&gt; select\r\n         uth.endpoint_number\r\n        ,uth.endpoint_actual_value\r\n        ,uth.endpoint_repeat_count\r\n    from\r\n       user_tab_histograms uth\r\n       ,user_tables ut\r\n       ,user_tab_col_statistics ucs\r\n   where\r\n        uth.table_name    = 'T1'\r\n       and uth.column_name   = 'N1'\r\n       and uth.table_name    = ut.table_name\r\n       and ut.table_name     = ucs.table_name\r\n       and uth.column_name   = ucs.column_name\r\n      order by uth.endpoint_number;\r\n\r\nENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT\r\n--------------- ---------- ---------------------\r\n              1 8                              1\r\n              6 13                             3\r\n             11 17                             1 ***\r\n             16 19                             3\r\n             21 20                             5\r\n             ...\r\n             99 46                             1\r\n            101 59                             1\r\n20 rows selected.<\/pre>\n<p>See how &#8220;18&#8221; has been excluded from the histogram information while &#8220;17&#8221; has been incorporated into it. This new situation is the result of the following new data set organization:<\/p>\n<pre>8 12 12 13 13|<span style=\"color: red;\">13<\/span> 15 16 16 16|<span style=\"color: red;\">17<\/span> 18 18 19 19 19<\/pre>\n<p>Which, when passed through the Hybrid histogram crunching strategy gives this:<\/p>\n<pre>8 12 12 13 13 <span style=\"color: red;\">13<\/span>| 15 16 16 16 <span style=\"color: red;\">17<\/span> |18 18 19 19 19<\/pre>\n<p>Add an extra instance of \u201816&#8242; and both 17 and 18 will quit the histogram information while \u201816&#8242; will be recorded with a correct frequency:<\/p>\n<pre>SQL&gt; insert into t1 values (3, 16);\r\n\r\nSQL&gt; BEGIN\r\n      dbms_stats.gather_table_stats\r\n          (user, 't1', method_opt =&gt; 'for all columns size 20');\r\n    END;\r\n \/\r\n\r\nENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT\r\n--------------- ---------- ---------------------\r\n              1 8                              1\r\n              6 13                             3\r\n             11 16                             4 ***\r\n             17 19                             3\r\n             22 20                             5\r\n             ...\r\n            100 46                             1\r\n            102 59                             1\r\n20 rows selected.<\/pre>\n<h2>Conclusion<\/h2>\n<p>Frequency histograms, provided you&#8217;ve collected them at the right moment using a representative sampling, can help the optimizer in getting correct cardinality estimates. However, their internal upper bound limit of 254 distinct values drastically limits their usefulness in favor of the imprecise height balanced histogram which needs, for a distinct value to be properly captured, to have one of its instances appearing into at least two endpoints of two distinct buckets. Oracle 12cR1 has not only extended the internal limit of frequency histogram to 2048 but has implemented two new types of histogram: Top-Frequency and Hybrid. The later histogram type combines the best attributes of its two predecessors; that is the accuracy of frequency histogram by counting the number of times a distinct value repeats itself and the bucket saving space of the height balanced histogram by collapsing all instances of the same distinct value into a single bucket without increasing the original bucket number. The hybrid histogram generates more popular values and therefore a greater chance for the Oracle Optimizer to come up with an optimal execution plan.<\/p>\n<p>Code for the examples:\u00a0<a href=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/oracle\/2016\/01\/InsT1.zip\" rel=\"\">InsT1<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the previous article about the Oracle 12c TOP-Frequency histogram, we saw that Oracle carries out a preliminary determination step to see how dominant the TOP-N values are with regards to the total number of rows in\u00a0the data set. I demonstrated that a TOP-Frequency is possible provided a threshold condition is satisfied. This second article explores the Hybrid histogram ty&hellip;<\/p>\n","protected":false},"author":305852,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143533],"tags":[],"coauthors":[],"class_list":["post-73136","post","type-post","status-publish","format-standard","hentry","category-oracle-databases"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/73136","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/305852"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=73136"}],"version-history":[{"count":1,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/73136\/revisions"}],"predecessor-version":[{"id":91632,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/73136\/revisions\/91632"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=73136"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=73136"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=73136"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=73136"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}