12c Hybrid histogram

In the previous article about the Oracle 12c TOP-Frequency histogram, we saw that Oracle carries out a preliminary determination step to see how dominant the TOP-N values are with regards to the total number of rows in the data set. I demonstrated that a TOP-Frequency is possible provided a threshold condition is satisfied. This second article explores the Hybrid histogram type which occurs when the data set fails to qualify for a TOP-Frequency. Firstly, it shows that the method used by Oracle to compute a Hybrid histogram is strongly related to the TOP-Frequency preliminary determination step. It then shows that a column value can be popular, non-popular with an endpoint value or non-popular without an endpoint value. It then examines the formulas used by the optimizer to compute its cardinality estimation for each of the above three cases.

Natural or Adjusted?

In order to investigate the method used by Oracle to calculate a Hybrid histogram, I am going to use a data set example taken from Jonathan Lewis‘ blog as shown below:

SQL> create table t1 (id number, n1 number);

SQL> start InsT1.sqlScript (<a href="https://www.red-gate.com/simple-talk/wp-content/uploads/oracle/2016/01/InsT1.zip" rel="">see downloadable script at the end</a>)

In the following, I am going to gather Hybrid histogram in 12c (12.1.0.2.0) by specifying fewer buckets (20) than distinct value (37) for the current data set and using default sampling:

SQL> exec dbms_stats.set_global_prefs ('TRACE', to_char (1+16));

SQL> BEGIN

dbms_stats.gather_table_stats

(user, 't1', method_opt => 'for columns n1 size 20');

END;

SQL> exec dbms_stats.set_global_prefs('TRACE', null);

As in the previous article, the above call to dbms_stats package has been traced so that we can explore its content in the following snippet:

DBMS_STATS: NNV NDV AVG MMX HST EP RP NNNP IND CNDV HSTN HSTR COLNAME

DBMS_STATS: Y Y Y Y Y Y Y Y Y N1

DBMS_STATS: Approximate NDV Options

DBMS_STATS: ACL,TOPN,NIL,NIL,RWID,U,U20U

DBMS_STATS: start processing top n values for column "N1"

DBMS_STATS: >> frequency histograms is not feasible

(topn_values is null), skip!

DBMS_STATS: Iteration: 1 numhist: 1

DBMS_STATS: NNV NDV AVG MMX HST EP RP NNNP IND CNDV HSTN HSTR COLNAME

DBMS_STATS: Y Y Y Y Y N1

DBMS_STATS: Building Histogram for N1

DBMS_STATS: bktnum=20, nnv=100, snnv=100, sndv=37, est_ndv=37, mnb=20

DBMS_STATS: Trying hybrid histogram

As expected, Oracle started by carrying out the usual TOP-Frequency preliminary determination step. It is worth emphasizing that this data set fails to immediately qualify for a TOP-Frequency histogram because the TOP-20 values account for 80 rows while the threshold is 95 rows, as shown below:

SQL> select

sum (cnt) TopNRows

from (select

,count(*) cnt

from t1

group by n1

order by count(*) desc

)

where rownum <= 20;

TOPNROWS

----------

SQL> select round ((20-1)/20 * 100) threshold from dual;

THRESHOLD

----------

This is what I tend to qualify as a “Natural Hybrid” histogram representing a data set which doesn’t qualify for a TOP-Frequency because the threshold is greater than the naturally non-adjusted TopNRows, as explained in the previous article. However, what happens for a data set that initially satisfies the TOP-Frequency threshold but fails to qualify for a TOP-Frequency histogram because Oracle finds at the middle of the process that the Adjusted TopNRows is greater than the threshold. A dbms_stats trace file of such a kind of data set shows a different content:

DBMS_STATS: start processing top n values for column "N1"

DBMS_STATS: >> frequency histograms is not feasible

(topn_values is null), skip!

DBMS_STATS: start processing top n values for column "N2"

DBMS_STATS: topn sql (len: 589):

DBMS_STATS: +++ select /*+ no_parallel(t) no_parallel_index(t)

dbms_stats cursor_sharing_exact use_weak_name_resl

dynamic_sampling(0) no_monitoring

xmlindex_sel_idx_tbl no_substrb_pad

substrb(dump("N2",16,0,64),1,240) val

,rowidtochar(rowid) rwid

from "XXXX"."T1" t

where rowid in

(chartorowid('AAAJmWAAEAACN1zAAA'),chartorowid('AAAJmWAAEAACN1zAAB')

,chartorowid('AAAJmWAAEAACN1zAAC'),chartorowid('AAAJmWAAEAACN1zAAF'

,chartorowid('AAAJmWAAEAACN1zAAG'),chartorowid('AAAJmWAAEAACN1zAAH')

,chartorowid('AAAJmWAAEAACN1zAAJ'),chartorowid('AAAJmWAAEAACN1zAAU'))

order by "N2"

DBMS_STATS: remove last bucket: Typ=2 Len=2: c1,2b add: Typ=2 Len=3: c2,2,1a

DBMS_STATS: removal_count: 1 total_nonnull_rows: 200 mnb: 8

DBMS_STATS: Abort top-n histogram, as the addition of min/max does not preserve the minimum coverage: .85225054 vs. .91

DBMS_STATS: Building Histogram for N2

DBMS_STATS: Trying hybrid histogram

While in the first case Oracle immediately abandoned the Top-Frequency path, in the second case it went a little bit deeper before aborting the top-frequency process in favor of a Hybrid histogram. This is what I tend to qualify as the “Adjusted Hybrid” histogram representing a data set that fails to qualify for a TOP-Frequency because Oracle finds that the threshold is greater than the Adjusted TopNRows in the middle of the process.

The method used by Oracle to compute Hybrid histogram depends on whether the data set is of a “Natural Hybrid” type or of an “Adjusted Hybrid” one. For the later type Oracle will take profit of the pre-process it has accomplished and transform the aborted top-frequency into a hybrid histogram. In the former case Oracle will use a bucket-frequency strategy.

The data set used in this article qualifies for a “Natural Hybrid” histogram. This is why the next section explains the method used by Oracle to collect this kind of hybrid histogram. The explanation of the “Adjusted Hybrid” method will be left to a separate article.

The method

Oracle has computed a Hybrid histogram for the current data set using a bucket size of 5 (sample_size/num_buckets = 100/20), as shown below:

SQL> select

column_name

,num_distinct

,num_buckets

,sample_size

,histogram

from

user_tab_col_statistics

where table_name = 'T1'

and column_name ='N1';

COLUMN_NAME NUM_DISTINCT NUM_BUCKETS SAMPLE_SIZE HISTOGRAM

------------ ------------ ----------- ----------- -----------

N1 37 20 100 HYBRID

At this stage of the article, let’s label this bucket size as the “original” bucket size. The method used by Oracle to compute this kind of histogram seems to obey the following strategy:

a distinct value should not be found in more than one bucket
a bucket size is allowed to be extended in order to contain all instances of the same distinct value

As long as the bucket size can be extended so that no distinct value will span multiple buckets, deciphering the new size of the extended buckets represents a crucial step in figuring out the method used by Oracle to compute Hybrid histogram. Below is the histogram information stored in the data dictionary:

SQL> SELECT

(row_number() over(order by ept_nbr)-1) NumBucket

,ept_nbr

,ept_act_val

,rpt_cnt

,ept_nbr - (lag(ept_nbr,1,0) over(order by ept_nbr)) "new bucket size"

,bucket_size "original bucket_size"

FROM

(SELECT

ah.endpoint_number ept_nbr

,ah.endpoint_actual_value ept_act_val

,lag(ah.endpoint_number,1,0) over(order by ah.endpoint_number) ept_lag

,ah.endpoint_repeat_count rpt_cnt

,at.sample_size/at.num_buckets bucket_size

FROM

user_tab_histograms ah

,user_tab_col_statistics at

WHERE ah.table_name = at.table_name

AND ah.column_name = at.column_name

AND ah.table_name = 'T1'

AND ah.column_name = 'N1'

) ORDER BY ept_nbr;

NUMBUCKET EPT_NBR EPT_ACT_VA RPT_CNT new bucket size original bucket_size

---------- ---------- ---------- ---------- --------------- --------------------

0 1 8 1 1 5

1 6 13 3 5 5

2 12 18 2 6 5

3 20 20 5 8 5

4 26 23 2 6 5

5 32 26 3 6 5

6 38 27 6 6 5

7 44 28 6 6 5

8 50 29 6 6 5

9 58 31 5 8 5

10 69 33 8 11 5

11 79 35 7 10 5

*** 12 86 38 5 7 5

13 90 41 1 4 5

14 92 42 2 2 5

15 95 43 3 3 5

16 96 44 1 1 5

17 97 45 1 1 5

18 98 46 1 1 5

19 100 59 1 2 5

20 rows selected.

Excluding the boundaries, and up to the 13^th endpoint number 86, all new bucket sizes are greater or equal to the original bucket size. In order to have a clue about what’s happening behind the scenes, let’s draw an ordered picture of the data set divided into 20 buckets each of size 5 as shown below:

8 12 12 13 13 |13 15 16 16 17 |18 18 19 19 19|20 20 20 20 20|21 22 22 22 23|23 24 24 25 26|

26 26 27 27 27|27 27 27 28 28 |28 28 28 28 29|29 29 29 29 29|30 30 30 31 31|31 31 31 32 32|

32 33 33 33 33|33 33 33 33 34 |34 34 35 35 35|35 35 35 35 36|37 38 38 38 38|38 39 39 40 41|

42 42 43 43 43|44 45 46 50 59|

Oracle starts walking the above ordered data set until it finds that value 13 traverses two buckets. Based on the first rule above, Oracle extends bucket n°1 by moving all instances of 13 from bucket_id n°2 into it, giving us these intermediary results:

1	8 12 12 13 13 <span style="color: red;">13</span>\| 15 16 16 17 \|18 18 19 19 19\|

Note that 13 is not only an endpoint number of bucket n°1 but since it is uniquely located into a single bucket, Oracle is able to count its frequency of 3 (endpoint_repeat_count) as shown in the histogram information reproduced below:

NUMBUCKET EPT_NBR EPT_ACT_VA RPT_CNT new bucket size original bucket_size

---------- ---------- ---------- ---------- --------------- --------------------

1 6 13 3 5 5

2 12 18 2 6 5

However, the above histogram information is showing that bucket n°2 is of size 6 and has 18 as its endpoint number, with a frequency of 2. This is not in accordance with this picture:

1	8 12 12 13 13 <span style="color: red;">13</span>\| 15 16 16 17 \|18 18 19 19 19\|

Looking at this new situation we can spot that when jumping from bucket n°2 to bucket n°3 there are no distinct values crossing multiple buckets. So we could keep this bucket with its new size of 4. This, however, is not the case. In fact, it appears that there is a third rule used by Oracle when generating “Natural Hybrid” histogram which is:

do not create buckets with a size less than the original size (5) –not applicable at the end of the data set–

As long as the above rule is applied, with bucket n°2 being of size 4, it has to be extended to reach a size of 5. This is why the first instance from bucket n°3 (18) has to be moved into bucket n°2 as in the following:

1	8 12 12 13 13 13\| 15 16 16 17 18\| 18 19 19 19\|

However, since there are two instances of 18 in bucket n°3 they have to move together to obey the first rule in the strategy: no distinct values spanning multiple buckets. This gives us the following situation:

1	8 12 12 13 13 <span style="color: red;">13</span>\| 15 16 16 17 <span style="color: red;">18 18</span>\| 19 19 19

As a result of this, we see that we do indeed have a bucket n°2 of size 6, with 18 as its endpoint number and a frequency of 2. We will see later in this article that it is this rule 3) of the may introduce instability in the Hybrid histogram.

Oracle will continue traversing the rest of the data set applying the three point strategy above until it finds that value 23 spans multiple buckets:

1	8 12 12 13 13 <span style="color: red;">13</span>\| 15 16 16 17 <span style="color: red;">18 18</span>\| 19 19 19 <span style="color: red;">20 20 20 20 20</span>\|-\|21 22 22 22 23\|<span style="color: red;">23</span> 24 24 25 26\|

Applying the same rule again, all instances of values 23 will be moved from bucket _id n°6 to bucket_id n°5, resulting in the following situation:

1	8 12 12 13 13 <span style="color: red;">13</span>\| 15 16 16 17 <span style="color: red;">18 18</span>\| 19 19 19 <span style="color: red;">20 20 20 20 20</span>\|-\|21 22 22 22 23 <span style="color: red;">23</span>\| 24 24 25 26\|

And so on until it reaches the end of the data set producing such a final histogram result:

8 12 12 13 13 13|15 16 16 17 18 18|19 19 19 20 20 20 20 20|-|21 22 22 22 23 23|24 24 25 26 26 26|27 27 27 27 27 27|28 28 28 28 28 28 |29 29 29 29 29 29|-|30 30 30 31 31 31 31 31|32 32 32 33 33 33 33 33 33 33 33|-|34 34 34 35 35 35 35 35 35 35|-|36 37 38 38 38 38 38|-|39 39 40 41|42 42 |43 43 43|44|45|46|50 59|

You will notice that, starting from the thirteenth buckets at value 39, the bucket extension rule and minimum size requirement seem to be ignored. At this stage of the data set Oracle is left with 14 distinct values to spread over 8 buckets. This last phrase paved the way to what seems to be the fourth rule applied by Oracle when gathering Natural Hybrid histogram:

the original number of buckets (20) should not be reduced

Instead of reducing the original number of buckets (20), Oracle seems to divide the number of remaining distinct values (14) by the number of remaining buckets (8) and spread them accordingly. Unfortunately, I didn’t succeed in working out the exact strategy used by Oracle in this last step to spread the data across the remaining buckets, rather than the one which consists of keeping the initial number of buckets unchanged.

Popular or non-popular?

Oracle developers and tuning specialists are interested in having optimal execution plans. When the Optimizer is supplied with accurate statistics it will mostly come up with an optimal execution plan. However, when the optimizer uses a histogram to get its cardinality estimate, the arithmetic it employs for a popular value differs from that used for a non-popular value. This is why identifying the popularity of a column value represents a crucial step in the tuning process. This section of the article shows when a column with a hybrid histogram is considered popular or not and shows the arithmetic used by Oracle to compute the cardinality estimate for each case.

When it comes to a hybrid histogram Oracle considers a value to be popular when its recorded endpoint_repeat_count value exceeds the original bucket size. The endpoint_repeat_count is a new 12c histogram column property in which Oracle stores the number of times a distinct value has been repeated in a bucket. Let’s put this in a SQL format:

SQL> select

uth.endpoint_number

,uth.endpoint_actual_value

,uth.endpoint_repeat_count

,ucs.sample_size/ucs.num_buckets bucket_size

,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity

from

user_tab_histograms uth

,user_tab_col_statistics ucs

where

uth.table_name = ucs.table_name

and uth.column_name = ucs.column_name

and uth.table_name = 'T1'

and uth.column_name = 'N1'

order by uth.endpoint_number;

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULARITY

--------------- ---------- --------------------- ----------- ----------

1 8 1 5 -4

6 13 3 5 -2

12 18 2 5 -3

20 20 5 5 0

26 23 2 5 -3

32 26 3 5 -2

38 27 6 5 1

44 28 6 5 1

50 29 6 5 1

58 31 5 5 0

69 33 8 5 3

79 35 7 5 2

86 38 5 5 0

90 41 1 5 -4

92 42 2 5 -3

95 43 3 5 -2

96 44 1 5 -4

97 45 1 5 -4

98 46 1 5 -4

100 59 1 5 -4

20 rows selected.

The Popularity column above represents the difference between the frequency of appearance of each column and the bucket size (sample_size/num_buckets). It can easily be inferred from this column that a value is considered popular when its Popularity column value is greater than zero. Refactoring the above query a little gives this:

SQL> select

endpoint_number

,endpoint_actual_value

,endpoint_repeat_count

,bucket_size

,case when Popularity > 0 then 'Pop'

else 'Non-Pop'

end Popularity

from

(

select

uth.endpoint_number

,uth.endpoint_actual_value

,uth.endpoint_repeat_count

,ucs.sample_size/ucs.num_buckets bucket_size

,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity

from

user_tab_histograms uth

,user_tab_col_statistics ucs

where

uth.table_name = ucs.table_name

and uth.column_name = ucs.column_name

and uth.table_name = 'T1'

and uth.column_name = 'N1'

)

order by endpoint_number;

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR

--------------- ---------- --------------------- ----------- -------

1 8 1 5 Non-Pop

6 13 3 5 Non-Pop

12 18 2 5 Non-Pop

20 20 5 5 Non-Pop

26 23 2 5 Non-Pop

32 26 3 5 Non-Pop

38 27 6 5 Pop

44 28 6 5 Pop

50 29 6 5 Pop

58 31 5 5 Non-Pop

69 33 8 5 Pop

79 35 7 5 Pop

86 38 5 5 Non-Pop

90 41 1 5 Non-Pop

92 42 2 5 Non-Pop

95 43 3 5 Non-Pop

96 44 1 5 Non-Pop

97 45 1 5 Non-Pop

98 46 1 5 Non-Pop

100 59 1 5 Non-Pop

20 rows selected.

Simply put, a captured Hybrid histogram value is considered to be:

popular when its endpoint_repeat_count is greater than the original bucket size
non-popular with a recorded endpoint number in the histogram
non-popular and absent from the histogram information

Thus, in contrast to its predecessor, the Height Balance histogram, it’s not sufficient for a Hybrid value to be captured into the histogram with an endpoint_repeat_count greater than 1 in order for it to be considered popular.

Cardinality estimation

As stated above the popularity of Hybrid histogram value dictates the arithmetic used by Oracle to calculate its estimation. The preceding section illustrates three types of Hybrid histogram values:

popular value
non-popular value with an endpoint number
non-popular value not present in the histogram table

Let’s start by figuring out how Oracle estimates the cardinality of the first type:

Popular value

Below is a list of popular values in the current data set:

SQL> select

endpoint_actual_value

,endpoint_repeat_count

,bucket_size

,Popularity

from

(

select

endpoint_number

,endpoint_actual_value

,endpoint_repeat_count

,bucket_size

,case when Popularity > 0 then 'Pop'

else 'Non-Pop'

end Popularity

from

(

select

uth.endpoint_number

,uth.endpoint_actual_value

,uth.endpoint_repeat_count

,ucs.sample_size/ucs.num_buckets bucket_size

,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity

from

user_tab_histograms uth

,user_tab_col_statistics ucs

where

uth.table_name = ucs.table_name

and uth.column_name = ucs.column_name

and uth.table_name = 'T1'

and uth.column_name = 'N1'

)

where Popularity = 'Pop';

ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR

---------- --------------------- ----------- -------

27 6 5 Pop

28 6 5 Pop

29 6 5 Pop

33 8 5 Pop

35 7 5 Pop

5 rows selected.

Let’s get the Oracle optimizer cardinality estimation using one of the above values:

SQL> explain plan for select count(1) from t1 where n1 = 33;

SQL> select * from table(dbms_xplan.display);

-------------------------------------------

-------------------------------------------

| 0 | SELECT STATEMENT | | 1 |

| 1 | SORT AGGREGATE | | 1 |

|* 2 | TABLE ACCESS FULL| T1 | 8 |

-------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - filter("N1"=33)

In order to get the above cardinality estimation, Oracle uses the following formula:

1 2	E-Rows (12c) = ENDPOINT_REPEAT_COUNT * num_rows/sample_size E-Rows (12c) = 8 * 100/100 = <span style="color: red;">8</span>

Non-Popular value with an endpoint number

For the data set considered in this article, the following values are non-popular values having an endpoint number:

SQL> select

endpoint_actual_value

,endpoint_repeat_count

,bucket_size

,Popularity

from

(

select

endpoint_number

,endpoint_actual_value

,endpoint_repeat_count

,bucket_size

,case when Popularity > 0 then 'Pop'

else 'Non-Pop'

end Popularity

from

(

select

uth.endpoint_number

,uth.endpoint_actual_value

,uth.endpoint_repeat_count

,ucs.sample_size/ucs.num_buckets bucket_size

,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity

from

user_tab_histograms uth

,user_tab_col_statistics ucs

where

uth.table_name = ucs.table_name

and uth.column_name = ucs.column_name

and uth.table_name = 'T1'

and uth.column_name = 'N1'

)

where Popularity = 'Non-Pop';

ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR

---------- --------------------- ----------- -------

8 1 5 Non-Pop

13 3 5 Non-Pop

18 2 5 Non-Pop

20 5 5 Non-Pop

23 2 5 Non-Pop

26 3 5 Non-Pop

31 5 5 Non-Pop

38 5 5 Non-Pop

41 1 5 Non-Pop

42 2 5 Non-Pop

43 3 5 Non-Pop

44 1 5 Non-Pop

45 1 5 Non-Pop

46 1 5 Non-Pop

59 1 5 Non-Pop

15 rows selected.

Let’s then get the Oracle optimizer cardinality estimation of one of the above non-popular values:

SQL> explain plan for select count(1) from t1 where n1 = 45;

SQL> select * from table(dbms_xplan.display);

--------------------------------------------

--------------------------------------------

| 0 | SELECT STATEMENT | | 1 |

| 1 | SORT AGGREGATE | | 1 |

|* 2 | TABLE ACCESS FULL| T1 | 2 |

--------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - filter("N1"=45)

In contrast to what you might probably guess, the optimizer calculates the cardinality estimates for a non-popular value having endpoint number using the following different formula:

1	E-Rows (12c) = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT/sample_size)

The optimizer calculates the NewDensity using an internal algorithm based on factors like the number of distinct values, the number of popular buckets, the number of popular values, etc. Nevertheless, using the following select (inspired by Alberto Dell’ Era) we can get a reliable value for the NewDensity:

SELECT

BktCnt

,PopBktCnt

,PopValCnt

,NDV

,pop_bucketSize

,trunc(((BktCnt-PopBktCnt)/BktCnt)/(NDV-PopValCnt),10) NewDensity

FROM

(SELECT

COUNT(1) PopValCnt,

SUM(endpoint_repeat_count) PopBktCnt,

ndv,

BktCnt,

pop_bucketSize

FROM

(SELECT

(sample_size - num_nulls) BktCnt,

num_distinct ndv,

num_buckets,

density OldDensity,

(sample_size-num_nulls)/num_buckets pop_bucketSize

FROM user_tab_col_statistics

WHERE

table_name = 'T1'

AND column_name = 'N1'

user_histograms

WHERE table_name = 'T1'

AND column_name = 'N1'

AND endpoint_repeat_count> pop_bucketSize

GROUP BY ndv,

BktCnt,

pop_bucketSize

);

BKTCNT POPBKTCNT POPVALCNT NDV POP_BUCKETSIZE NEWDENSITY

---------- ---------- ---------- ---------- -------------- ----------

100 33 5 37 5 .0209375

Using this computed NewDensity in the above cardinality estimation formula gives this:

1 2	E-Rows (12c) = num_rows * greatest (<span style="color: red;">NewDensity</span>, ENDPOINT_REPEAT_COUNT/ sample_size) E-Rows (12c) = 100 * greatest (.0209375, 1/100) = <span style="color: red;">2.09375</span> ~ <span style="color: red;">2</span>

The corresponding 10053 CBO trace backs the above arithmetic as shown in the following:

=====================================

Access path analysis for T1

***************************************

SINGLE TABLE ACCESS PATH

Single Table Cardinality Estimation for T1[T1]

SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE

Column (#2):

NewDensity: 0.020938, OldDensity: 0.026167 BktCnt: 100, PopBktCnt: 33, PopValCnt: 5, NDV: 37

Column (#2): N1(NUMBER)

AvgLen: 3 NDV: 37 Nulls: 0 Density: 0.020938 Min: 8.000000 Max: 59.000000

Histogram: Hybrid #Bkts: 20 UncompBkts: 100 EndPtVals: 20 ActualVal: yes

Table: T1 Alias: T1

Card: Original: 100. Rounded: 2 Computed: 2.093750 Non Adjusted: 2.093750

Best:: AccessPath: TableScan

Cost: 3.001739 Degree: 1 Resp: 3.001739 Card: 2.093750 Bytes: 0.000000

Note how this trace file shows the same information (NewDensity, BktCnt, PopBktCnt, PopValCnt and NDV) we have selected above when calculating the NewDensity. The same trace file also shows that the computed cardinality is exactly equal to the one we’ve calculated using the supplied formula.

In order to give more credit to this formula let’s apply it to get the cardinality estimation of another non-popular value having an endpoint_repeat_count of 3 which is greater than 1 as shown below:

SQL> explain plan for select count(1) from t1 where n1 = 43;

SQL> select * from table(dbms_xplan.display);

-------------------------------------------

-------------------------------------------

| 0 | SELECT STATEMENT | | 1 |

| 1 | SORT AGGREGATE | | 1 |

|* 2 | TABLE ACCESS FULL| T1 | 3 |

-------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - filter("N1"=43)

1 2	E-Rows (12c) = num_rows * greatest (<span style="color: red;">NewDensity</span>, ENDPOINT_REPEAT_COUNT/ sample_size) E-Rows (12c) = 100 * greatest (.0209375, 3/100) = <span style="color: red;">3</span>

Non-Popular value without an endpoint number

In order to calculate the cardinality estimates of a hybrid column value that has not been captured into the histogram, the Optimizer uses the following formula:

1	E-Rows (12c) = num_rows * NewDensity = 100 * .0209375 = 2.09375 ~ <span style="color: red;">2</span>

The code below illustrates this formula applied to one of the non-captured values:

SQL> explain plan for select count(1) from t1 where n1 = 17;

SQL> select * from table(dbms_xplan.display);

-------------------------------------------

-------------------------------------------

| 0 | SELECT STATEMENT | | 1 |

| 1 | SORT AGGREGATE | | 1 |

|* 2 | TABLE ACCESS FULL| T1 | 2 |

-------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - filter("N1"=17)

As the above formula suggests, all non-popular hybrid histogram values that have not been captured will always have the same constant cardinality estimation whatever their real count is in the table.

Instability

We stated above that when applying rule 3) to ensure the minimum bucket size of 5 when building its Hybrid histogram, Oracle introduces a certain instability. With this actual data set of 100 rows and 37 distinct values, Oracle ended up by making the value 17 as non-popular without an endpoint and 18 as an endpoint number with a endpoint_repeat_count of 2 as shown below:

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR

--------------- ---------- --------------------- ----------- -------

1 8 1 5 Non-Pop

6 13 3 5 Non-Pop

12 18 2 5 Non-Pop

If we add a new instance of 16 the reverse situation will happen:

SQL> insert into t1 values (2, 16);

SQL> BEGIN

dbms_stats.gather_table_stats

(user, 't1', method_opt => 'for all columns size 20');

END;

SQL> select

uth.endpoint_number

,uth.endpoint_actual_value

,uth.endpoint_repeat_count

from

user_tab_histograms uth

,user_tables ut

,user_tab_col_statistics ucs

where

uth.table_name = 'T1'

and uth.column_name = 'N1'

and uth.table_name = ut.table_name

and ut.table_name = ucs.table_name

and uth.column_name = ucs.column_name

order by uth.endpoint_number;

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT

--------------- ---------- ---------------------

1 8 1

6 13 3

11 17 1 ***

16 19 3

21 20 5

...

99 46 1

101 59 1

20 rows selected.

See how “18” has been excluded from the histogram information while “17” has been incorporated into it. This new situation is the result of the following new data set organization:

1	8 12 12 13 13\|<span style="color: red;">13</span> 15 16 16 16\|<span style="color: red;">17</span> 18 18 19 19 19

Which, when passed through the Hybrid histogram crunching strategy gives this:

1	8 12 12 13 13 <span style="color: red;">13</span>\| 15 16 16 16 <span style="color: red;">17</span> \|18 18 19 19 19

Add an extra instance of ‘16′ and both 17 and 18 will quit the histogram information while ‘16′ will be recorded with a correct frequency:

SQL> insert into t1 values (3, 16);

SQL> BEGIN

dbms_stats.gather_table_stats

(user, 't1', method_opt => 'for all columns size 20');

END;

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT

--------------- ---------- ---------------------

1 8 1

6 13 3

11 16 4 ***

17 19 3

22 20 5

...

100 46 1

102 59 1

20 rows selected.

Conclusion

Frequency histograms, provided you’ve collected them at the right moment using a representative sampling, can help the optimizer in getting correct cardinality estimates. However, their internal upper bound limit of 254 distinct values drastically limits their usefulness in favor of the imprecise height balanced histogram which needs, for a distinct value to be properly captured, to have one of its instances appearing into at least two endpoints of two distinct buckets. Oracle 12cR1 has not only extended the internal limit of frequency histogram to 2048 but has implemented two new types of histogram: Top-Frequency and Hybrid. The later histogram type combines the best attributes of its two predecessors; that is the accuracy of frequency histogram by counting the number of times a distinct value repeats itself and the bucket saving space of the height balanced histogram by collapsing all instances of the same distinct value into a single bucket without increasing the original bucket number. The hybrid histogram generates more popular values and therefore a greater chance for the Oracle Optimizer to come up with an optimal execution plan.

Code for the examples: InsT1

Register for Simple Talk

12c Hybrid histogram

Natural or Adjusted?

The method

Popular or non-popular?

Cardinality estimation

Popular value

Non-Popular value with an endpoint number

Non-Popular value without an endpoint number

Instability

Conclusion

About the author

Houri's contributions

Articles

Books

Top topics

Houri's latest contributions:

Oracle Partitioned Index Efficiency

Oracle Data Type Implicit Conversion Hierarchy

12c Histogram: TOP-Frequency

Natural or Adjusted?

The method

Popular or non-popular?

Cardinality estimation

Popular value

Non-Popular value with an endpoint number

Non-Popular value without an endpoint number

Instability

Conclusion

Recommended

About the author

Houri's contributions

Articles

Books

Top topics

Houri's latest contributions:

Oracle Partitioned Index Efficiency

Oracle Data Type Implicit Conversion Hierarchy

12c Histogram: TOP-Frequency