Dynamic Sampling (III) – Real-Life Data

In the previous parts of the series I outlined the basics on Dynamic Sampling. Let’s see how Dynamic Sampling deals with typical real-life data patterns.

As you might be aware, quite often real data follows certain patterns for how data is clustered together, usually related to how the data arrives. This clustering (or scattering) of data within a table can be influenced by the physical organisation of the table, but in the most commonly used Heap Organized Table in Oracle the data can be stored generally in any table block and hence usually follows the natural pattern dictated by the data arrival.

Furthermore quite often the most interesting data is the latest data, and the data your queries might be interested in is usually not spread randomly across the table, but clustered together in a specific area of the table.

If you think about how Dynamic Sampling works by randomly sampling – usually only a very limited number of blocks (commonly something between 32 and 256 blocks) – then it might become obvious that problems dealing with such clustered data patterns can arise, because very likely the few blocks sampled won’t contain a representative sub-set of the data we are looking for.

This means that Dynamic Sampling works well if the data you’re looking for is spread randomly across a table, but by default has problems when the data you’re looking for is either clustered together or simply very rare. In both cases chances are high that the blocks sampled don’t contain a representative sub-set of the data you’re looking for, and the sanity checks built into the code will then kick in and reject the Dynamic Sampling results.

Let’s see a simple test case for how Dynamic Sampling deals with such a data pattern.

The Baseline

What I simulate here is a typical problem of correlated column values – I’m looking for the most recent data that is in a certain status that represents “unprocessed” data. Of course, it is very likely that the most recent data is the data that is yet to be processed, so these two questions basically ask for the same information, but in two different ways.

The example is based on the “parcels” example Jonathan Lewis used in his “Statistics” chapter of the book “Expert Oracle Practices” (used with permission).

create table parcels

with generator as (

select --+ materialize

rownum as id

from

all_objects

where

rownum <= 3000

)

select

rownum as id

, sysdate - trunc((1000000 - rownum)/1000,2) as pickup_date

, case

when rownum > 999000

then 'In-transit'

else 'Delivered'

end as delivery_status

, lpad(rownum,10) as small_vc

, lpad('x',150) as padding

from

generator v1

, generator v2

where

rownum <= 1000000

;

exec dbms_stats.gather_table_stats(null, 'parcels', method_opt => 'for all columns size 1 for columns delivery_status size 2')

There are 1,000 parcels per day for the last 1,000 days, and the last 1,000 parcels are marked as “In-transit”.

We are looking for a good cardinality estimate for the following query:

select

from

parcels pcl

where

pcl.pickup_date >= sysdate - 1

and pcl.delivery_status = 'In-transit'

So both filter predicates identify 1,000 rows each out of 1,000,000 and since I have generated a histogram on the highly skewed “delivery_status” column the optimizer is aware of this. Applying the predicates independently from each other shows that the optimizer’s estimates are in the right ballpark.

The problem however is that applying both together by default results in an estimate of 1 out 1,000,000 since the optimizer doesn’t know about the correlation.

explain plan for

select

from

parcels pcl

where

pcl.pickup_date >= sysdate - 1

;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------

-----------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1003 | 181K| 7305 (2)| 00:01:28 |

|* 1 | TABLE ACCESS FULL| PARCELS | 1003 | 181K| 7305 (2)| 00:01:28 |

-----------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

1 - filter("PCL"."PICKUP_DATE">=SYSDATE@!-1)

explain plan for

select

from

parcels pcl

where

pcl.delivery_status = 'In-transit'

;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------

-----------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 828 | 149K| 7231 (1)| 00:01:27 |

|* 1 | TABLE ACCESS FULL| PARCELS | 828 | 149K| 7231 (1)| 00:01:27 |

-----------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

1 - filter("PCL"."DELIVERY_STATUS"='In-transit')

explain plan for

select

from

parcels pcl

where

pcl.pickup_date >= sysdate - 1

and pcl.delivery_status = 'In-transit'

;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------

-----------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | 185 | 7231 (1)| 00:01:27 |

|* 1 | TABLE ACCESS FULL| PARCELS | 1 | 185 | 7231 (1)| 00:01:27 |

-----------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

1 - filter("PCL"."DELIVERY_STATUS"='In-transit' AND "PCL"."PICKUP_DATE">=SYSDATE@!-1)

Since one of the predicates is based on a non-equal comparison the new “Extended Statistics” of 11g is of no help either, since it works for column groups only if all filters on the column group are based on equal comparisons.

So we are left with Dynamic Sampling. At the session default level of 2 Dynamic Sampling won’t kick in. We need to either increase the session/cursor level (according to the level descriptions, level 4 or higher sound reasonable for correlated column values), or we could ask for an explicit table level Dynamic Sampling. Remember that the number of blocks sampled is defined differently for the “session/cursor” and “table” level Dynamic Sampling (as explained in the previous parts of this series).

Let’s start with session level Dynamic Sampling set to level 4:

set echo on

alter session set tracefile_identifier = 'dyn_samp';

alter session set optimizer_dynamic_sampling = 4;

alter session set events '10053 trace name context forever, level 1';

explain plan for

select

from

parcels pcl

where

pcl.pickup_date >= sysdate - 1

and pcl.delivery_status = 'In-transit'

;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------

-----------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | 185 | 7231 (1)| 00:01:27 |

|* 1 | TABLE ACCESS FULL| PARCELS | 1 | 185 | 7231 (1)| 00:01:27 |

-----------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

1 - filter("PCL"."DELIVERY_STATUS"='In-transit' AND "PCL"."PICKUP_DATE">=SYSDATE@!-1)

As expected and already outlined above that didn’t really help. There is no notice of Dynamic Sampling in the DBMS_XPLAN.DISPLAY output, although we would expect Dynamic Sampling to kick in under these circumstances.

Let’s have a look at the optimizer trace file:

** Dynamic sampling initial checks returning TRUE (level = 4).

*** 2012-05-29 09:10:29.423

** Generated dynamic sampling query:

query text :

SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS IGNORE_WHERE_CLAUSE NO_PARALLEL(SAMPLESUB) opt_param('parallel_execution_enabled', 'false') NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE */ NVL(SUM(C1),0), NVL(SUM(C2),0) FROM (SELECT /*+ IGNORE_WHERE_CLAUSE NO_PARALLEL("PCL") FULL("PCL") NO_PARALLEL_INDEX("PCL") */ 1 AS C1, CASE WHEN "PCL"."PICKUP_DATE">=SYSDATE@!-1 AND "PCL"."DELIVERY_STATUS"='In-transit' THEN 1 ELSE 0 END AS C2 FROM "PARCELS" SAMPLE BLOCK (0.116902 , 1) SEED (1) "PCL") SAMPLESUB

*** 2012-05-29 09:10:29.423

** Executed dynamic sampling query:

level : 4

sample pct. : 0.116902

actual sample size : 912

filtered sample card. : 0

orig. card. : 1000000

block cnt. table stat. : 26518

block cnt. for sampling: 26518

max. sample block cnt. : 32

sample block cnt. : 31

min. sel. est. : 0.00000083

** Not using dynamic sampling for single table sel. or cardinality.

DS Failed for : ----- Current SQL Statement for this session (sql_id=bzacaa129jp2q) -----

So Dynamic Sampling was performed, but the result was rejected due to the fact that the sample size was just 31 blocks and no positive samples were found.

Another try with a higher level that results in a greater sample size, session level 8 in this case:

** Dynamic sampling initial checks returning TRUE (level = 8).

*** 2012-05-29 18:39:56.560

** Generated dynamic sampling query:

query text :

*** 2012-05-29 18:39:57.246

** Executed dynamic sampling query:

level : 8

sample pct. : 3.857757

actual sample size : 36972

filtered sample card. : 0

orig. card. : 1000000

block cnt. table stat. : 26518

block cnt. for sampling: 26518

max. sample block cnt. : 1024

sample block cnt. : 1023

min. sel. est. : 0.00000050

** Not using dynamic sampling for single table sel. or cardinality.

DS Failed for : ----- Current SQL Statement for this session (sql_id=bzacaa129jp2q) -----

We sampled almost four percent of the original table (1023 blocks), but still no luck.

Note that your results might vary, depending on the random blocks sampled – some levels below 10 might produce usable results, most however won’t.

Clearly this looks like bad luck, even higher levels of Dynamic Sampling don’t generate a reasonable estimate, and other options like Extended Statistics won’t help either.

Indexes For Help

But wait, I’ve already mentioned in the previous parts of the series that sometimes Dynamic Sampling can take advantage of indexes. So let’s create such an index:

1	create index parcels_idx on parcels (delivery_status, pickup_date);

Since the index should allow arriving at a very precise cardinality estimate for rare values (less than 2,500 occurrences, more on that later) even with a small sample size let’s repeat the experiment using level 4:

-------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 1 | 185 | 4 (0)| 00:00:01 |

| 1 | TABLE ACCESS BY INDEX ROWID| PARCELS | 1 | 185 | 4 (0)| 00:00:01 |

|* 2 | INDEX RANGE SCAN | PARCELS_IDX | 1 | | 3 (0)| 00:00:01 |

-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

2 - access("PCL"."DELIVERY_STATUS"='In-transit' AND "PCL"."PICKUP_DATE">=SYSDATE@!-1 AND "PCL"."PICKUP_DATE" IS NOT NULL)

Trace file snippet:

** Dynamic sampling initial checks returning TRUE (level = 4).

*** 2012-05-29 18:50:05.158

** Generated dynamic sampling query:

query text :

SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS IGNORE_WHERE_CLAUSE NO_PARALLEL(SAMPLESUB) opt_param('parallel_execution_enabled', 'false') NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE */ NVL(SUM(C1),0), NVL(SUM(C2),0), NVL(SUM(C3),0) FROM (SELECT /*+ IGNORE_WHERE_CLAUSE NO_PARALLEL("PCL") FULL("PCL") NO_PARALLEL_INDEX("PCL") */ 1 AS C1, CASE WHEN "PCL"."DELIVERY_STATUS"='In-transit' AND "PCL"."PICKUP_DATE">=SYSDATE@!-1 THEN 1 ELSE 0 END AS C2, CASE WHEN "PCL"."PICKUP_DATE">=SYSDATE@!-1 AND "PCL"."DELIVERY_STATUS"='In-transit' THEN 1 ELSE 0 END AS C3 FROM "PARCELS" SAMPLE BLOCK (0.116902 , 1) SEED (1) "PCL") SAMPLESUB

*** 2012-05-29 18:50:05.173

** Executed dynamic sampling query:

level : 4

sample pct. : 0.116902

actual sample size : 912

filtered sample card. : 0

filtered sample card. (index PARCELS_IDX): 0

orig. card. : 1000000

block cnt. table stat. : 26518

block cnt. for sampling: 26518

max. sample block cnt. : 32

sample block cnt. : 31

min. sel. est. : 0.00000050

** Not using dynamic sampling for single table sel. or cardinality.

DS Failed for : ----- Current SQL Statement for this session (sql_id=bzacaa129jp2q) -----

Unfortunately this doesn’t look much better: We can tell from the line “filtered sample card. (index PARCELS_IDX): 0” that the code recognized the index, but somehow it didn’t really use it.

In the next part of this series we’ll see if the indexes can be of any help and what further options exist, if any.

Go to Dynamic Sampling (II) – Controlling The Activity

Register for Simple Talk

Dynamic Sampling (III) – Real-Life Data – Part I

The Baseline

Indexes For Help

About the author

Randolf Geist's contributions

Articles

Books

Top topics

Randolf Geist's latest contributions:

Parallel Execution Skew – Skew Caused By Outer Joins

Parallel Execution Skew – Addressing Skew Using Manual Rewrites

Parallel Execution Skew – 12c Hybrid Hash Distribution With Skew Detection

The Baseline

Indexes For Help

Recommended

About the author

Randolf Geist's contributions

Articles

Books

Top topics

Randolf Geist's latest contributions:

Parallel Execution Skew – Skew Caused By Outer Joins

Parallel Execution Skew – Addressing Skew Using Manual Rewrites

Parallel Execution Skew – 12c Hybrid Hash Distribution With Skew Detection