I’ve closed the introduction to Dynamic Sampling with an odd test case result: Although all table data was read by Dynamic Sampling the cardinality estimate of the optimizer was still way off – not quite what one might expect.
The reason for this odd behaviour can be explained by the fact that Dynamic Sampling behaves differently depending on whether statistics have been gathered on an object or not. I’ll come back to that point later, but first let’s review how the basic configuration of Dynamic Sampling is performed.
I’ve decided to add this part to the series, because the official documentation lacks some details about how to control the Dynamic Sampling activity.
Dynamic Sampling Levels On Session / Cursor Level
As already shown in one of the introductory test cases, the parameter OPTIMIZER_DYNAMIC_SAMPLING controls the system respectively session setting of Dynamic Sampling. There are 11 levels from 0 to 10, where 0 means no Dynamic Sampling at all and 10 means reading the whole segment.
The levels in between have different meanings, and this is where it gets a bit tricky: Depending on how the level is specified it has different implications.
Why? In addition to the OPTIMIZER_DYNAMIC_SAMPLING system / session setting, there is a DYNAMIC_SAMPLING hint that can be used at statement level to control Dynamic Sampling.
However, the DYNAMIC_SAMPLING hint comes in two flavours: The first one on statement level (sometimes also called “cursor” level) corresponds to the OPTIMIZER_DYNAMIC_SAMPLING system / session setting, so the levels have the same meaning.
SELECT /*+ DYNAMIC_SAMPLING(<n>) */ ...
Here <n> represents the Dynamic Sampling level that should be used for that particular statement, no matter what has been defined on system / session level.
There is a second form of the DYNAMIC_SAMPLING hint that specifies in addition to the level the alias, and it is this form of Dynamic Sampling where the level has a completely different meaning as you’ll see in the second part of this post.
SELECT /*+ DYNAMIC_SAMPLING(<alias>, <n>) */ ...
Here the <alias> needs to correspond to a valid table alias used as part of the statement. This variation of the hint instructs the optimizer about Dynamic Sampling on a particular table.
Let’s first have a look at the different levels and their meaning that is applicable to OPTIMIZER_DYNAMIC_SAMPLING and the DYNAMIC_SAMPLING “cursor” level hint. Note that the levels in this case control both, the triggering events of Dynamic Sampling and the amount of data sampled.
As I write this post the latest incarnation of the official documentation for that can be found here [link deprecated], unfortunately it seems to be partially incorrect and / or incomplete. According to some tests performed I’ve found the following levels:
|0||Disable Dynamic Sampling||0|
|1||Perform Dynamic Sampling only if
– A table is missing statistics
– The table is involved in a join (in the same query block)
– There is no index on this table
– This table has more blocks below the High Water Mark (HWM) than will be
|2||Perform Dynamic Sampling for all tables with missing statistics||64|
|3||Perform Dynamic Sampling for all tables that:
– Meet level 2 criteria
– Or use an expression that leads the optimizer to guess (assume default hard-coded selectivity) for example a predicate like SUBSTR(COL, 1, 10) = ‘BLA’
|4||Perform Dynamic Sampling for all tables that:
– Meet level 3 criteria
– Or have more than one predicate applied to it (complex predicate / possibly correlated column values)
|5||The same as level 4||64|
|6||The same as level 4||128|
|7||The same as level 4||256|
|8||The same as level 4||1024|
|9||The same as level 4||4096|
|10||The same as level 4||4294967295|
If you look closely you’ll notice that there are some subtle differences between this table and the official documentation.
Note: The OPTIMIZER_FEATURES_ENABLE setting (sometimes abbreviated OFE) setting controls the compatibility of the Cost-Based Optimizer. In recent releases this can be set individually on session and even statement level (there is a corresponding hint OPTIMIZER_FEATURES_ENABLE). The value corresponds to one of the valid versions supported by the optimizer code, as of version 220.127.116.11 this is an impressive range of values starting with version 8.0.0 up to current version, sometimes even differentiating the 5th version digit (18.104.22.168.1 vs. 22.214.171.124). This setting acts as an umbrella for all the parameters controlling the optimizer (see this recent post by Jonathan Lewis for more information) including the various bug fix controls that can be seen from V$SYSTEM_FIX_CONTROL and V$SESSION_FIX_CONTROL.
If you set OPTIMIZER_FEATURES_ENABLE to versions below 9.2.0, level 0 will be the default (simply because no Dynamic Sampling existed in those versions). For version 9.2.0.x level 1 will be the default, and from 10g on the default will be level 2.
The level 1 used to be the default when Dynamic Sampling was introduced in release 9iR2 (9.2.0). The rationale behind the rather obscure rules when Dynamic Sampling is triggered at this level was probably related to the fact that back in those days the overhead of Dynamic Sampling was considered much bigger than today and a rather simple query that didn’t involve a join to a table without statistics or that used to have an index on such a table should not incur the additional cost at optimization time.
Level 2 is a fairly straightforward and will run Dynamic Sampling on any table that has missing statistics. If a table has statistics it won’t be subject to Dynamic Sampling with this level.
Things become more interesting at level 3. Starting with this level, Dynamic Sampling will also be triggered in the presence of statistics. The optimizer with default column statistics doesn’t have a clue about the selectivity of expressions. So most kind of expressions like TRUNC(DATECOL), UPPER(CHARCOL) or ROUND(NUMCOL) will cause the optimizer to revert to built-in, hard-coded selectivity defaults that are mostly based on percentages like 1% or 5%. If the optimizer detects such an expression that results in such a guess it will trigger Dynamic Sampling in the hope of getting a better selectivity estimate of the expressions used.
Note: Oracle supports so called “Function-Based Indexes” (sometimes abbreviated FBI) that are indexes based on such expressions. These FBIs cause Oracle to add a hidden, virtual column (these can be seen in the dictionary views DBA/ALL/USER_TAB_COLS) based on the expression used. When gathering statistics, Oracle, by default, maintains statistics for these hidden columns as well. This allows the optimizer to make use of these hidden column statistics and to come up with potentially improved cardinality estimates for the corresponding expressions. From 11g onwards the same can be achieved by using so called “Extended Statistics” or “Virtual Columns”. Both are based on the same internal virtual (hidden) column technology, but they don’t require a Function-Based Index. In previous versions the index is necessary to get the benefit of the virtual column information. Of course this also means overhead when manipulating data and also the possibility of the optimizer using the index inappropriately.
Level 4 again can trigger Dynamic Sampling even in the presence of statistics, because it attempts to cover cases that cannot be dealt with conventional statistics. In this particular case the focus is on correlated column values. By default the optimizer assumes that the selectivity of predicates will apply independently from each other, so that the selectivity of multiple predicates ANDed together on the same table will simply be multiplied. But quite often the column values are related and not independent from each other and therefore the basic assumption of independence is violated. This leads to wrong selectivity estimates (resulting in cardinality underestimations), because these multiple filter predicates do not reduce the cardinality as expected. Think for example of a filter on “brand” and “model” (brand = ‘BMW’ and model = ‘X3’). In such cases filtering on something like the “brand” on top of the “model” often doesn’t increase the selectivity, because usually a model is only made by a single brand. Also other complex predicates on the same table that include ANDs and ORs will trigger Dynamic Sampling on level 4.
The objective here is clear: By applying the actual predicates on a subset of the table data a hopefully more realistic selectivity estimate should be possible.
Note: Since Oracle 11g the optimizer can be made aware of correlated column values by the means of extended statistics that cover groups of columns. If multiple predicates are applied to the same table and a suitable column group is found in the statistics Oracle will use the column group instead of the individual column statistics. However at present there are various restrictions to this implementation: It can only be used for equal predicates, and the column group cannot contain any expressions, so a mixture of expressions and correlated column values cannot be covered. Also any non-equal predicates (for example “zip_code between 10000 and 20000 and city = ‘Ziggy’” will disable the usage of column groups. Such complex predicates can only be covered by Dynamic Sampling
Level 5 to 10 are straightforward again: They correspond to level 4 but increase the number of blocks sampled up to level 10 where more than 4 billion blocks (2^32) will be sampled, which for most segments nowadays means a full sample (a single tablespace at present can hold at most approximately 4 billion blocks which means a single partition of a segment cannot exceed this size, representing 128 TB when using the maximum block size of 32KB).
Note: The constant 32 that is used all over the place to determine the number of blocks sampled is controlled via the undocumented parameter “_optimizer_dyn_smp_blks” which defaults to 32. Also the number of blocks sampled is independent from the default block size used, which means that for larger block sizes more data will be sampled (and vice-versa: for smaller block sizes less data will be sampled).
The levels 3 and 4 give an idea about the true power of Dynamic Sampling: It might allow the optimizer to base its decisions on improved cardinality estimates in case of complex filter expressions. The closer the cardinality estimates to the reality the more likely the resulting execution plan will be good. Note that Dynamic Sampling at present is limited to filter predicates on single tables. It won’t be used by the optimizer for sampling join cardinalities, although the optimizer can use it to obtain column statistics that will be used for calculating join cardinalities (number of distinct values and nulls)
In the next part I’ll have a look at the table level DYNAMIC SAMPLING hint and will also explain why there are some numbers in brackets in the last table column above…