{"id":73145,"date":"2015-10-30T09:49:57","date_gmt":"2015-10-30T09:49:57","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/uncategorized\/basics-of-the-cost-based-optimizer-part-5\/"},"modified":"2021-07-14T13:07:20","modified_gmt":"2021-07-14T13:07:20","slug":"basics-of-the-cost-based-optimizer-part-5","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/oracle-databases\/basics-of-the-cost-based-optimizer-part-5\/","title":{"rendered":"Basics of the Cost Based Optimizer &#8211; Part 5"},"content":{"rendered":"<p>It&#8217;s been three months since I last published anything in this series on the Cost Based Optimizer, so a thumbnail sketch of the previous installments seems to be in order:<\/p>\n<ul>\n<li>In <a href=\"https:\/\/allthingsoracle.com\/basics-of-the-cost-based-optimizer-pt-1\/\"><em><strong>part 1<\/strong><\/em><\/a> I gave an informal overview of how the optimizer would &#8220;think&#8221; about a simple query.<\/li>\n<li>In <a href=\"https:\/\/allthingsoracle.com\/basics-of-the-cost-based-optimizer-part-2\/\"><em><strong>part 2<\/strong><\/em><\/a> I created a data set, and ran a few single-table queries that displayed data patterns, highlighted one of the problems that the optimizer has estimating the resources needed to handle a query, and supplied an official corection to a common optimizer problem.<\/li>\n<li>In <em><strong><a href=\"https:\/\/allthingsoracle.com\/basics-of-the-cost-based-optimizer-part-3\/\">part 3<\/a><\/strong><\/em> I showed how the optimizer&#8217;s basic algorithms for hash joins and nested loops can easily give the better plan a higher cost than the worse plan thanks to the optimizer&#8217;s ignorance of caching effects.<\/li>\n<li>In <em><strong><a href=\"https:\/\/allthingsoracle.com\/basics-of-the-cost-based-optimizer-part-4\/\">part 4 <\/a><\/strong><\/em>I introduced the &#8220;unit of optimisation&#8221; &#8211; a simple join between tables &#8211; and explained how 
the optimizer will attempt to transform a complex query into something simpler &#8211; typically with fewer query blocks &#8211; before optimizing query blocks and estimating the effects of combining those query blocks. The example of the &#8220;filter subquery&#8221; showed two ways in which the optimizer&#8217;s estimates could go wrong.<em><strong><br \/>\n<\/strong><\/em><\/li>\n<\/ul>\n<p>In this installment I&#8217;ll be looking at other ways in which the optimizer can choose poor execution plans and giving some pointers on how to spot the source of the error quickly, and a couple of strategies for addressing some of these errors.<\/p>\n<h3>Sources of errors<\/h3>\n<p>We\u2019ve already seen that the basic cost of a nested loop join is <em>\u201cnumber of rows selected from first table\u201d<\/em> * <em>\u201ccost of selecting one related set of rows from second table\u201d<\/em>. We\u2019ve noted that for an index the <em><strong>clustering_factor<\/strong><\/em> has a great impact on the cost of selecting the nominal <em>\u201cone related set\u201d<\/em>, and that 12c (with a backport to version 11.2.0.4) has given us a mechanism to tell Oracle how to derive a more realistic value for the <em><strong>clustering_factor<\/strong><\/em>. Given the multiplication, though, we also need to know a little more about how Oracle can produce a bad estimate for the <em>\u201cnumber of rows selected\u201d<\/em> and what we can do about helping the optimizer to get a better estimate; so we need to start thinking about <em><strong>selectivity<\/strong><\/em>.<\/p>\n<blockquote>\n<h3 style=\"padding-left: 30px\"><em>Cardinality and Selectivity<\/em><\/h3>\n<p style=\"padding-left: 30px\"><em>Although, as users of data, we tend to think in terms of \u201cnumber of rows\u201d (the cardinality) most of the critical arithmetic the optimizer uses is about the \u201cfraction of rows\u201d (the selectivity). 
If we execute a query like \u201cselect * from all_objects where object_type = \u2018SYNONYM\u2019\u201d we might see a plan where the \u201cRows\u201d (cardinality) column predicts 2,365, but under the covers the optimizer has been working with the fraction (selectivity) 1\/30, and has applied that fraction to its estimate of 70,963 rows in the table.\u00a0 <\/em><\/p>\n<\/blockquote>\n<p>The big problem with selectivity is that Oracle often doesn\u2019t have a well-informed mechanism for working out the selectivity it should apply at each step of an execution plan. If this is the case then it has one of two options: one option is simply to guess and use a fairly arbitrary fraction; the second is to apply a rational statistical approach that isn\u2019t necessarily appropriate for your data.<\/p>\n<p>Here\u2019s a little script to generate some data that we can use to demonstrate a few of the problems. It\u2019s worth running the examples I\u2019m going to show (and trying variations on the theme) to get a better feeling of how easy it is to spot the underlying cause of a bad cardinality estimate once you\u2019ve had a little practice. 
I\u2019ve used 11.2.0.4 in all the following cases.<\/p>\n<pre>create table t1\r\nnologging\r\nas\r\nwith generator as (\r\n\tselect\t--+ materialize\r\n\t\trownum id \r\n\tfrom dual \r\n\tconnect by \r\n\t\tlevel &lt;= 1e4\r\n)\r\nselect\r\n\trownum\t\t\t\t\tid,\r\n\tmod(rownum-1,200)\t\t\tmod_200,\r\n\tmod(rownum-1,200) + \r\n\t\ttrunc(dbms_random.value(0,5))\tmod_200_fudge,\r\n\ttrunc((rownum-1)\/5)\t\t\ttrunc_5,\r\n\t(sysdate - 2e5\/86400) +\r\n\t\ttrunc((rownum-1)\/5) \/ 86400\tfive_per_sec,\r\n\ttrunc(sqrt(rownum-1))\t\t\tskewed\r\nfrom\r\n\tgenerator\tv1,\r\n\tgenerator\tv2\r\nwhere\r\n\trownum &lt;= 1e6\r\n;\r\n\r\nbegin\r\n        dbms_stats.gather_table_stats(\r\n                ownname          =&gt; user,\r\n\t\ttabname\t\t =&gt; 'T1',\r\n\t\tmethod_opt\t =&gt; 'for all columns size 1'\r\n\t);\r\nend;\r\n\/\r\n\r\n<\/pre>\n<p>I\u2019ve generated 1 million rows \u2013 it took just a few seconds to do this \u2013 the <em><strong>mod_200<\/strong><\/em> column holds 200 distinct values (5,000 rows per value) evenly spread throughout the table, the <em><strong>mod_200_fudge<\/strong><\/em> column is closely correlated with <em><strong>mod_200<\/strong><\/em>, but I\u2019ve introduced a small random adjustment so that for each value of <em><strong>mod_200<\/strong><\/em> the column <em><strong>mod_200_fudge<\/strong><\/em> has roughly 1,000 rows for each of 5 consecutive values starting from the <em><strong>mod_200<\/strong><\/em> value.<\/p>\n<p>The <em><strong>trunc_5<\/strong><\/em> column holds 200,000 distinct values, clustered in groups of 5 rows with the same value, starting at 0, ending with 199,999. This 5-row cluster has then been repeated into the date-based <em><strong>five_per_sec<\/strong><\/em> column, which starts 200,000 seconds into the past and holds 5 rows per second for 200,000 seconds.<\/p>\n<p>The final column is for some very skewed data, generated using the square root function. 
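As a quick sanity check on that skew — a sketch in plain Python rather than SQL, purely for convenience, mirroring the trunc(sqrt(rownum-1)) expression from the script above — we can count how many rows each value would get:

```python
from collections import Counter
from math import isqrt

# Mirror trunc(sqrt(rownum - 1)) for rownum = 1 .. 1,000,000
counts = Counter(isqrt(n) for n in range(1_000_000))

# Value N covers n = N*N .. (N+1)*(N+1) - 1, i.e. 2N + 1 rows
print(counts[0], counts[1], counts[2], counts[999])   # -> 1 3 5 1999
```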
Thanks to the <em><strong>trunc()<\/strong><\/em> call it holds 1,000 distinct values (from 0 to 999); there\u2019s just one row with the value 0, 3 with the value 1, 5 with the value 2, and so on, (2N + 1 rows with the value N) until we get to 1999 rows with the value 999. Despite the uneven data distribution I\u2019ve only gathered simple stats on this column: in this installment I\u2019m not going to say anything about the effects of histograms.<\/p>\n<h3>Demonstrations<\/h3>\n<p>I\u2019m going to run a number of queries and then pull the execution plan from memory after doing so. I\u2019ve used the following framework, which I\u2019ll only show for the first example:<\/p>\n<pre>set pagesize 60\r\nset linesize 180\r\nset trimspool on\r\nset serveroutput off\r\nselect count(*) from t1 where sign(mod_200) = 1;\r\nselect * from table(dbms_xplan.display_cursor);\r\n\r\n<\/pre>\n<p>Here\u2019s the complete list of queries we&#8217;ll be examining over the next couple of articles:<\/p>\n<pre>select count(*) from t1 where sign(mod_200) = 1;\r\n\r\ncolumn new_date new_value m_date\r\nselect to_char(sysdate,'dd-mon-yyyy') new_date from dual;\r\n\r\nselect count(*) from t1 \r\nwhere trunc(five_per_sec) &gt; to_date('&amp;m_date','dd-mon-yyyy');\r\n\r\nselect count(*) from t1 where mod_200 = mod_200_fudge;\r\n\r\nselect count(*) from t1 where mod_200 = 100 and mod_200_fudge = 100;\r\n\r\nselect count(*) from t1 where trunc_5 &gt; (\r\n\tselect max(trunc_5) from t1 where mod_200 = 0\r\n);\r\n\r\nselect count(*) from t1 where five_per_sec &gt; sysdate - 60\/86400;\r\n\r\nexecute dbms_lock.sleep(10)\r\nselect count(*) from t1 where five_per_sec &gt; sysdate - 10\/86400;\r\n\r\n\r\nvariable b1 number\r\nexec :b1 := 10\r\nselect count(*) from t1 where skewed = :b1;\r\n\r\nexec :b1 := 100\r\nselect count(*) from t1 where skewed = :b1;\r\n\r\n<\/pre>\n<p>I\u2019ll report the basic execution plan for each query and give a brief description of what the query is trying to 
demonstrate, in individual sections below.<\/p>\n<h3>function(column) = constant<\/h3>\n<p>The optimizer doesn\u2019t know what the effect of a function will be. We can look at the function and the input data and, in this case, we know that most of the data will satisfy the predicate since <em><strong>sign(mod_200)<\/strong><\/em> will be 0 only when mod_200 is zero and it will be 1 for the rest of the data. The query returns the result 995,000, but the plan predicts 10,000. This is a fixed guess of 1%:<\/p>\n<pre>---------------------------------------------------------------------------\r\n| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |\r\n---------------------------------------------------------------------------\r\n|   0 | SELECT STATEMENT   |      |       |       |   660 (100)|          |\r\n|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |\r\n|*  2 |   TABLE ACCESS FULL| T1   | 10000 | 40000 |   660  (10)| 00:00:04 |\r\n---------------------------------------------------------------------------\r\n<\/pre>\n<p>Fortunately we can work around this type of problem with a <strong>virtual column<\/strong> if we decide that the estimate is bad enough to be a threat:<\/p>\n<pre>alter table t1\r\nadd (\r\n        sign_mod_200 generated always as ( sign(mod_200) ) virtual\r\n)\r\n;\r\n\r\nbegin\r\n\tdbms_stats.gather_table_stats(\r\n\t\townname    =&gt; user,\r\n\t\ttabname    =&gt; 'T1',\r\n\t\tmethod_opt =&gt; 'for columns sign_mod_200 size 3'\r\n\t);\r\nend;\r\n\/\r\n\r\n<\/pre>\n<p>It\u2019s important to note that when you create a virtual column you need to collect statistics on that column before it will have the effect you expect \u2013 though you may still get lucky if you forget. 
You can add virtual columns until you hit the standard limit of 1,000 columns in total for the table \u2013 but don\u2019t be too hasty to add virtual columns to solve every problem; remember that Oracle will be collecting stats on all those columns during its automatic stats collection job. (And if you\u2019re using partitioned tables and doing everything on full automatic with incremental statistics then you\u2019ll be creating and storing synopses as well.)<\/p>\n<p>I\u2019ve gathered stats with <em>\u201cfor columns XXX size 3\u201d<\/em> in this case because the <em><strong>sign()<\/strong><\/em> function can only return 3 possible values (-1, 0, 1) and since I\u2019m expecting a massive data skew it may help me to have a frequency histogram on the column.<\/p>\n<p>With the virtual column in place \u2013 and no change to the code \u2013 the optimizer\u2019s estimate for the number of rows returned jumped to the correct value.<\/p>\n<p>With a slightly different requirement \u2013 what if I\u2019m interested only in the small number of rows where the sign() is zero \u2013 I might choose to create a different virtual column and index it:<\/p>\n<pre>alter table t1\r\nadd (\r\n      mod_200_zeroes generated always as ( case sign(mod_200) when 0 then 0 end )\r\n\t virtual\r\n)\r\n;\r\n\r\nbegin\r\n\tdbms_stats.gather_table_stats(\r\n\t\townname    =&gt; user,\r\n\t\ttabname    =&gt; 'T1',\r\n\t\tmethod_opt =&gt; 'for columns mod_200_zeroes size 1'\r\n\t);\r\nend;\r\n\/\r\n\r\ncreate index t1_i1 on t1(mod_200_zeroes) nologging;\r\n\r\nselect * from t1 where mod_200_zeroes = 0;\r\n\r\n<\/pre>\n<p>This gives us an index that is as small as possible \u2013 it contains entries only for the rows that we are interested in. 
This demonstrates a useful principle, even though in this particular case it\u2019s neither sensible nor useful: there\u2019s too much relevant data for the index to be effective, and even if the target data set were much smaller we wouldn\u2019t index the <em><strong>sign()<\/strong><\/em> function; we would simply create a function-based index to identify the rows where <em><strong>mod_200<\/strong><\/em> itself was zero.<\/p>\n<h3>Function(column) &gt; constant<\/h3>\n<p>In this case I\u2019ve copied the current date into a substitution variable before using it. As before, the optimizer has no idea of the effect of the function on the column.<\/p>\n<pre>column new_date new_value m_date\r\nselect to_char(sysdate,'dd-mon-yyyy') new_date from dual;\r\n\r\nselect count(*) from t1 \r\nwhere  trunc(five_per_sec) &gt; to_date('&amp;m_date','dd-mon-yyyy');\r\n\r\n<\/pre>\n<p>Because I created the data by working backwards from sysdate this query will find no matching rows \u2013 but the optimizer guesses otherwise:<\/p>\n<pre>---------------------------------------------------------------------------\r\n| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |\r\n---------------------------------------------------------------------------\r\n|   0 | SELECT STATEMENT   |      |       |       |   812 (100)|          |\r\n|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |\r\n|*  2 |   TABLE ACCESS FULL| T1   | 50000 |   390K|   812  (27)| 00:00:05 |\r\n---------------------------------------------------------------------------\r\n<\/pre>\n<p>The guess is 5%. The same would be true for variations on this range-based predicate, while a BETWEEN predicate would report a selectivity of 0.25% (which, not coincidentally, happens to be 5% of 5%). 
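The arithmetic behind these fixed guesses is trivial to reproduce by hand; here is a sketch (in Python purely for convenience — the fractions are the optimizer's standard guesses, applied to our 1,000,000-row table; the dictionary and its labels are mine):

```python
# Standard optimizer guesses when it can't see through a predicate:
# 1% for function(col) = constant, 5% for an unbounded range,
# and 0.25% (5% of 5%) for a BETWEEN.
NUM_ROWS = 1_000_000

GUESS = {
    "function(col) = constant": 0.01,          # 1%
    "function(col) > constant": 0.05,          # 5%
    "function(col) between":    0.05 * 0.05,   # 0.25%
}

for predicate, selectivity in GUESS.items():
    print(f"{predicate:26s} -> {round(NUM_ROWS * selectivity):>7,} rows")
```

The 10,000 and 50,000 figures match the Rows estimates in the two plans above; 2,500 is what a BETWEEN version of the predicate would show.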
The same 5% (and, mutatis mutandis, 0.25%) appears with predicates of the form <em>\u201ccolumn &gt; {unknown value}\u201d<\/em>.<\/p>\n<p>Again we can add a virtual column and gather stats \u2013 though in an example of this type, but with a much larger date range, a histogram isn\u2019t likely to be appropriate:<\/p>\n<pre>alter table t1\r\nadd (\r\n        trunc_five_per_sec generated always as ( trunc(five_per_sec) ) virtual\r\n)\r\n;\r\n<\/pre>\n<p>Again, with no change to the query text, the predicted number of rows changes. Unfortunately this specific example now demonstrates a different problem.<\/p>\n<pre>---------------------------------------------------------------------------\r\n| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |\r\n---------------------------------------------------------------------------\r\n|   0 | SELECT STATEMENT   |      |       |       |   812 (100)|          |\r\n|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |\r\n|*  2 |   TABLE ACCESS FULL| T1   |   333K|  2604K|   812  (27)| 00:00:05 |\r\n---------------------------------------------------------------------------\r\n<\/pre>\n<p>The optimizer now thinks we will get 333,000 rows when we know we will get no data. The problem (that we will see again in the next installment) is that our predicate is outside the known range (low\/high) values of the virtual column stats, so the optimizer has used <em><strong>1\/num_distinct<\/strong><\/em> as the selectivity \u2013 which is another unfortunate guess about what it should do when the predicates are not \u201cnice\u201d. 
Since I had only 3 distinct values for <em><strong>trunc_five_per_sec<\/strong><\/em> the predicted row count is 1,000,000\/3 which is the 333,333 that you see.<\/p>\n<h3>Comparing columns<\/h3>\n<p>Although it may seem a little unrealistic to examine a query where we compare two columns in the same table, the principle I\u2019m about to demonstrate applies to comparing columns across tables in join queries.<\/p>\n<pre>select count(*) from t1 where mod_200 = mod_200_fudge;\r\n<\/pre>\n<p>We know that for any one value of <em><strong>mod_200<\/strong><\/em> there will be roughly 1,000 rows with the same value for <em><strong>mod_200_fudge<\/strong><\/em> and 4,000 spread fairly evenly across the next 4 consecutive values. Since there are 200 distinct values involved we expect the resulting count to be roughly 200,000 \u2013 but here\u2019s the optimizer\u2019s estimate:<\/p>\n<pre>---------------------------------------------------------------------------\r\n| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |\r\n---------------------------------------------------------------------------\r\n|   0 | SELECT STATEMENT   |      |       |       |   654 (100)|          |\r\n|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |\r\n|*  2 |   TABLE ACCESS FULL| T1   |  4902 | 39216 |   654   (9)| 00:00:04 |\r\n---------------------------------------------------------------------------\r\n<\/pre>\n<p>The rationale here is that the optimizer knows that there are 200 distinct values for mod_200 and 204 distinct values for mod_200_fudge, so it has used the larger number of distinct values (lower value for selectivity) to calculate the expected cardinality. 
In effect it has considered the two predicates \u201cmod_200 = {constant}\u201d and \u201cmod_200_fudge = {constant}\u201d and used the arithmetic for the predicate that will return the smaller number of rows: 1,000,000 \/ 204 = 4901.96.<\/p>\n<p>In this case you might have to depend on dynamic sampling to get a better estimate \u2013 except virtual columns can still give you a way around this (specific) problem. If <em>mod_200 = mod_200_fudge<\/em> then <em>mod_200_fudge \u2013 mod_200 = 0<\/em>: so let\u2019s create that virtual column, gather stats, and modify the query. Note that in this case I expect the virtual column to hold 5 distinct values of about 200,000 rows each \u2013 in other cases where the data follows a more interesting pattern you might want to create a histogram, or even introduce a more sophisticated virtual column (and matching code change) in order to take advantage of some precision indexing.<\/p>\n<pre>alter table t1\r\nadd (\r\n        mod_200_diff generated always as ( mod_200_fudge - mod_200 ) virtual\r\n)\r\n;\r\n\r\nPLAN_TABLE_OUTPUT\r\n-------------------------------------\r\nSQL_ID  8vrjgbqkqds8u, child number 0\r\n-------------------------------------\r\nselect count(*) from t1 where mod_200_diff = 0\r\n\r\n---------------------------------------------------------------------------\r\n| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |\r\n---------------------------------------------------------------------------\r\n|   0 | SELECT STATEMENT   |      |       |       |   665 (100)|          |\r\n|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |\r\n|*  2 |   TABLE ACCESS FULL| T1   |   200K|   585K|   665  (10)| 00:00:04 |\r\n---------------------------------------------------------------------------\r\n\r\nPredicate Information (identified by operation id):\r\n---------------------------------------------------\r\n   2 - filter(\"MOD_200_FUDGE\"-\"MOD_200\"=0)\r\n\r\n<\/pre>\n<p>With 
the column (and its stats) in place, and the change in code, the optimizer has correctly predicted 200,000 rows for the query. (I\u2019ve included the query and predicate section of the dbms_xplan output here to show that although I used the virtual column in the query, Oracle has displayed the meaning of the predicate in terms of the underlying columns.)<\/p>\n<p>It\u2019s worth noting that adding virtual columns can expose badly written applications to a performance threat, or even a risk of crashing. First, if your application contains code that uses the <em>\u201cselect *\u201d<\/em> notation then it will select the virtual columns: this may simply increase the performance costs of data movement across a network, but if the code isn\u2019t designed to deal with an open-ended list of columns you may find the code crashes because you don\u2019t have enough receiving variables for the select list. Secondly, if your application has code that does <em>\u201cinsert into tableX values()\u201d<\/em> without specifying a list of table columns then it will crash with Oracle error \u201c<em>ORA-00947: not enough values\u201d<\/em>. We can bypass both problems in 12c because we can declare the virtual columns to be invisible.<\/p>\n<h3>Combining Columns<\/h3>\n<p>This is a very well-known problem. 
Oracle assumes that there is no correlation between the different columns in a table, and derives the selectivity for ANDs and ORs of simple predicates using the rules of probability theory for independent variables which I\u2019ll describe after showing you some results:<\/p>\n<pre>select count(*) from t1 where mod_200 = 100 and mod_200_fudge = 100\r\n---------------------------------------------------------------------------\r\n| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |\r\n---------------------------------------------------------------------------\r\n|   0 | SELECT STATEMENT   |      |       |       |   654 (100)|          |\r\n|   1 |  SORT AGGREGATE    |      |     1 |     8 |            |          |\r\n|*  2 |   TABLE ACCESS FULL| T1   |    25 |   200 |   654   (9)| 00:00:04 |\r\n---------------------------------------------------------------------------\r\n<\/pre>\n<p>We know that of the 5,000 rows where mod_200 = 100 roughly 1,000 rows will have mod_200_fudge = 100; so the optimizer\u2019s estimate is far too low. This is because the optimizer doesn\u2019t have any information to tell it that the two columns are closely related. 
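To get a feel for how far the independence assumption drifts from reality here, the following Python sketch (illustrative only — this is not Oracle's code, and the seed is arbitrary) rebuilds the fudge pattern and compares the optimizer-style product of selectivities with a simulated count:

```python
import random

random.seed(42)
NUM_ROWS = 1_000_000

# Of the 5,000 rows where mod_200 = 100, mod_200_fudge = 100 only when
# the random fudge trunc(dbms_random.value(0,5)) comes out as 0 (1 in 5).
rows_with_mod_200_100 = NUM_ROWS // 200
actual = sum(
    1 for _ in range(rows_with_mod_200_100) if random.randrange(5) == 0
)

# Optimizer-style estimate: treat the columns as independent and multiply
estimate = NUM_ROWS / (200 * 200)

print(f"independence estimate: {estimate:.0f} rows, simulated count: {actual} rows")
```

The estimate comes out at 25 (matching the plan above), while the simulated count lands near 1,000.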
Its arithmetic has said:<\/p>\n<blockquote>\n<p><i>1 in 200 rows will have mod_200 = 100, and 1 in 200 rows will have mod_200_fudge = 100, so one row in 200 * 200 will report both predicates true, which gives 1,000,000\/40,000 = 25 rows.<br \/>\n<\/i><\/p>\n<\/blockquote>\n<p>Oracle allows us to create column groups to address this type of problem \u2013 they are only effective when all the predicates involved use equality, and they cease to be effective if the individual columns have histograms (but see <a href=\"https:\/\/antognini.ch\/2014\/02\/extension-bypassed-because-of-missing-histogram\/\"><em><strong>Chris Antognini&#8217;s note, especially the addendum<\/strong><\/em><\/a>) or if the predicates go outside the low\/high range for any of the columns.<\/p>\n<pre>begin\r\n        dbms_stats.gather_table_stats(\r\n                ownname         =&gt; user,\r\n                tabname         =&gt; 'T1',\r\n                method_opt      =&gt; 'for columns (mod_200, mod_200_fudge) size 1'\r\n        );\r\nend;\r\n\/\r\n\r\n-----------------------------------------------------------\r\n| Id  | Operation          | Name | Rows  | Bytes | Cost  |\r\n-----------------------------------------------------------\r\n|   0 | SELECT STATEMENT   |      |       |       |   727 |\r\n|   1 |  SORT AGGREGATE    |      |     1 |     8 |       |\r\n|*  2 |   TABLE ACCESS FULL| T1   |  1000 |  8000 |   727 |\r\n-----------------------------------------------------------\r\n<\/pre>\n<p>A quick check of view user_tab_cols shows us that we have a new (hidden, system-generated) column called SYS_STUHEAW8XLV$A#9LXJ65QQJU13 holding 1,000 distinct values. 
Internally, Oracle has created a virtual column using a hash function called <em><strong>sys_op_combined_hash()<\/strong><\/em>, and then calculated the selectivity for the predicate: <em>&#8220;sys_op_combined_hash(mod_200, mod_200_fudge) = sys_op_combined_hash(100,100)&#8221;<\/em>.<\/p>\n<p>You can create a maximum of 20 \u201ccolumn groups\u201d like this \u2013 so don\u2019t waste the limited scope the option gives you. In this particular case there is an alternative that would allow us to avoid \u201cusing up\u201d one of the limited supply. We\u2019ve already created a virtual column <em><strong>mod_200_diff<\/strong><\/em> based on <em><strong>(mod_200_fudge \u2013 mod_200)<\/strong><\/em>: look what happens if we rewrite our query:<\/p>\n<pre>PLAN_TABLE_OUTPUT\r\n-------------------------------------\r\nSQL_ID  dgdusd4tad9nv, child number 0\r\n-------------------------------------\r\nselect count(*) from t1 where mod_200 = 100 and mod_200_diff = 0\r\n-----------------------------------------------------------\r\n| Id  | Operation          | Name | Rows  | Bytes | Cost  |\r\n-----------------------------------------------------------\r\n|   0 | SELECT STATEMENT   |      |       |       |   727 |\r\n|   1 |  SORT AGGREGATE    |      |     1 |     7 |       |\r\n|*  2 |   TABLE ACCESS FULL| T1   |  1000 |  7000 |   727 |\r\n-----------------------------------------------------------\r\n\r\nPredicate Information (identified by operation id):\r\n---------------------------------------------------\r\n   2 - filter((\"MOD_200\"=100 AND \"MOD_200_FUDGE\"-\"MOD_200\"=0))\r\n<\/pre>\n<p>The query is logically identical to the original, and we\u2019ve got the correct cardinality estimate by using a virtual column rather than a column group. 
The arithmetic has worked as follows:<\/p>\n<ul>\n<li>Selectivity of (mod_200 = 100) is 1\/200<\/li>\n<li>Selectivity of (mod_200_diff = 0) is 1\/5<\/li>\n<li>Selectivity of (mod_200 = 100 and mod_200_diff = 0) is 1\/200 * 1\/5 = 1\/1000.<\/li>\n<\/ul>\n<p>Because of the carefully constructed patterns I created in the data, these numbers happen to be a valid representation of the data. You\u2019re not likely to find such a perfect model in real data, but you may spot occasions where something like this is a good enough approximation to allow you to give the optimizer a much better idea of cardinality than it would otherwise have.<\/p>\n<p>A word of warning when upgrading to 12c. If you leave all the automatic optimization features enabled Oracle is capable of creating, and gathering stats on, column groups without giving you any indication that it has done so. This is a side effect of the cardinality feedback, dynamic statistics, and SQL Plan Directive mechanisms. Since you can only have 20 sets of extended stats per table, you may need to keep a close eye on the views <em><strong>user_tab_cols<\/strong><\/em> and <em><strong>user_stat_extensions<\/strong><\/em> to make sure that you know what column groups have been created on your tables.<\/p>\n<h3>Conclusion<\/h3>\n<p>This article has already gone on too long and there are still several cases I haven\u2019t yet covered where the optimizer has problems getting reasonable cardinality estimates. 
The cases we have covered, though, can often be handled very nicely by creating virtual columns or the <em>\u201ccolumn group\u201d<\/em> variation of extended statistics.<\/p>\n<p>In the next article I\u2019ll carry on working through the list of problems introduced at the start of this article, where we\u2019ll find that there are some problems that are much harder to deal with.<\/p>\n<p>&nbsp;<\/p>\n<p><a href=\"http:\/\/jonathanlewis.wordpress.com\/cbo-series\/\">\u2013&gt; Catalogue of current articles in CBO series.<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>It&#8217;s been three months since I last published anything in this series on the Cost Based Optimizer, so a thumbnail sketch of the previous installments seems to be in order: In part 1 I gave an informal overview of how the optimizer would &#8220;think&#8221; about a simple query. In part 2 I created a data set, and ran a few&hellip;<\/p>\n","protected":false},"author":101205,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143533],"tags":[48348,48383],"coauthors":[],"class_list":["post-73145","post","type-post","status-publish","format-standard","hentry","category-oracle-databases","tag-12c-cardinality","tag-cost-based-optimizer"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/73145","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/101205"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=73145"}],"version-history":[{"count":1,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/
v2\/posts\/73145\/revisions"}],"predecessor-version":[{"id":91641,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/73145\/revisions\/91641"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=73145"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=73145"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=73145"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=73145"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}