Histograms Part 2

In part 1 of this series we discussed the reasons why we might want to create some histograms to help the optimizer deal with skewed data distribution. We used an example of a simple status column in an orders table to demonstrate principles, talked about frequency histograms in particular, and highlighted a couple of problems associated with histograms.

In part 2 we are going to spend some time on the only other type of histogram that is available in versions of Oracle up to 11g, the height-balanced histogram. There are a few patterns of data where a height-balanced histogram can be useful, and we’ll be looking at those in part 3; but we’re going to start with an example where our data won’t allow Oracle to create a frequency histogram and where a height-balanced histogram can become more of a liability than a benefit, and we’ll describe a strategy (one that you will no longer need in 12c) for working around this limitation. Along the way we will look at a very simple example demonstrating the algorithm that Oracle uses to generate the contents of a height-balanced histogram.

A simple example

Here’s a report of some important client data – camouflaged, of course – showing an uneven distribution of values for a particularly important column.

Clearly I have a very uneven distribution of values here, so a histogram might be quite helpful. Unfortunately there are 352 distinct values, which is beyond the limit of 254 buckets that Oracle allows for a frequency histogram until 12c, so if I generate a histogram it will have to be a height-balanced histogram.
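As a sketch of the sort of report I mean, an aggregate query along the following lines would do the job – the table name t1 and column name n1 are invented for illustration:

select
        n1,
        count(*)        row_count
from
        t1
group by
        n1
order by
        count(*) desc
;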

To generate a height-balanced histogram Oracle starts by sorting the data into order, then selects the first row and every Nth row (where N = “number of non-null values in the column” divided by “number of buckets requested”). With our sample data we would be selecting roughly every 40,000th row (10M / 254) from the sorted data.

Each selected value is referred to as an endpoint value, and the selected values are labelled with endpoint numbers running from 0 to the number of buckets. Using this algorithm a value that appears sufficiently frequently in the data set will be selected many times, which allows Oracle to recognize it as a “popular” value. When storing the histogram selection Oracle doesn’t store repetitions of endpoint values – if a value appears many times in the list it will be stored just once, with its highest endpoint number.
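To put the arithmetic in context, this is roughly how you would ask for a histogram with the maximum 254 buckets – again t1 and n1 are invented names, and I’ve used the 100% sample implied by the figures above:

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 't1',
                estimate_percent => 100,
                method_opt       => 'for columns n1 size 254'
        );
end;
/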

To demonstrate the mechanism involved, and the subsequent cardinality arithmetic, I’m going to take the following (ready-sorted) 20 values and build a histogram of 5 buckets:

Take the first and every 4th number (20 / 5 = 4)

Store them, labeling them from zero since we’ve included the lowest original value in our list:

Now eliminate the earlier copies of any duplicated values:

This is what Oracle would store as the height-balanced histogram on the original data set. From this it would infer that 12 was a popular value (it appeared more than once); but, even though we can see that there are just as many 12s as there are 13s, Oracle would not infer that 13 was a popular value. You might like to consider the effect of changing one of the 11s to 16 – if you did this then 12 would cease to appear as a popular value and 13 would become a popular value; conversely if you changed a 16 to 11 then neither 12 nor 13 would appear as popular.
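If you want to experiment with the mechanism, the selection step is easy to emulate in SQL. The following is only a sketch – t1 and n1 are hypothetical, and the literal 4 is the bucket size for 20 rows and 5 buckets:

select
        trunc(posn / 4)         endpoint_number,        -- labels 0 to 5
        n1                      endpoint_value
from    (
        select
                n1,
                row_number() over (order by n1)         posn
        from
                t1
        )
where   posn = 1                -- the first row ...
or      mod(posn, 4) = 0        -- ... and every 4th row thereafter
order by
        posn
;

To mimic what Oracle finally stores you would then keep just the highest endpoint_number for each distinct endpoint_value – for example by grouping on endpoint_value and taking max(endpoint_number).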

As far as cardinality calculations go, Oracle “counts buckets” for the popular values. In our tiny example we picked every 4th row, so our “bucket” size is 4, and we know that 12 appears twice in the stored histogram, so Oracle calculates the cardinality of the predicate “n1 = 12” as 8 (i.e. 2 buckets * 4 rows per bucket). The method the optimizer uses to calculate the cardinality for the values which have not been recognized as popular has varied over time – in 11.2.0.3 one of the calculations is simply (number of non-popular rows) / (number of non-popular values), which works as follows in our example:

  • Popular values: 1 (12)
  • Non-popular values: 7 (5, 6, 9, 11, 13, 16, 17)
  • Popular rows: 8 (2 buckets at 4 rows per bucket)
  • Non-popular rows: 12 (20 – 8)

Non-popular cardinality = 12/7 = 1.714, which rounds up to 2.
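Once a real histogram is in place you can check the same arithmetic by reading it back from the data dictionary. The following sketch (table and column names assumed, as before) reports the number of buckets covered by each stored endpoint value; any value covering two or more buckets is “popular”, and its cardinality estimate is its bucket count multiplied by the rows per bucket:

select
        endpoint_value,
        endpoint_number
      - lag(endpoint_number, 1, 0)
            over (order by endpoint_number)     buckets
from
        user_tab_histograms
where
        table_name  = 'T1'
and     column_name = 'N1'
order by
        endpoint_number
;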

There are various complications and variations to the calculations, of course; in particular if Oracle has used a sample to build the histogram the optimizer will need to scale up its calculations by a factor that reflects the difference between the number of relevant rows in the table and the number of rows captured in the histogram.

With this small example in mind, let’s consider the 10M rows and 352 values from my client.

Precision matters

We have 10M rows – and the first thing we need to do to create the height-balanced histogram is to sort the data, so the very first step is fairly resource intensive. If we take a sample this would reduce the volume of data sorted – but it’s just possible that the sample would miss so many values that Oracle would think it could create a frequency histogram, and that might result in the optimizer making very low estimates of cardinality for values which actually have tens of thousands of rows.

Having sorted the data, how many of our “popular” values are likely to appear to be popular by the time we do the selection of every 40,000th row? Our most popular value has 1.8M rows, which accounts for 46 buckets; the next three values account for a further 63 buckets, leaving only 145 buckets – and since it takes two buckets to identify a popular value at all we’re only going to be able to identify at most another 72 popular values – everything else will be treated as “average”.

In fact this is one of the really irritating features of the grey area between frequency and height-balanced histograms. A frequency histogram can handle 254 popular values, but if Oracle has to switch to a height-balanced histogram it can’t identify more than 127 (i.e. 254/2) popular values – if your data goes a little over the frequency limit your precision can drop dramatically.

Stability is another problem – in the middle of my list I reported a number of values for which there were about 70,000 rows each. Are these going to be captured in the histogram? Given that a bucket represents 40,000 rows, a value that has 40,001 rows might just manage to appear twice in our selection; on the other hand a value that has 79,999 rows might just fail to appear twice. All you have to do is add or delete a few rows each day and values will “randomly” appear in, and disappear from, the histogram each time you gather it. When I analysed the data set in detail I came to the following conclusions – for a 100% sample:

  • There are 25 values that WILL get captured each day (more than 80,000 rows per value)
  • There are 35 values that might get captured one day and disappear the next (40,000 to 80,000 rows)

Height-balanced histograms can be very unstable – if any of those 35 values are the rows that users are really interested in then their execution plans may change dramatically every time the histogram is gathered. And the randomness of the sample makes things worse if we don’t use 100% of the data to gather the histogram.

But here’s another aspect of analysing the data in detail:

  • The top 140 values account for 99% of the data.
  • The top 210 values account for 99.9% of the data
  • The top 250 values account for 99.98% of the data
  • The last 102 values account for about 2,000 rows from 10M

(Remember, by the way, that we were only going to be able to capture at most 76 (72 + 3 + 1) popular values anyway because of the number of buckets the very popular values will cover.)

We have 352 distinct values in the table, but 102 of them are virtually invisible. Wouldn’t it be nice if we could create a frequency histogram on the top 250 values (adding in the low and high values for the column, if necessary) and tell Oracle to assume that any other values have an average of about 20 rows each? That’s what 12c does, by the way, with its new Top-N (“top frequency”) histogram, and it’s what we can do with a few calls to the dbms_stats package if we really feel that it’s important.
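For reference, the 12c version of this histogram appears only when you let Oracle gather the stats with its auto sample size; a call of roughly the following shape (same invented names as before) would be the starting point:

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 't1',
                estimate_percent => dbms_stats.auto_sample_size,
                method_opt       => 'for columns n1 size 254'
        );
end;
/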

Faking the Frequency

Let’s go back to the data set we used in the previous article to demonstrate how to construct a simple frequency histogram when we have a good idea of the values we want to use. Remember that we had a status column with a data distribution that looked like this:

Since it’s a very small set of values, and we’re quite happy to say that this set of figures is a “good enough” model of the data whenever we need to get the optimizer to run our critical queries, it’s very easy to write a procedure to recreate a frequency histogram whenever Oracle’s ordinary stats gathering mechanism creates new stats on this table.

The code depends on three key procedures in the dbms_stats package: get_column_stats(), prepare_column_values() and set_column_stats(). It assumes that table stats (of some description) have already been gathered, since it reads the current stats for the column into local variables, adjusts the variables, and then writes them back to the data dictionary.
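Here’s a sketch of the shape the procedure takes – the table name t1, the char(15) status values, their frequencies and the 10,000-row total are all invented for illustration, but the sequence of calls is the one just described:

declare
        m_distcnt       number;
        m_density       number;
        m_nullcnt       number;
        m_avgclen       number;
        m_srec          dbms_stats.statrec;
        c_array         dbms_stats.chararray;
begin
        --  Read the current stats for the column into local variables
        dbms_stats.get_column_stats(
                ownname         => user,
                tabname         => 't1',
                colname         => 'status',
                distcnt         => m_distcnt,
                density         => m_density,
                nullcnt         => m_nullcnt,
                srec            => m_srec,
                avgclen         => m_avgclen
        );

        --  The values we want in the histogram (rpad'ed to 15 because the
        --  column is char(15)), and the frequency we want for each value
        c_array := dbms_stats.chararray(
                rpad('CANCELLED', 15),
                rpad('COMPLETE',  15),
                rpad('OPEN',      15),
                rpad('PENDING',   15)
        );
        m_srec.bkvals := dbms_stats.numarray(100, 9000, 400, 500);
        m_srec.epc    := 4;

        --  Convert the value list into the internal endpoint form
        dbms_stats.prepare_column_values(m_srec, c_array);

        --  Adjust the variables before writing them back
        m_distcnt := 4;
        m_density := 10 / 10000;        -- a missing value should report a cardinality
                                        -- of 10, assuming 10,000 non-null rows

        dbms_stats.set_column_stats(
                ownname         => user,
                tabname         => 't1',
                colname         => 'status',
                distcnt         => m_distcnt,
                density         => m_density,
                nullcnt         => m_nullcnt,
                srec            => m_srec,
                avgclen         => m_avgclen
        );
end;
/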

There are a couple of key details that are worth mentioning. First, in the previous article I said that Oracle uses “half the least popular frequency” for predicates involving values that are missing from the histogram; it seems that when you use set_column_stats() to create stats, Oracle uses the density to calculate the cardinality of missing values – which is why I have set the density in this code: I’ve decided that if you query for a missing value its cardinality should be 10. (If you check the view user_tab_columns, the only difference between gathering stats and setting them is that the column user_stats is set to YES when you set stats and NO when you gather them.)

Secondly, you’ll notice that for my list of values I’ve right-padded each value to 15 characters with spaces. My table definition had status as a char() column, and Oracle treats char() and varchar2() slightly differently when converting strings into the numbers it stores in the histogram – for char() columns you must rpad() each value to the declared column length (15 characters here) because char() values are blank-padded to their full length; for varchar2() columns you must not pad. If you get this wrong the calculations will be wrong.

Finally, I have used an array of type dbms_stats.chararray; there are several other array types defined for the dbms_stats package (e.g. numarray, datearray) and you need to choose the type that is appropriate for the column you are hacking.

Choosing your stats

So if you know what values to use and their frequencies you can easily set up a fake frequency histogram. If you don’t know the data very well you might want to derive the histogram information from the data itself. My view on histograms is that you should have only a very small number of them, that they should generally be frequency histograms, and that you need to know your data – so I’ve never written a generic (hence complicated) script to deal with every option; instead I have a framework that I use to create a custom procedure for each column that needs one. The core of the procedure is basically a simple SQL statement that identifies the minimum and maximum values, the top N most frequently occurring values along with their frequencies, and some high-level summary information; here’s a sample query with sample output:
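A query along the following lines fits that description – the table name t1, the column name n1 and the top-250 cut-off are assumptions, and the output column names are only approximations of the ones discussed below:

select
        count(*) over ()        buckets,        -- number of rows (buckets) actually selected
        sum(ct)  over ()        hist_size,      -- rows covered by the histogram
        sample_size,                            -- total non-null rows aggregated
        min_n1,
        max_n1,
        n1,
        ct
from    (
        select
                n1, ct,
                rank() over (order by ct desc, n1)      frequency_rank,
                min(n1) over ()                         min_n1,
                max(n1) over ()                         max_n1,
                sum(ct) over ()                         sample_size
        from    (
                select  n1, count(*) ct
                from    t1
                group by
                        n1
                )
        )
where   frequency_rank <= 250           -- the top N by frequency
or      n1 in (min_n1, max_n1)          -- always keep the low and high values
order by
        n1
;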

The innermost part of the query is the aggregate by column value that we need, and which we hope will reduce a large data set to a small result set. We then order this by column frequency, using a couple of analytic functions to bring in the values corresponding to the minimum and maximum column values. Then we select the rows corresponding to the top N frequencies and the minimum and maximum column values. Finally we use a couple of analytic functions to count the actual number of buckets we’ve selected and sum the total volume of data in the histogram, returning the results in order of column value.

The difference between the sample value and the hist_size tells us whether this is likely to be a “good enough” approximation to the whole data set. The buckets value tells us whether we’ve had to add the minimum and maximum values to the list or whether they naturally fell into our Top-N, and the min_n1, max_n1 columns tell us the actual minimum and maximum values for the columns. The N1 and CT columns give us the details that (with a little manipulation, perhaps) we’re going to put into the two dbms_stat arrays.

Once we have a query like this, we need do little more than change our previous version of the procedure to get rid of the array assignments, replacing them with a cursor for loop:
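A sketch of that change might look like this – still assuming a char(15) column; for a numeric column you would build a dbms_stats.numarray of values and drop the rpad():

declare
        m_srec          dbms_stats.statrec;
        c_array         dbms_stats.chararray := dbms_stats.chararray();
        n_array         dbms_stats.numarray  := dbms_stats.numarray();
begin
        for r in (
                select  n1, ct
                from    (
                        select
                                n1, count(*) ct,
                                rank() over (order by count(*) desc, n1) frequency_rank
                        from    t1
                        group by
                                n1
                        )
                where   frequency_rank <= 250
                order by
                        n1      -- prepare_column_values() wants the values in order
        ) loop
                c_array.extend;
                n_array.extend;
                c_array(c_array.count) := rpad(r.n1, 15);
                n_array(n_array.count) := r.ct;
        end loop;

        m_srec.bkvals := n_array;
        m_srec.epc    := c_array.count;

        dbms_stats.prepare_column_values(m_srec, c_array);

        --  ... then set the density and distcnt, and call set_column_stats()
        --  exactly as in the earlier sketch
end;
/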

There are a number of little details needed around the edges to make this a complete solution – initialising variables, deciding on a suitable density, deciding how to handle the case where the minimum and maximum values didn’t appear in the Top-N, deciding what to do if the volume of data captured is too small compared to the sample size, and so on. Further refinements are left as an exercise to the reader.

Conclusion

Sometimes you have to do some fairly subtle things to get the best out of Oracle, and when you’re dealing with histograms you’re in an area where you’re most likely to need some custom mechanisms to get things right – but probably only for a few special cases.

One special case appears when you have a column that holds more distinct values than Oracle currently allows for a frequency histogram (254) and you find that a height-balanced histogram doesn’t capture as much information as you would like and, quite possibly, gives you changes in execution plans whenever you gather statistics. In a case like this you may decide to construct a “fake” frequency histogram that holds details of the most popular values in the table and also tells the optimizer how to work out a representative cardinality for the rest of the values.

If you know your data well you could set up a procedure that uses a fixed set of numbers for the histogram, otherwise you could query the data to generate a suitable “top N” set of stats. Whichever you choose to do, you need to remember that the optimizer will scale up the size of the histogram you’ve generated to match the value of (user_tables.num_rows – user_tab_cols.num_nulls), so you should use this technique only when the popular values represent a very large percentage (perhaps as much as 98%) of the total volume of data.