Visual Checks on How Data is Distributed in SQL Server

There are many reasons for wanting to know how data is distributed. Sometimes you just want a rough idea of the way that data is distributed in a column. You may think, wouldn't it be nice to have a SQL function that just showed you roughly what the distribution was, graphically, in the results pane. Phil Factor thought that was well and turned the vague wish into reality.

This article is all about how to go about looking at a visual representation of the way that data is distributed in tables, in a quick unpolished way, while you are programming with SSMS or whatever you prefer to use.

The way that the values of your numerical data is distributed is important. It will often tell you if some data is missing, or incorrect. It will probably tell you if it has been faked, If you are a database guy, it will tell you how valuable the data would be as an index, or how reusable your cached query plans will be. If you are creating data you can check to see if it has the same distribution as your production data.

You use a histogram to do this. This looks like a series of contiguous rectangles. The independent variable is plotted along the base (x) axis and frequencies on the vertical (y) axis. A histogram depicts the frequencies of observations occurring in certain contiguous ranges of values. These ranges are usually equal.

For this type of histogram that calculates by range, the variable is a ‘continuous’ rather than a ‘discrete’ variable. For a discrete variable, aggregating the values into continuous pockets or ranges makes no sense. Suppose we are classifying homes in a particular town according to the number of school-age children living in the house: then we should get a frequency distribution in which the independent variable would take values of 0, 1, 2, 3, 4, etc., according to the number of children living there. This data is exact: we cannot, in science, have less or more than a whole child. This sort of variable is’discrete’, as is any variable which can take only certain restricted values. You are likely to represent them by integers. Continuous variables tend to represent values of something that changes in a continuous way, such as the temperature of an oven, or the weight of sheep.

On the very nature of this distribution lies the assumptions that you can make about the data, such as the level of probability that two samples come from the same population. It is generally very important to know the way that your data is spread. When a distribution of numerical data is organized, you just create a number of evenly-spanned groups or ranges and see how many of your data values fall into each group. You usually order them from the smallest to largest, and then put into graphs and charts to examine them.

Let’s create a very simple table with a single column containing a floating point number. It could be temperature readings people’s height or product sales. Who knows?

Now, we’ll put in some data that is normally distributed.

We now do a very simple query to show the distribution. We will expect to see a bell-jar. We all do this just to get a rough idea of the distribution of the data.

… giving this in the SSMS result pane….

This bell jar is stuck to the wall. Of course it took me a short while to work out the number of pockets to use and the best divisor for the number of block characters to use for each bar. There is another problem. If there is no data in one of the pockets, it doesn’t show up, because the GROUP BY does not know of the existence of the pocket.

You could throw up your hands and paste the data into Excel, use R, or export the data into a file and use Gnuplot. With the data sizes we are working with, that soon becomes less practical. Also, you often need a quick answer in whatever application you are using to access the data, such as SSMS.

SQL Server actually maintains distribution histograms for indexes and columns. They are, however, specialised for their purpose of predicting the number of results that will be returned from a query. The query optimizer doesn’t maintain fixed ranges or pockets, but adjusts the individual histogram steps to minimize the number of steps in the histogram, up to 200 of them, whilst maximizing the difference between the boundary values. In short, they are useless for our immediate purpose, because it provides only an estimate, and it requires quite a lot of calculation to get a value for a range.

Here is a rather more robust SQL technique that takes the independent variable and converts it in such a way that it can be represented within a matrix of forty (x) by ten (y) rectangles, represented by the space, block and half-block character. Here it is plotting the histogram for our random data.

Typically, if your data is normally distributed, you’ll get histograms like this appearing in the results pane ..

But here is the distribution of the first characters of the AdventureWorks customers

You can get a very skewed distribution

…or one that is boringly flat: evenly distributed.

And you can get anything in between!

To do all this, we have created a function that does the histogram and now we are simply passing the independent variable to the graphing routine.

The final version is a table-valued function that takes a table-valued parameter as its input. This is a table with a single column consisting of a float. It returns a table that has to be ordered in the right way according to the Y value. Each line represents a row of the histogram in Unicode characters (only three are used for the actual histogram). The histogram looks better if you put the title in the alias for the Line column.

The histogram must be viewed in a monospaced font, and if it is used in SSMS, this is likely to be the case unless you view it in ‘grid’ mode. You must have ‘results to text’ selected.

The current version is here in Github. A version is attached to the article but we don’t always succeed in keeping that up-to-date.