Visual Checks on How Data is Distributed in SQL Server

This article shows how to get a quick, unpolished visual representation of the way that data is distributed in a table while you are programming in SSMS, or whatever tool you prefer to use.

The way that the values of your numerical data are distributed is important. It will often tell you if some data is missing or incorrect. It will probably tell you if it has been faked. If you are a database guy, it will tell you how valuable the data would be as an index, or how reusable your cached query plans will be. If you are creating data, you can check to see if it has the same distribution as your production data.

You use a histogram to do this. This looks like a series of contiguous rectangles. The independent variable is plotted along the base (x) axis and frequencies on the vertical (y) axis. A histogram depicts the frequencies of observations occurring in certain contiguous ranges of values. These ranges are usually equal.

For this type of histogram, which calculates by range, the variable is a ‘continuous’ rather than a ‘discrete’ variable. For a discrete variable, aggregating the values into contiguous pockets or ranges makes no sense. Suppose we are classifying homes in a particular town according to the number of school-age children living in the house: then we should get a frequency distribution in which the independent variable would take values of 0, 1, 2, 3, 4, etc., according to the number of children living there. This data is exact: we cannot, in science, have less or more than a whole child. This sort of variable is ‘discrete’, as is any variable that can take only certain restricted values. You are likely to represent them by integers. Continuous variables tend to represent values of something that changes in a continuous way, such as the temperature of an oven, or the weight of sheep.
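For a discrete variable like this, a plain GROUP BY on the value itself gives the frequency distribution. Here is a minimal sketch of the idea, using a made-up table variable (@Households and its handful of rows are purely illustrative):

```sql
-- Hypothetical sample: one row per home, holding the number of school-age children
DECLARE @Households TABLE (Children INT NOT NULL);

INSERT INTO @Households (Children)
VALUES (0), (0), (0), (1), (1), (2), (2), (2), (2), (3), (4);

-- Each distinct value is its own category, so no ranges are needed:
-- we just count the rows for each value and draw one block per row.
SELECT Children,
       COUNT(*) AS Frequency,
       REPLICATE(N'█', COUNT(*)) AS Bar
FROM @Households
GROUP BY Children
ORDER BY Children;
```

Because the values are already whole numbers, there is nothing to round or scale before counting.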

The assumptions that you can make about the data, such as the level of probability that two samples come from the same population, rest on the nature of this distribution. It is generally very important to know the way that your data is spread. To organize a distribution of numerical data, you just create a number of evenly-spanned groups or ranges and see how many of your data values fall into each group. You usually order the groups from the smallest to the largest, and then put them into graphs and charts to examine them.
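As a rough illustration of evenly-spanned ranges, the sketch below assigns each value to a range of a fixed width and counts the members of each range. The table variable @Readings, its values, and the width of 5 are all assumptions chosen for the example.

```sql
-- Hypothetical readings; any numeric column would do
DECLARE @Readings TABLE (Reading FLOAT NOT NULL);

INSERT INTO @Readings (Reading)
VALUES (12.5), (14.1), (15.0), (17.3), (18.8), (19.2), (21.7), (22.4), (23.9), (29.6);

DECLARE @Width FLOAT = 5.0; -- assumed width of each range

-- FLOOR(value / width) is the index of the range that the value falls into.
-- Multiplying the index by the width gives the lower bound of that range.
SELECT b.Bin * @Width       AS RangeStart, -- inclusive
       (b.Bin + 1) * @Width AS RangeEnd,   -- exclusive
       COUNT(*)             AS Frequency
FROM @Readings AS r
CROSS APPLY (SELECT FLOOR(r.Reading / @Width) AS Bin) AS b
GROUP BY b.Bin
ORDER BY b.Bin;
```

Each range includes its lower bound and excludes its upper bound, so a reading of exactly 15.0 is counted in the range that starts at 15 rather than the one that ends there.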

Let’s create a very simple table with a single column containing a floating point number. It could be temperature readings, people’s heights or product sales. Who knows?

IF OBJECT_ID('tempdb..#RandomNumbers') IS NOT NULL

DROP TABLE #RandomNumbers;

CREATE TABLE #RandomNumbers (number FLOAT); --a simple table of numbers

Now, we’ll put in some data that is normally distributed.

DECLARE @Mean FLOAT; --the mean (and median if normally distributed)

DECLARE @StandardDeviation FLOAT; --(the Standard deviation you want)

DECLARE @ii INT; --counter

DECLARE @TotalWanted INT; --the number of rows to generate

SELECT @Mean = 50, @StandardDeviation = 20, @ii = 1, @TotalWanted = 100000;

SET NOCOUNT ON;

/* now we generate the numbers we want, with the mean and standard deviation we want */

WHILE @ii <= @TotalWanted

BEGIN

INSERT INTO #RandomNumbers

(number)

SELECT ((RAND() * 2 - 1) + (RAND() * 2 - 1) + (RAND() * 2 - 1))

* @StandardDeviation + @Mean;

SELECT @ii = @ii + 1;

END;
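It is worth checking that the loop really produces the mean and standard deviation that were asked for. Each RAND() * 2 - 1 is uniformly distributed between -1 and 1, with a variance of 1/3, so the sum of three of them has a variance of 1. Multiplying by @StandardDeviation therefore gives the wanted spread, though only as an approximation to a normal distribution: no value can be more than three standard deviations from the mean. A quick check, assuming that #RandomNumbers has just been filled by the loop above:

```sql
-- Assumes the temporary table #RandomNumbers created and filled earlier in this session
SELECT COUNT(*)      AS NumberOfRows,
       AVG(number)   AS ObservedMean,              -- should be close to @Mean
       STDEV(number) AS ObservedStandardDeviation, -- should be close to @StandardDeviation
       MIN(number)   AS Smallest,  -- cannot fall below @Mean - 3 * @StandardDeviation
       MAX(number)   AS Largest    -- cannot exceed @Mean + 3 * @StandardDeviation
FROM #RandomNumbers;
```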

We now do a very simple query to show the distribution. We will expect to see a bell-jar. We all do this just to get a rough idea of the distribution of the data.

SELECT REPLICATE(N'█',COUNT(*)/1000)

FROM #RandomNumbers

GROUP BY CONVERT(INT,FLOOR(number))/5

ORDER BY CONVERT(INT,FLOOR(number))/5

…giving this in the SSMS results pane:

█

██

████

█████

███████

████████

█████████

████████

███████

█████

████

██

█

This bell jar is stuck to the wall. Of course it took me a short while to work out the number of pockets to use and the best divisor for the number of block characters to use for each bar. There is another problem. If there is no data in one of the pockets, it doesn’t show up, because the GROUP BY does not know of the existence of the pocket.
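One way around the missing-pocket problem is to list the pockets explicitly and LEFT JOIN the data to them, so that an empty pocket still produces a row. This is only a sketch of the idea: the table variable @Sample, its values, and the pocket width of 5 are made up for the illustration.

```sql
-- Hypothetical sample with a deliberate gap: there is nothing between 10 and 20
DECLARE @Sample TABLE (number FLOAT NOT NULL);

INSERT INTO @Sample (number)
VALUES (1.2), (3.4), (4.9), (6.0), (8.5), (21.0), (22.2), (27.5);

-- The pockets come first and the data is joined to them,
-- so a pocket with no rows still appears, with a count of zero.
SELECT p.PocketStart,
       COUNT(s.number)                  AS Frequency, -- counts only the matched rows
       REPLICATE(N'█', COUNT(s.number)) AS Bar
FROM (VALUES (0), (5), (10), (15), (20), (25)) AS p (PocketStart)
LEFT OUTER JOIN @Sample AS s
    ON s.number >= p.PocketStart
   AND s.number <  p.PocketStart + 5
GROUP BY p.PocketStart
ORDER BY p.PocketStart;
```

The pockets starting at 10 and 15 come back with a frequency of zero and an empty bar, instead of silently disappearing from the result.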

You could throw up your hands and paste the data into Excel, use R, or export the data into a file and use Gnuplot. With the data sizes we are working with, that soon becomes less practical. Also, you often need a quick answer in whatever application you are using to access the data, such as SSMS.

SQL Server actually maintains distribution histograms for indexes and columns. They are, however, specialised for their purpose of predicting the number of results that will be returned from a query. The query optimizer doesn’t maintain fixed ranges or pockets, but adjusts the individual histogram steps to minimize the number of steps in the histogram, up to 200 of them, whilst maximizing the difference between the boundary values. In short, they are useless for our immediate purpose, because they provide only an estimate, and it requires quite a lot of calculation to get a value for a range.
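If you want to see what these statistics histograms look like, the following sketch displays one. It assumes the AdventureWorks2012 sample database and its index IX_Person_LastName_FirstName_MiddleName on Person.Person. The second query relies on sys.dm_db_stats_histogram, which exists only in more recent versions (SQL Server 2016 SP1 CU2 onwards, as far as I know).

```sql
-- Assumes the AdventureWorks2012 sample database is installed
USE AdventureWorks2012;

-- The classic way: one row per histogram step, with uneven step boundaries
DBCC SHOW_STATISTICS ('Person.Person', IX_Person_LastName_FirstName_MiddleName)
    WITH HISTOGRAM;

-- On versions that have sys.dm_db_stats_histogram, the same steps can be queried as a table
SELECT h.step_number,
       h.range_high_key,      -- upper boundary value of the step
       h.range_rows,          -- estimated rows inside the step, excluding the boundary
       h.equal_rows,          -- estimated rows equal to the boundary value
       h.distinct_range_rows,
       h.average_range_rows
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) AS h
WHERE s.object_id = OBJECT_ID('Person.Person')
  AND s.name = 'IX_Person_LastName_FirstName_MiddleName'
ORDER BY h.step_number;
```

The steps are of unequal width, which is why they cannot simply be plotted as the kind of histogram we want here.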

Here is a rather more robust SQL technique that takes the independent variable and converts it in such a way that it can be represented within a matrix of forty (x) by ten (y) rectangles, drawn with the space, block and half-block characters. Here it is plotting the histogram for our random data.

-- declare all the necessary local variables.

DECLARE @variable TABLE(value INT);

DECLARE @Maximum NUMERIC(18,2), @Minimum NUMERIC(18,2) , @MaxIndividualCount INT;

SELECT @Maximum = MAX(number), @Minimum = MIN(number) FROM #RandomNumbers;

--we convert the values to integer pocket numbers from 1 to 40

INSERT INTO @variable (value)

SELECT CASE WHEN number >= @Maximum THEN 40 --the largest value belongs in the last pocket

ELSE CONVERT(INT, (number - @Minimum) / (@Maximum - @Minimum) * 40.00) + 1 END

FROM #RandomNumbers;

-- 40 represents the number of ranges. We number the pockets from 1 to 40

-- so that they match the x values used in the plotting query

-- we need to know the value of the largest pocket (the highest frequency)

SELECT @MaxIndividualCount = MAX(f.IndividualCount)

FROM

(SELECT COUNT(*) AS IndividualCount

FROM @variable

GROUP BY [@variable].value

) f;

SELECT CASE WHEN y=10 THEN CONVERT(CHAR(8),@MaxIndividualCount)+N'│' --top row: label it with the highest frequency

WHEN y=3 THEN CONVERT(CHAR(8),@Minimum)+N'│' --the bottom three rows carry the range of values: the minimum,

WHEN y=2 THEN N' …to…   │' --a separator,

WHEN y=1 THEN CONVERT(CHAR(8),@Maximum)+N'│' --and the maximum

ELSE N'        │' END+ --all other rows get a blank label of the same width

--SELECT ' │'+

MAX(CASE WHEN x=1 AND filled=1 THEN N'█' WHEN x=1 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=2 AND filled=1 THEN N'█' WHEN x=2 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=3 AND filled=1 THEN N'█' WHEN x=3 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=4 AND filled=1 THEN N'█' WHEN x=4 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=5 AND filled=1 THEN N'█' WHEN x=5 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=6 AND filled=1 THEN N'█' WHEN x=6 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=7 AND filled=1 THEN N'█' WHEN x=7 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=8 AND filled=1 THEN N'█' WHEN x=8 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=9 AND filled=1 THEN N'█' WHEN x=9 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=10 AND filled=1 THEN N'█' WHEN x=10 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=11 AND filled=1 THEN N'█' WHEN x=11 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=12 AND filled=1 THEN N'█' WHEN x=12 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=13 AND filled=1 THEN N'█' WHEN x=13 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=14 AND filled=1 THEN N'█' WHEN x=14 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=15 AND filled=1 THEN N'█' WHEN x=15 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=16 AND filled=1 THEN N'█' WHEN x=16 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=17 AND filled=1 THEN N'█' WHEN x=17 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=18 AND filled=1 THEN N'█' WHEN x=18 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=19 AND filled=1 THEN N'█' WHEN x=19 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=20 AND filled=1 THEN N'█' WHEN x=20 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=21 AND filled=1 THEN N'█' WHEN x=21 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=22 AND filled=1 THEN N'█' WHEN x=22 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=23 AND filled=1 THEN N'█' WHEN x=23 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=24 AND filled=1 THEN N'█' WHEN x=24 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=25 AND filled=1 THEN N'█' WHEN x=25 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=26 AND filled=1 THEN N'█' WHEN x=26 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=27 AND filled=1 THEN N'█' WHEN x=27 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=28 AND filled=1 THEN N'█' WHEN x=28 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=29 AND filled=1 THEN N'█' WHEN x=29 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=30 AND filled=1 THEN N'█' WHEN x=30 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=31 AND filled=1 THEN N'█' WHEN x=31 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=32 AND filled=1 THEN N'█' WHEN x=32 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=33 AND filled=1 THEN N'█' WHEN x=33 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=34 AND filled=1 THEN N'█' WHEN x=34 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=35 AND filled=1 THEN N'█' WHEN x=35 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=36 AND filled=1 THEN N'█' WHEN x=36 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=37 AND filled=1 THEN N'█' WHEN x=37 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=38 AND filled=1 THEN N'█' WHEN x=38 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=39 AND filled=1 THEN N'█' WHEN x=39 AND filled=2 THEN N'▄' ELSE N' ' END)+

MAX(CASE WHEN x=40 AND filled=1 THEN N'█' WHEN x=40 AND filled=2 THEN N'▄' ELSE N' ' END)

FROM

(

SELECT x,y, --we work out whether the cell has data or not (1-filled, 2 means around half)

CASE WHEN f.frequency*10.00/@MaxIndividualCount>CONVERT(NUMERIC(8,2),y) THEN 1

WHEN f.frequency*10.00/@MaxIndividualCount>CONVERT(NUMERIC(8,2),y-0.5) THEN 2

ELSE 0 END AS filled

FROM

(VALUES ( 1), (2), (3), (4), (5), (6), (7), (8), (9), (10))OneToTen(y)--the Y axis (inverted!)

CROSS JOIN

(SELECT OneToForty.value AS x, SUM(CASE WHEN g.value IS NOT NULL THEN 1 ELSE 0 END )AS frequency

FROM

(VALUES --the range of each 'pocket'

(1), (2), (3), (4), (5), (6), (7), (8), (9), (10),( 11), (12), (13), (14), (15), (16), (17), (18), (19), (20),

(21), (22), (23), (24), (25), (26), (27), (28), (29), (30),( 31), (32), (33), (34), (35), (36), (37), (38), (39), (40)

)OneToForty(value)

LEFT OUTER JOIN

@variable g

ON g.value=OneToForty.value

GROUP BY OneToForty.value)f

) cells --close the derived table of cells and give it the alias that T-SQL requires

GROUP BY y

ORDER BY y DESC;
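The rule that decides whether a cell is full, half-full or empty is easier to follow on a small scale. The sketch below is not the routine above but a cut-down variant: it starts from a made-up table of frequencies (@Frequency is hypothetical) and, instead of forty MAX(CASE…) expressions, uses STRING_AGG to assemble each row, which needs SQL Server 2017 or later.

```sql
-- Hypothetical frequencies for eight pockets (x is the pocket number)
DECLARE @Frequency TABLE (x INT NOT NULL PRIMARY KEY, frequency INT NOT NULL);

INSERT INTO @Frequency (x, frequency)
VALUES (1, 2), (2, 9), (3, 25), (4, 40), (5, 38), (6, 21), (7, 7), (8, 1);

DECLARE @MaxFrequency INT = (SELECT MAX(frequency) FROM @Frequency);

-- Each frequency is scaled to a height between 0 and 10. For the cell in row y:
--   a height above y       gives a full block,
--   a height above y - 0.5 gives a half block,
--   anything lower leaves the cell empty.
SELECT r.y,
       STRING_AGG(
           CASE WHEN f.frequency * 10.0 / @MaxFrequency > r.y       THEN N'█'
                WHEN f.frequency * 10.0 / @MaxFrequency > r.y - 0.5 THEN N'▄'
                ELSE N' ' END,
           N'') WITHIN GROUP (ORDER BY f.x) AS line
FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) AS r (y)
CROSS JOIN @Frequency AS f
GROUP BY r.y
ORDER BY r.y DESC;
```

Because the comparison is strictly greater-than, the tallest pocket scales to exactly 10 and is drawn with a half block on the top row, which is why the histograms in this article are capped with half blocks.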

Typically, if your data is normally distributed, you’ll get histograms like this appearing in the results pane:

5600 │ ▄▄▄▄

│ ▄▄██████▄

│ ▄██████████▄

│ ▄████████████▄

│ ▄██████████████▄

│ █████████████████▄

│ ▄████████████████████▄

206.91 │ ▄██████████████████████▄

…to… │ ▄██████████████████████████▄

792.16 │ ▄▄██████████████████████████████▄▄

But here is the distribution of the first characters of the surnames of the AdventureWorks customers, plotted by their ASCII codes from 65 (A) to 90 (Z):

2130 │ ▄

│ ██

│ ▄ █ ██

│ ▄█ █▄ █ ▄ ██

│ ██ ██ ██ █ ██ █

│███ ██ ██ █ ██ █

65.00 │███▄ ██ █ ██ █ ███ █

…to… │████ ▄██ ████▄ █ ███ █ ▄█

90.00 │████▄███ █████▄█ ███ █████

You can get a very skewed distribution:

15642 │ ▄

│ ███

│ ███▄

│ ████

│ █████

│ █████▄

│███████▄

0.11 │████████▄

…to… │█████████▄

690.66 │███████████▄▄

…or one that is boringly flat: evenly distributed.

2613 │ ▄▄ ▄▄▄▄▄▄▄▄ ▄ ▄▄ ▄▄ ▄▄ ▄▄▄ ▄ ▄▄ ▄ ▄▄▄▄

│████████████████████████████████████████

0.01 │████████████████████████████████████████

…to… │████████████████████████████████████████

1199.96 │████████████████████████████████████████

And you can get anything in between!

2124 │ ▄ ▄▄▄

│ █▄██▄███

│ ▄▄▄▄████████

│ ████████████

│ █████████████

38532.00 │ █████████████

…to… │ ▄ ▄▄▄▄▄▄▄▄██████████████

39658.00 │ █▄▄████████████████████████████████████

To do all this, we have created a function that does the histogram and now we are simply passing the independent variable to the graphing routine.

DECLARE @Variable IndependentVariable; --a simple table of numbers

DECLARE @Mean FLOAT; --the mean (and median if normally distributed)

DECLARE @StandardDeviation FLOAT; --(the Standard deviation you want)

DECLARE @ii INT; --counter

DECLARE @TotalWanted INT; --the number of rows to generate

SELECT @Mean = 500, @StandardDeviation = 100, @ii = 1, @TotalWanted = 100000;

SET NOCOUNT ON;

/* now we generate the numbers we want, with the mean and standard deviation we want */

WHILE @ii <= @TotalWanted

BEGIN

INSERT INTO @Variable

(number)

SELECT ((RAND() * 2 - 1) + (RAND() * 2 - 1) + (RAND() * 2 - 1))

* @StandardDeviation + @Mean;

SELECT @ii = @ii + 1;

END;

SELECT ColumnHistogram.line AS 'Normally Distributed Variable '

FROM dbo.ColumnHistogram(@Variable)

ORDER BY ColumnHistogram.y DESC;

DECLARE @SecondVariable IndependentVariable; --a simple table of numbers

INSERT INTO @SecondVariable

(number)

SELECT ASCII(SUBSTRING(Person.LastName, 1, 1))

FROM AdventureWorks2012.Person.Person;

SELECT ColumnHistogram.line AS 'Customers Surnames'

FROM dbo.ColumnHistogram(@SecondVariable)

ORDER BY ColumnHistogram.y DESC;

DECLARE @ThirdVariable IndependentVariable; --a simple table of numbers

INSERT INTO @ThirdVariable

(number)

SELECT CONVERT(FLOAT, SalesOrderHeader.OrderDate)

FROM AdventureWorks2012.Sales.SalesOrderHeader;

SELECT ColumnHistogram.line AS 'No. of orders per date period'

FROM dbo.ColumnHistogram(@ThirdVariable)

ORDER BY ColumnHistogram.y DESC;

The final version is a table-valued function that takes a table-valued parameter as its input. This is a table with a single column consisting of a float. It returns a table that has to be ordered in the right way according to the Y value. Each line represents a row of the histogram in Unicode characters (only three are used for the actual histogram). The histogram looks better if you put the title in the alias for the Line column.
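The mechanics of passing a table to a function may be unfamiliar, so here is a minimal sketch of them. It is not the author's dbo.ColumnHistogram. Both dbo.DemoIndependentVariable and dbo.SimpleColumnBars are hypothetical names, and the type definition is only a guess at the shape of IndependentVariable, inferred from the way the calling code inserts into a column called number.

```sql
-- Hypothetical table type: a single FLOAT column called number
CREATE TYPE dbo.DemoIndependentVariable AS TABLE (number FLOAT);
GO

-- Hypothetical stand-in for the real function. It only shows how a table-valued
-- function receives a table-valued parameter (which must be READONLY)
-- and returns one line of characters per y value.
CREATE FUNCTION dbo.SimpleColumnBars (@Variable dbo.DemoIndependentVariable READONLY)
RETURNS TABLE
AS
RETURN
    (SELECT b.Pocket AS y,
            REPLICATE(N'█', COUNT(*)) AS line
     FROM @Variable AS v
     CROSS APPLY (SELECT FLOOR(v.number) AS Pocket) AS b
     GROUP BY b.Pocket);
GO

DECLARE @Demo dbo.DemoIndependentVariable;

INSERT INTO @Demo (number)
VALUES (1.1), (1.7), (2.2), (2.4), (2.9), (3.5);

-- The caller, not the function, decides the order of the lines
SELECT line
FROM dbo.SimpleColumnBars(@Demo)
ORDER BY y;
```

A function cannot guarantee the order of the rows it returns, which is why the caller has to supply the ORDER BY on the y column.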

The histogram must be viewed in a monospaced font. In SSMS, this is likely to be the case unless you view it in ‘grid’ mode, so you must have ‘results to text’ selected.

The current version is here in GitHub. A version is attached to the article but we don’t always succeed in keeping that up-to-date.


About the author

Phil Factor
