Statistics in SQL: The Kruskal–Wallis Test

Before you report your conclusions about your data, have you checked whether your 'actionable' figures occurred by chance? The Kruskal–Wallis test is a safe way of determining whether samples come from the same population, because it is simple and doesn't rely on the population being normally distributed. It gives you a measure of confidence that your results are 'significant'. Phil Factor explains how to do it.

The series so far:

  1. Statistics in SQL: Pearson’s Correlation
  2. Statistics in SQL: Kendall’s Tau Rank Correlation
  3. Statistics in SQL: Simple Linear Regressions
  4. Statistics in SQL: The Kruskal–Wallis Test
  5. Statistics in SQL: The Mann–Whitney U Test
  6. Statistics in SQL: Student's T Test

A lot of things in life happen almost entirely by chance. You have to make a judgement whether the report you are creating shows sheer chance or a real difference. If you have a test that gives you a reasonable estimate of the likelihood that your results happened by chance, then that will add a great deal of confidence to your judgement. By convention, we claim a measure of confidence in our results if the probability of their happening by chance alone drops below the 5% level (p < 0.05), the so-called ‘significance level’.

However, a lot can go wrong, and a lot of research results have to be quietly forgotten because the researchers have miscalculated the chance that their results could have happened by luck. In science, you are obliged to disprove the null hypothesis, which is that your results occurred entirely by chance. If you ‘fish’ for significant results, the criteria for rejecting the null hypothesis get much more stringent. It is easy to forget, as well, that the 5% level (p < 0.05) is arbitrary, and many real-life events are far less probable than this! Some researchers find the significance of results a difficult concept to grasp!

Generally, we rely on the normal distribution of the data to base our calculations on, but some data just isn’t normally distributed, or isn’t even a continuous variable. This can make life harder. However, for correlations, we have non-parametric tests such as Kendall’s Tau. For determining whether samples come from the same population distribution, we have the Mann–Whitney U test and, if there are more than two samples, the Kruskal–Wallis one-way analysis of variance on ranks.

The calculation first ranks all the observations across the combined groups and then compares the rank sums of the groups, leading to the value of the test statistic H, as shown below.
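
For reference, the standard form of the statistic (ignoring the small correction factor that is applied when there are many tied ranks) is:

```latex
H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)
```

where N is the total number of observations, k is the number of groups, n_i is the number of observations in group i, and R_i is the sum of the ranks of group i; tied values each receive the average of the ranks they occupy.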

If the null hypothesis is true, the probability of getting a value of H at least this large by chance is the p value corresponding to a chi-squared value equal to H. When you look up the chi-squared probability, you’ll need to know the degrees of freedom, which is the number of groups minus one: two, for the three groups in the example below.

If the sample sizes are too small, H does not follow a chi-squared distribution very well, and you need to be very cautious in your conclusion!

In our example, we have three groups of patients, one of which had the experimental treatment, another had a placebo, and the third had no treatment.
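
The SQL below is a minimal sketch of one way of getting H for data of this kind. The #TreatmentResults temporary table, its columns and the handful of scores in it are all made up here purely for illustration; substitute your own table and measurements. Tied scores are given the average (‘mid’) rank of the positions they occupy, which is what the test requires.

```sql
-- Hypothetical sample data: one row per patient, with the group and a score
CREATE TABLE #TreatmentResults
  (Treatment VARCHAR(20) NOT NULL, Score NUMERIC(10, 2) NOT NULL);

INSERT INTO #TreatmentResults (Treatment, Score)
VALUES ('Experimental', 23.1), ('Experimental', 25.4), ('Experimental', 28.0),
       ('Placebo', 19.7), ('Placebo', 22.3), ('Placebo', 21.0),
       ('None', 18.2), ('None', 17.9), ('None', 20.5);

WITH Ranked AS
  (SELECT Treatment,
          -- mid-rank: tied scores share the average of the ranks they cover
          RANK() OVER (ORDER BY Score)
            + (COUNT(*) OVER (PARTITION BY Score) - 1) / 2.0 AS TheRank
     FROM #TreatmentResults),
     GroupSums AS
  (SELECT Treatment, SUM(TheRank) AS RankSum, COUNT(*) AS GroupSize
     FROM Ranked
    GROUP BY Treatment)
SELECT (12.0 / (Totals.N * (Totals.N + 1)))
         * SUM(RankSum * RankSum / GroupSize)
         - 3 * (Totals.N + 1) AS H
  FROM GroupSums
       CROSS JOIN (SELECT COUNT(*) AS N FROM #TreatmentResults) AS Totals
 GROUP BY Totals.N;
```

If a large proportion of the scores are tied, H should additionally be divided by the usual tie-correction factor, which this sketch omits for brevity.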

The calculation of the actual probability is pointless as well as difficult, because we only need to know whether it meets the criterion level at which we can dismiss the null hypothesis. For this reason, we don’t need the full table of chi-squared probabilities, just the critical values for the three conventional significance levels.
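
As an illustration of that comparison, the standard chi-squared critical values for two degrees of freedom (5.991, 9.210 and 13.816 at p = 0.05, 0.01 and 0.001 respectively) can be held in a small list and checked against the value of H calculated above. The @H variable here is just a placeholder for that result.

```sql
-- @H is assumed to hold the statistic returned by the previous query
DECLARE @H NUMERIC(10, 4) = 6.489; -- replace with your calculated H

SELECT Significance,
       CASE WHEN @H >= CriticalValue
            THEN 'reject the null hypothesis'
            ELSE 'cannot reject the null hypothesis'
       END AS Verdict
  FROM (VALUES ('p < 0.05',  5.991),
               ('p < 0.01',  9.210),
               ('p < 0.001', 13.816)
       ) AS CriticalValues (Significance, CriticalValue);
```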

In this case, we can dismiss the idea that the differences between our treatment groups occurred by chance, but we don’t know which of our three groups was different, or ‘stochastically dominant’. We can find that out later by testing pairs of groups with Dunn’s test or the Mann–Whitney U test.