# The Puzzle of ‘Rating Decomposition’

When reading rating information, how do you you knew how many points each separate voter gave if you only know the average rating and the number of votes? Well, you might be surprised to learn that you can figure it out using SQL

When you are planning to buy a camera or to read an article, you will always face a dilemma: how will you find the best product from the available products on the market or what article should you pick for reading from a variety of different electronic journals?

Usually, in order to make an informed choice, people start looking for additional information about the product, such as consumer reviews or ratings. This strategy works pretty well and gives the person an idea about the product’s quality. However, the average rating can be ambiguous and can indicate slightly different things.

For instance, if a camera has a rating of 4.15 out of 5 and has a sufficient number of votes, you might suppose that most people think that the quality of the camera is above average, but not excellent. However, it might be that most people gave camera the highest rating, but a few people found the location of the control buttons inconvenient and too dissimilar to their old cameras and gave the camera a very low mark, so dragging down the average rating.

An average rating on its own does not give you the full information about the product. If you knew how many points each separate voter gave to the camera, the information could be more precise and your decision of what type to buy could be much easier.

But, how do you get that information, if you only know the average rating and the number of votes? Well, you might be surprised to learn that you can figure it out using SQL.

Before we start, create and load an auxiliary table that stores only sequential numbers (see Listing1):

Listing1. Create and Load an Auxiliary Table

SET NOCOUNT ON;
DECLARE    @i INT, @maxi INT;
SELECT @i = 0, @maxi = 1000;
IF EXISTS(SELECT * FROM sysobjects
WHERE ID = (OBJECT_ID(‘dbo.sequence’)) AND xtype = ‘U’)
DROP TABLE dbo.sequence;
CREATE TABLE dbo.sequence(num INT NOT NULL);
WHILE(@i <= @maxi)
BEGIN
INSERT INTO dbo.sequence VALUES(@i);
SET @i = @i + 1;
END

#### Step1: Build Subsets of Possible Marks

We need to build a framework that will help us to find and examine the possible readers’ votes. Our first step will be to find the subsets of possible marks

Let’s start with the number of votes, which on its own can give you a lot of information.

For example, if 15 readers voted for the article, then any of the following predicates can be true:

• 0 readers rated the article 1 out of 5;
• 1 reader rated the article 1 out of 5;
• 2 readers rated the article 1 out of 5;
• . . . . . . . . . . . . . . . . . ;
• 15 readers rated the article 1 out 5;
• 0 readers rated the article 2 out of 5;
• 1 reader rated the article 2 out of 5;
• 2 readers rated the article 2 out of 5;
• . . . . . . . . . . . . . . . . . ;
• 15 readers rated the article 2 out of 5;
• 0 readers rated the article 3 out of 5;
• . . . . . . . . . . . . . . . . . ;

And so on.

If you translate this logic into T-SQL, you will get five subsets of possible marks. The code in Listing2 shows haw you can build the subsets;

Listing2. Build Subsets of Possible Marks

IF EXISTS(SELECT * FROM sysobjects
WHERE ID = (OBJECT_ID(‘spu_createVoteTables’)) AND xtype = ‘P’)
DROP PROCEDURE spu_createVoteTables;
GO
CREATE PROCEDURE spu_createVoteTables
AS
— create and load table for OneS
IF EXISTS(SELECT * FROM sysobjects
WHERE ID = (OBJECT_ID(‘dbo.OneS’)) AND xtype = ‘U’)
DROP TABLE dbo.OneS;
CREATE TABLE dbo.OneS
(point INT NOT NULL,
sumPoints dec(8,4) NOT NULL);

INSERT INTO dbo.OneS
SELECT 1, num, 1*num FROM sequence WHERE num <= @numVotes;
SELECT * FROM dbo.OneS;

— create and load table for TwoS
IF EXISTS(SELECT * FROM sysobjects
WHERE ID = (OBJECT_ID(‘dbo.TwoS’)) AND xtype = ‘U’)
DROP TABLE dbo.TwoS;
CREATE TABLE dbo.TwoS
(point INT NOT NULL,
sumPoints dec(8,4) NOT NULL);

INSERT INTO dbo.TwoS
SELECT 2, num, 2*num FROM sequence WHERE num <= @numVotes;
SELECT * FROM dbo.TwoS

— create and load table for ThreeS
IF EXISTS(SELECT * FROM sysobjects
WHERE ID = (OBJECT_ID(‘dbo.ThreeS’))
AND xtype = ‘U’)
DROP TABLE dbo.ThreeS;
CREATE TABLE dbo.ThreeS
(point INT NOT NULL,
sumPoints dec(8,4) NOT NULL);

INSERT INTO dbo.ThreeS
SELECT 3, num, 3*num FROM sequence WHERE num <= @numVotes;
SELECT * FROM dbo.ThreeS

— create and load table for FourS
IF EXISTS(SELECT * FROM sysobjects
WHERE ID = (OBJECT_ID(‘dbo.FourS’)) AND xtype = ‘U’)
DROP TABLE dbo.FourS;
CREATE TABLE dbo.FourS
(point INT NOT NULL,
sumPoints dec(8,4) NOT NULL);

INSERT INTO dbo.FourS
SELECT 4, num, 4*num FROM sequence WHERE num <= @numVotes;
SELECT * FROM dbo.FourS

— create and load table for FiveS
IF EXISTS(SELECT * FROM sysobjects
WHERE ID = (OBJECT_ID(‘dbo.FiveS’)) AND xtype = ‘U’)
DROP TABLE dbo.FiveS;
CREATE TABLE dbo.FiveS
(point INT NOT NULL,
sumPoints dec(8,4) NOT NULL);

INSERT INTO dbo.FiveS
SELECT 5, num, 5*num FROM sequence WHERE num <= @numVotes;
SELECT * FROM dbo.FiveS
GO

The results look like this:

———– ———– ————-
1           0           0.0000
1           1           1.0000
1           2           2.0000
.  .  .  .  .  .  .  .  .  .  .
1           15          15.0000

———– ———– ————-
2           0           0.0000
.  .  .  .  .  .  .  .  .  .  .
2           15          30.0000

———– ———– ————-
3           0           0.0000
.  .  .  .  .  .  .  .  .  .  .
3           15          45.0000

———– ———– ————-
4           0           0.0000
4           1           4.0000
.  .  .  .  .  .  .  .  .  .  .
4           15          60.0000

———– ———– ————-
5           0           0.0000
5           1           5.0000
.  .  .  .  .  .  .  .  .  .  .
5           14          70.0000
5           15          75.0000

Notice that the number of generated sets (tables) is equal to five and the number of rows in each table is equal to 15.

In general, the number of sets (tables) is equal to a base of the voting system (for the base-5 voting system, the number of tables will be five). The number of rows in each table is equal to the number of votes.

Since each magazine, electronic journal or consumers’ website has its own voting system and that system can change very rarely, the number of the tables will be a static component of our framework. Each time when we analyze the new rating, we will reload these tables. Each time the number of rows in the tables will be equal to the number of votes that produced the rating.

Throughout the rest of this article, we will consider the base-5 voting system only.

#### Step2: Find Possible Total Points

Generally, calculating averages is a very simple task. All you need to do is to find the sum of all attends and then divide that total by the number of attends. The result of division then needs to be rounded to a whole number or to one (two) digits after the decimal point.

Our task, however, is different. We will try the inverse operation: calculating the possible sum total(s) for a known average and known number of attends.

Consider the next example:

You are going to read an article that has a rating of 4 when the number of votes is 15. Therefore, the total score will be 4*15=60. However, 60 is not the only possible total. Taking any number between 53 and 67, dividing by 15, and then rounding the result to a whole number, will give an average of 4.

Therefore, the inverse operation brings multiple results and, as you will see later, depends on the precision that was used to calculate the average.

The following script (Listing3) shows how to find all of the possible totals for the given rating and number of votes.

Listing3. Find Possible Total Points

IF EXISTS(SELECT * FROM sysobjects
WHERE ID = (OBJECT_ID(‘spu_findPossibleTotals’)) AND xtype = ‘P’)
DROP PROCEDURE spu_findPossibleTotals;
GO
CREATE PROCEDURE spu_findPossibleTotals
@rating dec(8,4),
@precision INT
AS
SET NOCOUNT ON;
DECLARE @calcPoints dec(8,4), @calcRating dec(8,4);

IF EXISTS(SELECT * FROM sysobjects
WHERE ID = (OBJECT_ID(‘possiblePoints’)) AND xtype = ‘U’)
DROP TABLE possiblePoints;
CREATE TABLE possiblePoints
(TotalPoints dec(8,4) NOT NULL,
calcRaiting dec(8,4) NOT NULL,
roundedRating dec(8,4) NOT NULL)

SELECT @calcPoints = ROUND(@rating * @numVotes, 0);
SELECT @calcRating = @rating
WHILE(ROUND(@calcRating,@precision) = @rating)
BEGIN
INSERT INTO possiblePoints
VALUES(@calcPoints, @calcRating, ROUND(@calcRating,@precision));
SELECT @calcPoints = @calcPoints + 1;
END

SELECT @calcPoints = MIN(TotalPoints) – 1 FROM possiblePoints
WHILE(ROUND(@calcRating,@precision) = @rating)
BEGIN
INSERT INTO possiblePoints
VALUES(@calcPoints, @calcRating, ROUND(@calcRating,@precision));
SELECT @calcPoints = @calcPoints – 1;
END
SELECT * FROM possiblePoints ORDER BY TotalPoints
GO

If you test the procedure for @rating = 4, @numVotes = 15 and @precision = 0, where precision is the number of digits after the decimal point, you will get the next result:

EXEC spu_findPossibleTotals @rating = 4, @numVotes = 15, @precision = 0

TotalPoints     calcRaiting       roundedRating
———–     ———–       ————-
53.0000         3.5333            4.0000
54.0000         3.6000            4.0000
55.0000         3.6667            4.0000
56.0000         3.7333            4.0000
57.0000         3.8000            4.0000
58.0000         3.8667            4.0000
59.0000         3.9333            4.0000
60.0000         4.0000            4.0000
61.0000         4.0667            4.0000
62.0000         4.1333            4.0000
63.0000         4.2000            4.0000
64.0000         4.2667            4.0000
65.0000         4.3333            4.0000
66.0000         4.4000            4.0000
67.0000         4.4667            4.0000

When you change the precision from 0 to 1, you will get only one possible total:

EXEC spu_findPossibleTotals @rating = 4, @numVotes = 15, @precision = 1

TotalPoints     calcRaiting       roundedRating
———–     ———–       ————-
60.0000         4.0000            4.0000

You can test the correctness of the last result very easily:

Add 1 to the result, divide the new sum by 15 and then round the quotient to one digit after the decimal point: (60 + 1) / 15 = 4.067 = 4.1. Since 4.1 is greater than the given average of 4.0, the sum of all of the points cannot be 61.

Similarly, (60-1) / 15 = 3.93 = 3.9, which is less than the given average 4.0 and then 59 cannot be a possible total as well.

You will see later that the precision of the average (the number of digits after a decimal point), can significantly increase or decrease the entropy of the end result.

#### Step3: Remove Invalid Data

We have now built the framework that we will use to analyze the readers’ votes. However, before we start the actual processing, we need to remove some invalid data from tables OneS, TwoS, ThreeS, FourS and FiveS.

The following example will explain the logic used to find incorrect data.

Suppose you are going to read an article with an average rating of 4 and a precision of 0, when the number of votes is 15.

Run the procedures that we had previously built (see Lisitng2 and Lisitng3):

GO
EXEC spu_findPossibleTotals @rating = 4, @numVotes = 15, @precision = 0
GO

The first procedure loads data into the tables OneS, TwoS, ThreeS, FourS, FiveS.

The second procedure finds all possible totals. For the given rating and the number of votes, they will be in the range from 53 to 67, inclusively.

Now, analyze the data in table OneS. If all 15 readers give article 1 point, then the sum of all marks would be 15 * 1 = 15. Since 15 is smaller than the minimal possible total (which is 53), the situation, when each voter supposedly gave the article 1 point, is incorrect. Therefore, the row where the sumPoints column is equal to 15 can be deleted from table OneS.

Similarly, if 14 readers give the article 1 point each, and the 15th reader gives article the highest mark 5, the total would be 14 + 5 = 19, which is still less than minimum possible total. So, the row where the sumPoints column is equal to 14, can also be deleted from table OneS.

Apply the same logic to the rest of the rows of table OneS and continue with tables TwoS, ThreeS, FourS and FiveS. The number of the rows in all of the tables will be reduced, in some cases significantly.

Now consider a slightly different logic. Consider the case where all 15 readers give the article 5 points each (table FiveS). The sum of all of the points would be 15 * 5 = 75, which is greater than the maximum possible total (which is 67). Therefore, the row, where the sumPoints column is equal to 75, should be deleted from the table FiveS.

If 14 readers give the article 5 points each, the total will be 14 * 5 = 70. The 15th reader can give the article the lowest mark, but this will not change anything: 70 or 71 is greater than maximum possible total (67). The row, where the sumPoints column is equal to 70 can be deleted from the table FiveS.

However, the case where 13 readers give article 5 points each is valid, because 13 * 5 = 65. If two more readers rate the article 1 out of 5 the total would be 13 * 5 + 2 * 1 = 67, which is a valid total. Therefore, the row, where the sumPoints column is equal to 65, should stay in table FiveS.

Applying the same logic, you will remove some more rows from tables OneS, TwoS, ThreeS and FourS.

Listing4 shows how you could implement the logic that was just explained:

Listing4. Remove Invalid Data

IF EXISTS(SELECT * FROM sysobjects
WHERE ID = (OBJECT_ID(‘spu_removeInvalidData’)) AND xtype = ‘P’)
DROP PROCEDURE spu_removeInvalidData;
GO
CREATE PROCEDURE spu_removeInvalidData
AS
DECLARE @minTotalPoints dec(8,4), @maxTotalPoints dec(8,4);
IF EXISTS (SELECT TOP 1 * FROM possiblePoints)
BEGIN
SELECT @minTotalPoints = MIN(totalPoints),
@maxTotalPoints = MAX(totalPoints)
FROM possiblePoints;
END
ELSE
BEGIN
PRINT ‘There is no data in possiblePoints table.’
RETURN
END

— Remove invalid data from table OneS
DELETE OneS
OR
SELECT * FROM Ones;

— Remove invalid data from table TwoS
DELETE TwoS
OR
SELECT * FROM TwoS;

— Remove invalid data from table ThreeS
DELETE ThreeS
OR
SELECT * FROM ThreeS;

— Remove invalid data from table FourS
DELETE FourS
OR
SELECT * FROM FourS;

— Remove invalid data from table FiveS
DELETE FiveS
OR
SELECT * FROM FiveS;
GO

Results:

———– ———– ———-
1           0           0.0000
1           1           1.0000
.  .  .  .  .  .  .  .  .  .  .
1           4           4.0000
1           5           5.0000

———– ———– ———-
2           0           0.0000
2           1           2.0000
.  .  .  .  .  .  .  .  .  .  .
2           6           12.0000
2           7           14.0000

———– ———– ———-
3           0           0.0000
3           1           3.0000
.  .  .  .  .  .  .  .  .  .  .
3           10          30.0000
3           11          33.0000

———– ———– ———-
4           1           4.0000
4           2           8.0000
.  .  .  .  .  .  .  .  .  .  .
4           14          56.0000
4           15          60.0000

———– ———– ———-
5           0           0.0000
5           1           5.0000
.  .  .  .  .  .  .  .  .  .  .
5           12          60.0000
5           13          65.0000

#### Solution: How the Readers Voted

In order to find out how the readers voted, we now:

1. Join tables OneS, TwoS, ThreeS, FourS and FiveS, using the Cross (or Cartesian) join.
2. Apply the following predicates to each intermediate join:
• The summary of the votes in the join cannot exceed the given number of votes;
• The summary of the points in the join cannot be greater than the maximum total points.
3. Include into the result (the outmost join) only the combinations, where the summary of the votes equals to the given number of votes.

Running the script shown in Listing5 will show you how the readers voted in our example:

Listing5. Get Final Result

IF EXISTS(SELECT * FROM sysobjects
WHERE ID = (OBJECT_ID(‘spu_getFinalResult’)) AND xtype = ‘P’)
DROP PROCEDURE spu_getFinalResult;
GO
CREATE PROCEDURE spu_getFinalResult
AS
DECLARE @minTotalPoints float, @maxTotalPoints float;
IF EXISTS (SELECT TOP 1 * FROM possiblePoints)
BEGIN
SELECT @minTotalPoints = MIN(totalPoints),
@maxTotalPoints = MAX(totalPoints)
FROM possiblePoints;
END
ELSE
BEGIN
PRINT ‘There is no data in possiblePoints table.’
RETURN
END
IF EXISTS(SELECT * FROM sysobjects
WHERE ID = (OBJECT_ID(‘result’)) AND xtype = ‘U’)
DROP TABLE result;

SELECT t1234.Ones,
t1234.Twos,
t1234.Threes,
t1234.Fours,
(t1234.sumPoints_1234 + t5.sumPoints) AS sumPoints_12345
INTO result
FROM
(SELECT t123.Ones, t123.Twos, t123.Threes, t4.votes AS Fours,
(t123.sumPoints_123 + t4.sumPoints) AS sumPoints_1234
FROM
(SELECT t12.Ones, t12.Twos, t3.votes AS Threes,
(t12.sumPoints_12 + t3.sumPoints) AS sumPoints_123
FROM
(t1.sumPoints + t2.sumPoints) AS sumPoints_12
FROM OneS t1 CROSS JOIN TwoS t2
AND (t1.sumPoints + t2.sumPoints) <= ROUND(@maxTotalPoints,0)) t12
CROSS JOIN ThreeS t3
AND t12.sumPoints_12 + t3.sumPoints <= ROUND(@maxTotalPoints,0)) t123
CROSS JOIN FourS t4
AND t123.sumPoints_123 + t4.sumPoints <= ROUND(@maxTotalPoints,0)) t1234
CROSS JOIN FiveS t5
AND t1234.sumPoints_1234 + t5.sumPoints BETWEEN ROUND(@minTotalPoints,0)
AND ROUND(@maxTotalPoints,0);
SELECT * FROM result;

GO

Results:

Ones    Twos    Threes  Fours  Fives  totalVotes  totalPoints
——- ——- ——- —— —— ———– ———–
0       0       0       15     0      15          60.0000
0       0       1       14     0      15          59.0000
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3       3       0       1      8      15          53.0000
4       0       0       3      8      15          56.0000
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2       0       1       0      12     15          65.0000
2       0       0       0      13     15          67.0000

(871 row(s) affected)

The result of the calculation is in the table result. Each row of that table represents one possible voting combination. Columns One to Five store the number of readers that rated the article 1 to 5, out of 5, respectively.

The number of rows in the table, i.e. the number of possible voting combinations is equal to 871 and this is probably the most frustrating part of the result. Indeed, only one combination from 871 possible ones is the correct answer, but there is no way to say which one. Therefore, for the given input (15 votes and the average rating 4, rounded to a whole number) it is impossible to uncover how readers voted.

Does this mean, however, that the framework that we built, and all of the results, is useless? Well, no, because you still can get some interesting statistics from the result.

For example, try the following query:

SELECT COUNT(OneS) FrequencyPerOnes, Ones
FROM result
GROUP BY Ones
ORDER BY Ones

Results:

FrequencyPerOnes Ones
—————- ———–
377              0
254              1
146              2
67               3
23               4
4                5

From this we can deduce that:

1. There are no voting combinations where the number of ones exceeds 5;
2. There are 377 combinations, where no readers gave the article 1 point;
3. About 90% of all voting combinations are combinations where 2 or less readers rated the article 1 out of 5;

Our technique can provide much more accurate results when the average rating has a higher precision (in most electronic journals and websites, the precision of the rating is one or two digits after the decimal point).

For 15 votes and an average rating of 4.1 (rounded to one digit after the decimal point), you will get only 86 voting combinations (compare this to 871 combinations for the average rating 4):

GO
EXEC spu_findPossibleTotals @rating = 4.1, @precision = 1, @numVotes = 15;
GO
GO

Results:

Ones  Twos  Threes  Fours  Fives  totalVotes  totalPoints
—– —– ——- —— —— ———– ————
0     0     0       14     1      15          61.0000
0     0     0       13     2      15          62.0000
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2     2     0       0      11     15          61.0000
3     0     0       1      11     15          62.0000
3     0     1       0      11     15          61.0000

(86 row(s) affected)

For 15 votes with the average rating 4.13 (rounded to two digits after the decimal point), you will get only 39 voting combinations

GO
EXEC spu_findPossibleTotals @rating = 4.13, @precision = 2, @numVotes = 15;
GO
GO

Results:

Ones  Twos  Threes  Fours  Fives  totalVotes  totalPoints
—– —– ——- —— —— ———– ————
0     0     0       13     2      15          62.0000
0     0     1       11     3      15          62.0000
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2     1     1       0      11     15          62.0000
3     0     0       1      11     15          62.0000
1     3     0       0      11     15          62.0000

(39 row(s) affected)

This is actually as good as it gets. Use of a more precise average rating (with three or more digits after decimal point) will not reduce the number of possible voting combinations.

However, if you have some additional information about the voting, you can narrow down your search still further.For example, the article has been rated by 15 readers and has a rating of 4.13 with the precision of 2. If, somehow, you know that at least one reader gave the article 1 point, you will get 18 possible combinations instead of 39:

SELECT * FROM result WHERE Ones > 0;

Results:

Ones  Twos  Threes  Fours  Fives  totalVotes  totalPoints
—– —– ——- —— —— ———– ————
1     0     0       9      5      15          62.0000
1     0     1       7      6      15          62.0000
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3     0     0       1      11     15          62.0000
1     3     0       0      11     15          62.0000

(18 row(s) affected)

Furthermore, if you know that another reader gave the article 2 points, you will get only 9 possible combinations:

SELECT * FROM result WHERE Ones > 0 AND Twos > 0;

Results:

Ones  Twos  Threes  Fours  Fives  totalVotes  totalPoints
—– —– ——- —— —— ———– ————
1     1     0       6      7      15          62.0000
1     1     1       4      8      15          62.0000
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2     1     1       0      11     15          62.0000
1     3     0       0      11     15          62.0000

(9 row(s) affected)

You may be thinking that rating decomposition always results in multiple voting combinations, but this is not always the case. For some inputs, you can get just a few or even one voting combination.

The following example for 8 votes and a rating of 4.75 will retrieve only two possible voting combinations:

GO
EXEC spu_findPossibleTotals @rating = 4.75, @precision = 2, @numVotes = 8
GO
GO

Results:

Ones  Twos  Threes  Fours  Fives  totalVotes  totalPoints
—– —– ——- —— —— ———– ————
0     0     0       2      6      8           38.0000
0     0     1       0      7      8           38.0000

As a rule of thumb, the accuracy of decomposition depends on the number of votes and on the precision that was used to calculate the rating. The more votes and the fewer digits after the decimal points in the rating, the more entropy (more possible voting combinations) will be in the result.

#### Performance Considerations

Our solution was based on cross or Cartesian join, which in turn generates the Cartesian product -a huge number of the rows in the result. Since the solution uses cross-joins four times, it will definitely experience performance problems for large number of votes.

Indeed, when I ran the procedures on my single CPU desktop, for 87 votes and an average rating of 3.43, I got the result in 3 min. and 43 sec. For votes = 127, rating = 2.77, the execution time was 16 min. and 38 sec.

You can get better execution time if you just generate the result, but do not insert it into the table result (as in the Listing5). In that case, the procedure spu_getFinalResult will run almost two times faster. Placing indexes on tables OneS, TwoS, ThreeS, FourS and FiveS may also improve the performance.

Basically, you should be careful when using this solution for hundreds or thousands of votes as the algorithm we use here may become inefficient. Some good news, though, is that in real life, the number of votes for the articles (products, books) rarely goes over 100.

### Conclusions

How should you treat this article? Was it a fun puzzle? Was it something specific that can be applied only to very limited situations?

Yes, ultimately, you should consider this article as a puzzle, although the algorithm provided can be very useful for some specific cases.

For example, say you know the value of an aggregate function (SUM or COUNT or AVG and so on) and a few (from many) values that participated in the aggregate creation. If you do not have an access to a database (data) that produced the aggregate, how would you find the values that assembled the aggregate? This is the sort of situation where this algorithm can help.