Equivalence Classes & Cliques
We keep harping that SQL is based on sets, but how many of us ever go back and re-read any old text books on set theory? In particular, one concept gets forgotten is equivalence relations on sets, which create equivalence classes. These classes are disjoint and we can put an element from a set into one of them with some kind of rule.
Let’s use ~ as the notation for an equivalence relation. The definition is that it has three properties:
- The relation is reflexive: A ~ A. This means the relation applies to itself.
- The relation is symmetric: : if (A ~ B) ⇔ (B ~ A). This means when the relation applies to two different elements in the set, it applies both ways.
- The relation is transitive: (A ~ B) Λ (B ~ C) ⇒ (A ~ C). This means we can deduce all elements in a class by applying the relation to any known members of the class. This is pretty important for programming, as we will see.
So, if this is such a basic set operation, why isn’t it in SQL? The problem is that it produces classes, not a set! A class is a set of sets, but a table is just a set. The notation for each class is a pair of square bracket that contain some representative of each set. Assume we have
- a ∈ A
- b ∈ B
These expressions are all the same:
- A ~ B
- [a] = [b]
- [a] ∩ [b] ≠ Φ
A common example is the set Z of integers and any MOD()
operator will give you equivalence classes. The MOD(n, 2)
operation gives you one class consisting of all even numbers, and the other consisting of all odd numbers. The nice part is that you can compute the MOD() f
unction with arithmetic.
An equivalence classes can also be defined by a set of sets. This is an important concept in graph theory and graph database. If you have not seen it, Google “Six Degrees of Kevin Bacon“. The idea is that any individual involved in the Hollywood film industry can be linked through his or her film roles to Kevin Bacon within six steps.
Let’s steal a sample set from a SQL Forum posting of friends, where ~ now means “is a friend of” and that we can use “foaf” — “Friend of a friend” — to build cliches.
- {Fred, Jessica, John, Mary}
- {Albert, Nancy, Peter}
- {Abby}
- {Frank, Joe}
A complete graph means that each node is connected to every other node by one edge. If a subgraph is complete, it is actually called a Clique in graph theory!
The complete graph Kn of order n is a simple graph with n vertices in which every vertex is adjacent to every other. The complete graph on n vertices has n(n-1)/2 edges, which corresponding to all possible choices of pairs of vertices. But this is not an equivalence relation because it does not include the reference of each node to itself. Remember A~A? Think about Abby in this set, assuming that she likes herself, in spite of her lack of social skills.
Obviously, if you have a clique with (k) nodes in it, you can a complete subgraph of any size (j <k). A maximal clique is a clique that is not a subset of any other clique (some authors reserve the term clique for maximal clique.
SQL seems to have two equivalence relations that are major parts of the language. The first is plain old vanilla equal (=) for scalar values in the base data types. SQL is just like any other programming language, but we also have NULLs to consider. Oops! We all know that (NULL = NULL
) is not true. So we have to exclude NULLs or work around them.
The other “almost” equivalence relation is GROUP BY
in which the class is the grouping columns. A quick example would be “SELECT city_name, state_code FROM Customers GROUP BY state_code
;” and this is still not quite right because we have to do some kind of aggregation on the non-grouping columns in the query.
Graphs in SQL
The idiom for graphs in SQL is to use a table with the nodes involved and a second table with pairs of nodes (a, b) to model the edges, which model our ~ relation.. This is classic RDBMS; the first table is entities (e.g. cities on a map, electronic components in a device, etc) and the second table is a relationship (e.g. roads between cities, wiring between components, etc).
The bad news is that when you use it for equivalence relations, it can get big. A class with one member is has one row to show the edge back to itself. A class with two members {a, b} has {(a, a), (b, b), (a, b), (b, a)}
to show the ~ properties as edges. Three members gives us six rows; { (a, a), (b, b), (c, c), (a, b), (a, c), (b, a), (b, c), (c, a), (c, b)}.
The general formula is (n * (n-1)) + n
, which is pretty close to (n!).
First, load the table a few facts we might already know:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
CREATE TABLE Friends (lft_member VARCHAR(20) NOT NULL, rgt_member VARCHAR(20) NOT NULL, PRIMARY KEY (lft_member, rgt_member)) INSERT INTO Friends (lft_member, rgt_member) VALUES ('John', 'Mary'), ('Mary', 'Jessica'), ('Peter', 'Albert'), ('Peter', 'Nancy'), ('Abby', 'Abby'), ('Jessica', 'Fred'), ('Joe', 'Frank'); |
Reflexive Rows
The first thing we noticed is that only Abby has a reflexive property row. We need to add those rows and can do this with basic set operators. But it is not that easy, as you can see with this insertion statement:
1 2 3 4 5 6 7 8 9 |
INSERT INTO Friends SELECT X.lft_member, X.rgt_member FROM ((SELECT lft_member, lft_member FROM Friends AS F1 UNION SELECT rgt_member, rgt_member FROM Friends AS F2) EXCEPT SELECT lft_member, rgt_member FROM Friends AS F3) AS X (lft_member, rgt_member); |
The use of table aliases is tricky. You have to be sure that the SQL engine will construct a table in such a way that you do not get scoping problems. The UNION
and EXCEPT
operators are used to assure that we do not have primary key violations.
Abby |
Abby |
Albert |
Albert |
Frank |
Frank |
Fred |
Fred |
Jessica |
Jessica |
Jessica |
Fred |
Joe |
Joe |
Joe |
Frank |
John |
Mary |
John |
John |
Mary |
Mary |
Mary |
Jessica |
Nancy |
Nancy |
Peter |
Peter |
Peter |
Nancy |
Peter |
Albert |
Symmetric Rows
Look at the update table and there are no { (a, b), (b, a)}
pairs. We need to add this second property to the relation table. This follows the pattern of set operators we just used:
1 2 3 4 5 6 7 8 9 10 |
INSERT INTO Friends SELECT X.lft_member, X.rgt_member FROM ( (SELECT F1.rgt_member, F1.lft_member FROM Friends AS F1 WHERE F1.lft_member <> F1.rgt_member) EXCEPT (SELECT F2.lft_member, F2.rgt_member FROM Friends AS F2) ) AS X (lft_member, rgt_member); |
This will give us:
Abby |
Abby |
Albert |
Peter |
Albert |
Albert |
Frank |
Joe |
Frank |
Frank |
Fred |
Jessica |
Fred |
Fred |
Jessica |
Mary |
Jessica |
Jessica |
Jessica |
Fred |
Joe |
Joe |
Joe |
Frank |
John |
Mary |
John |
John |
Mary |
Mary |
Mary |
John |
Mary |
Jessica |
Nancy |
Peter |
Nancy |
Nancy |
Peter |
Peter |
Peter |
Nancy |
Peter |
Albert |
If you do quick GROUP BY
query, you see that Abby is seriously anti-social with a count of 1 , but Mary and Jessica look very social with a count of 3. This is not quite true because the property is the dreaded “Rock-paper-scissors-lizard-Spock” relationship.”
Transitive Rows
Transitivity goes on forever. Well, until we have what mathematicians call a closure. This means we have gotten a set of elements that have all of the valid relationships. In some cases, these sets are infinite. But in database, we can only have insanely large tables. Let’s pull a subset that is part oi a clique of 4 friends.
Fred |
Jessica |
Fred |
Fred |
Jessica |
Mary |
Jessica |
Jessica |
Jessica |
Fred |
John |
Mary |
John |
John |
Mary |
Mary |
Mary |
John |
Mary |
Jessica |
Now apply the transitive relation to get some of the missing edges of a graph:
1 2 3 4 5 6 7 8 9 |
INSERT INTO Friends SELECT X.lft_member, X.rgt_member FROM ((SELECT F1.lft_member, F2.rgt_member FROM Friends AS F1, Friends AS F2 WHERE F1.rgt_member = F2.lft_member) EXCEPT (SELECT F3.lft_member, F3.rgt_member FROM Friends AS F3) ) AS X (lft_member, rgt_member); |
This is a simple statement using the definition of transitivity, without a universal quantifier or loop on it. It will add these rows to this subset:
Fred |
Mary |
Jessica |
John |
John |
Jessica |
Mary |
Fred |
When you look at Mary and Jessica, you have all of their friends. However, Fred and John do not know that they are friends. So we invoke the statement again and get those two rows. If you try it a third time, there are zero rows added.
The first thought is this sounds like a job for a recursive statement. But it is not that easy. If the original graph has a cycle in it, you can hang in an infinite loop if you try to use a recursive CTE. The assumption is that each clique has a spanning graph in the pairs in. Oops! New term: a spanning graph is a sub-graph that includes the nodes and some or all of the edges of the graph. A complete graph is the spanning graph has all the possible edges, so each node is directly connected to any other node.
Cliques
Now change your mindset from graphs to sets. Let’s look at the size of the cliques:
1 2 3 4 |
WITH X (member, clique_size) AS (SELECT lft_member, COUNT(*) FROM Friends GROUP BY lft_member) SELECT * FROM X ORDER BY clique_size; |
Abby |
1 |
Frank |
2 |
Joe |
2 |
Nancy |
3 |
Peter |
3 |
Albert |
3 |
Fred |
4 |
Jessica |
4 |
John |
4 |
Mary |
4 |
What we want to do is to assign a number to each clique. This sample data is biased by the fact that the cliques are all different sizes. You cannot simply use the size to assign a clique_nbr.
Create this table load it with the names.
1 2 3 |
CREATE TABLE Cliques (clique_member VARCHAR(20) NOT NULL PRIMARY KEY, clique_nbr INTEGER); |
We can start by assigning everyone their own clique, using whatever your favorite method is:
1 2 3 4 5 6 7 8 9 10 11 12 |
INSERT INTO Cliques VALUES ('Abby', 1), ('Frank', 2), ('Joe', 3), ('Nancy', 4), ('Peter', 5), ('Albert', 6), ('Fred', 7), ('Jessica', 8), ('John', 9), ('Mary', 10); |
Everyone is equally a member of their clique in this model. That means we could start with anyone to get the rest of their clique. The update is very straight forward. Take any clique number, find the member to whom it belongs and use that name to build that clique. But which number to we use? We could use the MIN, MAX or a random number in the clique; I will use the MAX for no particular reason
I keep thinking that there is a recursive update statement that will do this in one statement. But I know it will not port (Standard SQL has no recursive update statement right now. We would do it with SQL/PSM or a host language) and I think it would be expensive. Recursion will happen for each member of the whole set, but if we consolidate a clique for one person, we have removed his friends from consideration.
The worst situation would be a bunch of hermits, so the number of clique would be the cardinality of set. That is not likely or useful in the real world. Let’s put the update in the body of a loop.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
BEGIN DECLARE @clique_loop INTEGER; DECLARE @mem CHAR(10); SET @clique_loop = (SELECT MAX(clique_nbr) FROM Cliques); WHILE @clique_loop > 0 BEGIN SET @mem = (SELECT clique_member FROM Cliques WHERE clique_nbr = @clique_loop); UPDATE Cliques SET clique_nbr = (SELECT clique_nbr FROM Cliques WHERE clique_member = @mem) WHERE clique_member IN (SELECT rgt_member FROM Friends WHERE lft_member = @mem); SET @clique_loop = @clique_loop - 1; SELECT * FROM Cliques; END; END; |
As a programming assignment, replace the simple counting loop with one that does not do updates with clique_loop
values that were removed in the prior iteration. This can save a lot of work in a larger social network than the sample data used here.
Adding Members
Now that we have a Cliques table, we can do a simple insertion if the new guy belongs to one clique. However, we can have someone who knows a people in different cliques. This will merge the two cliques into one.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
CREATE PROCEDURE Clique_Merge @clique_member_1 CHAR(10), @clique_member_2 CHAR(10) AS UPDATE Cliques SET clique_nbr =(SELECT MAX(clique_nbr) FROM Cliques WHERE clique_member IN (@clique_member_1, @clique_member_2)) WHERE clique_nbr =(SELECT MIN (clique_nbr) FROM Cliques WHERE clique_member IN (@clique_member_1, @clique_member_2)); |
Deleting a member or moving him to another clique is trivial.
Conclusion
This article is full of handy tricks to use with SQL. But cliques and equivalence classes are better handled with a graph database than with SQL. This is no surprise; that is what they were meant to do! The graph database can also handle relationships that do not have all three of the mathematical properties needed for an equivalence relationship.
If any readers want to add to this set of tricks, please do.
Load comments