Comparing Groups for Similarities in Power Query – Using Cosine Similarity

I’ll admit upfront—I am not a data scientist by trade. Instead, I’ve picked up my data science skills over time, learning through a combination of osmosis from talented colleagues and tackling real-world data challenges. It’s been a journey of trial, error, and refinement, as I’ve worked to bridge gaps between complex data science techniques and tools available to me.

Recently, my skills were put to the test when I needed to compare hundreds of Active Directory and SharePoint Groups to find similarities in their memberships. With only Power Query available in the production environment, no Python or R to ease the process, I faced the task of finding a method to finding similarities from scratch in Power Query. In this guide, I’ll walk you through the solution I developed, highlighting the steps that made it possible.

You can download the PowerBI .pbix file here in a .zip format that is used in this article.

Cosine Similarity

Let’s first start with the data science technique I used called Cosine similarity. Cosine similarity is a metric used to measure the similarity between two vectors by calculating the cosine of the angle between them. My eyes glazed over the first time I heard that explanation, but this technique is widely applied to quantify similarity in text analysis to find similarities in documents and recommendation engines – think Netflix or Amazon recommendations.

Cosine similarity ranges from -1 to 1, where 1 indicates perfectly aligned vectors (high similarity), 0 denotes orthogonal vectors (no similarity), and -1 suggests opposite alignment. If you want to read about the whole math that defines it, see this wiki article.

Real-world Application

So, in my case, I was looking for Active Directory/SharePoint groups that had memberships that were identical (cosine similarity of 1) or nearly identical (cosine similarity > .95). With hundreds of groups that range from 1 person to over 1,000 members, this was an analytical challenge. The goal was to look for ways to consolidate groups and reduce the overhead of managing multiple groups and access to applications.

Calculating cosine similarity in Power Query presents a few challenges, however. Power Query lacks built-in vector or matrix operations, which makes performing linear algebra tasks complex. Python/R developers could just import a library and use a function to calculate this, but alas this was not an option in the environment I was working in (just Power BI with no Power BI gateways). If you are in the same boat, here’s how to do it:

Original Data – Let’s assume the data starts with a table where each row identifies a person and a single group they are associated with. Table 1 displays an example of group membership we’ll use during the rest of this article.

Group Name	Person
Red Jaguars	Jane.Doe
Red Jaguars	Barry.Allen
Red Jaguars	Diana.Prince
Blue Barracudas	Jane.Doe
Blue Barracudas	Barry.Allen
Green Monkeys	Jane.Doe
Green Monkeys	John.Doe
Green Monkeys	Barry.Allen
Orange Iguanas	Jane.Doe
Orange Iguanas	John.Doe
Orange Iguanas	Barry.Allen
Purple Parrots	Diana.Prince
Purple Parrots	Bruce.Wayne
Silver Snakes	Jane.Doe
Silver Snakes	Barry.Allen
Silver Snakes	Diana.Prince
Silver Snakes	Bruce.Wayne

Table 1 – Original Table

Set up The Table – Before applying the cosine similarity algorithm you need to create a table where each row represents the group and each column represents each individual member (e.g., universal principal name/email address). Table 2 provides an example. Notice that 1 represents that the individual member is a member of the group and 0 represents that the individual member is NOT a member of the group.

Group Name	Jane.Doe	John.Doe	Barry.Allen	Diana.Prince	Bruce.Wayne
Blue Barracudas	1	0	1	0	0
Red Jaguars	1	0	1	1	0
Green Monkeys	1	1	1	0	0
Orange Iguanas	1	1	1	0	0
Purple Parrots	0	0	0	1	1
Silver Snakes	1	0	1	1	1

Table 2 – Set up The Table

Convert the Columns to Vectors – Using Power Query we get the individual members (column names) and their value (1,0) and convert that to a Power Query list. We are left with a table with two columns Group Name, and Membership List. Table 3 provides an example of the transformed table.

Group Name	Membership as Vectors
Blue Barracudas	{1, 0, 1, 0, 0}
Red Jaguars	{1, 0, 1, 1, 0}
Green Monkeys	{1, 1, 1, 0, 0}
Orange Iguanas	{1, 1, 1, 0, 0}
Purple Parrots	{0, 0, 0, 1, 1}
Silver Snakes	{1, 0, 1, 1, 1}

Table 3 – Example of Table with Membership as Vectors (Membership List column)

Cross Join the Table – Next, we need to Cross Join the Tables. This will create a permutation of each Group to compare. We also remove rows that would compare the same group and remove duplicate combinations; moreover, we don’t need both Red Jaguars compared to Blue Barracudas and Blue Barracudas compared to Red Jaguars. Table 4 provides an example output.

Group Name	Membership as Vectors	Group Name 2	Membership as Vectors 2
Blue Barracudas	{1, 0, 1, 0, 0}	Red Jaguars	{1, 0, 1, 1, 0}
Blue Barracudas	{1, 0, 1, 0, 0}	Green Monkeys	{1, 1, 1, 0, 0}
Blue Barracudas	{1, 0, 1, 0, 0}	Orange Iguanas	{1, 1, 1, 0, 0}
Blue Barracudas	{1, 0, 1, 0, 0}	Purple Parrots	{0, 0, 0, 1, 1}
Blue Barracudas	{1, 0, 1, 0, 0}	Silver Snakes	{1, 0, 1, 1, 1}
Red Jaguars	{1, 0, 1, 1, 0}	Green Monkeys	{1, 1, 1, 0, 0}
Red Jaguars	{1, 0, 1, 1, 0}	Orange Iguanas	{1, 1, 1, 0, 0}
Red Jaguars	{1, 0, 1, 1, 0}	Purple Parrots	{0, 0, 0, 1, 1}
Red Jaguars	{1, 0, 1, 1, 0}	Silver Snakes	{1, 0, 1, 1, 1}
Green Monkeys	{1, 1, 1, 0, 0}	Orange Iguanas	{1, 1, 1, 0, 0}
Green Monkeys	{1, 1, 1, 0, 0}	Purple Parrots	{0, 0, 0, 1, 1}
Green Monkeys	{1, 1, 1, 0, 0}	Silver Snakes	{1, 0, 1, 1, 1}
Orange Iguanas	{1, 1, 1, 0, 0}	Purple Parrots	{0, 0, 0, 1, 1}
Orange Iguanas	{1, 1, 1, 0, 0}	Silver Snakes	{1, 0, 1, 1, 1}
Purple Parrots	{0, 0, 0, 1, 1}	Silver Snakes	{1, 0, 1, 1, 1}

Table 4 – Cross Joined Table

Conduct Cosine Similarity – Now it’s time to apply the algorithm by taking each Membership as Vectors and Membership as Vectors 2 and then calculate the dot product of the two lists (a.k.a. vectors), calculate the magnitude of each vector, calculate the square root for each magnitude and then divide the dot product by the square roots multiplied. This must be done with a custom Power Query function, and Figure 1 provides the current code.

(group1 as list, group2 as list) =>

let

// Extract values of both groups

vec1 = group1,

vec2 = group2,

// Calculate dot product

dotProduct = List.Sum(List.Transform(List.Zip({vec1, vec2}),

each _{0} * _{1})),

// Calculate magnitudes of both vectors

magnitude1 = List.Sum(List.Transform(vec1, each _ * _)),

magnitude2 = List.Sum(List.Transform(vec2, each _ * _)),

// Handle case when magnitudes are zero

//(to avoid division by zero)

magnitude1Sqrt = if magnitude1 = 0 then 0 else

Number.Sqrt(magnitude1),

magnitude2Sqrt = if magnitude2 = 0 then 0

else Number.Sqrt(magnitude2),

// Calculate cosine similarity

cosineSimilarity = if magnitude1Sqrt = 0 or magnitude2Sqrt = 0

then 0 else dotProduct / (magnitude1Sqrt * magnitude2Sqrt)

cosineSimilarity

Listing 1 – fnConsineSimilarity function

And you now have a calculation that defines the similarity of the group memberships. You can see that the Green Monkeys and Orange Monkeys have the same membership as the Cosine Similarity is 1. We also see the Red Jaguars and Silver Snakes have very similar membership makeup (the difference being just one person).

Group Name	Membership as Vectors (String)	Group Name 2	Membership as Vectors (String) 2	Cosine Similarity	List Differences by Name
Blue Barracudas	{1, 0, 1, 0, 0}	Red Jaguars	{1, 0, 1, 1, 0}	0.816496581	Diana.Prince
Blue Barracudas	{1, 0, 1, 0, 0}	Green Monkeys	{1, 1, 1, 0, 0}	0.816496581	John.Doe
Blue Barracudas	{1, 0, 1, 0, 0}	Orange Iguanas	{1, 1, 1, 0, 0}	0.816496581	John.Doe
Blue Barracudas	{1, 0, 1, 0, 0}	Purple Parrots	{0, 0, 0, 1, 1}	0	Diana.Prince, Bruce.Wayne, Jane.Doe, Barry.Allen
Blue Barracudas	{1, 0, 1, 0, 0}	Silver Snakes	{1, 0, 1, 1, 1}	0.707106781	Diana.Prince, Bruce.Wayne
Red Jaguars	{1, 0, 1, 1, 0}	Green Monkeys	{1, 1, 1, 0, 0}	0.666666667	John.Doe, Diana.Prince
Red Jaguars	{1, 0, 1, 1, 0}	Orange Iguanas	{1, 1, 1, 0, 0}	0.666666667	John.Doe, Diana.Prince
Red Jaguars	{1, 0, 1, 1, 0}	Purple Parrots	{0, 0, 0, 1, 1}	0.40824829	Bruce.Wayne, Jane.Doe, Barry.Allen
Red Jaguars	{1, 0, 1, 1, 0}	Silver Snakes	{1, 0, 1, 1, 1}	0.866025404	Bruce.Wayne
Green Monkeys	{1, 1, 1, 0, 0}	Orange Iguanas	{1, 1, 1, 0, 0}	1
Green Monkeys	{1, 1, 1, 0, 0}	Purple Parrots	{0, 0, 0, 1, 1}	0	Diana.Prince, Bruce.Wayne, Jane.Doe, John.Doe, Barry.Allen
Green Monkeys	{1, 1, 1, 0, 0}	Silver Snakes	{1, 0, 1, 1, 1}	0.577350269	Diana.Prince, Bruce.Wayne, John.Doe
Orange Iguanas	{1, 1, 1, 0, 0}	Purple Parrots	{0, 0, 0, 1, 1}	0	Diana.Prince, Bruce.Wayne, Jane.Doe, John.Doe, Barry.Allen
Orange Iguanas	{1, 1, 1, 0, 0}	Silver Snakes	{1, 0, 1, 1, 1}	0.577350269	Diana.Prince, Bruce.Wayne, John.Doe
Purple Parrots	{0, 0, 0, 1, 1}	Silver Snakes	{1, 0, 1, 1, 1}	0.707106781	Jane.Doe, Barry.Allen

Table 5 – Conduct Cosine Similarity

Conclusion

While the examples above are trivial, running this process across hundreds of groups with thousands of users may take some time; yet you can let Power BI do the hard work of identifying groups with similar memberships. You can also extend this to other comparison needs, like finding similar respondents to survey questions or product recommendations.

If all you have is Power BI, calculating cosine similarity is possible and can help you compare group memberships to find similar groups. If you’re interested in trying this out, we have an example file to share that contains the example tables and code used to calculate cosine similarity.

Register for Simple Talk

Comparing Groups for Similarities in Power Query – Using Cosine Similarity

Cosine Similarity

Real-world Application

Conclusion

Article tags

About the author

John Kerski

John's contributions

Articles

Books

Top topics

John's latest contributions:

13 Things I wish I knew about Power Query (when I first started)

The complexity of using Power BI in US Sovereign Clouds

Hiring and Motivating Technical Polyglots in Data Analytics

Cosine Similarity

Real-world Application

Conclusion

Article tags

Recommended

About the author

John Kerski

John's contributions

Articles

Books

Top topics

John's latest contributions:

13 Things I wish I knew about Power Query (when I first started)

The complexity of using Power BI in US Sovereign Clouds

Hiring and Motivating Technical Polyglots in Data Analytics