Comparing Groups for Similarities in Power Query – Using Cosine Similarity

Comments 0

Share to social media

I’ll admit upfront—I am not a data scientist by trade. Instead, I’ve picked up my data science skills over time, learning through a combination of osmosis from talented colleagues and tackling real-world data challenges. It’s been a journey of trial, error, and refinement, as I’ve worked to bridge gaps between complex data science techniques and tools available to me.

Recently, my skills were put to the test when I needed to compare hundreds of Active Directory and SharePoint Groups to find similarities in their memberships. With only Power Query available in the production environment, no Python or R to ease the process, I faced the task of finding a method to finding similarities from scratch in Power Query. In this guide, I’ll walk you through the solution I developed, highlighting the steps that made it possible.

You can download the PowerBI .pbix file here in a .zip format that is used in this article.

Cosine Similarity

Let’s first start with the data science technique I used called Cosine similarity. Cosine similarity is a metric used to measure the similarity between two vectors by calculating the cosine of the angle between them. My eyes glazed over the first time I heard that explanation, but this technique is widely applied to quantify similarity in text analysis to find similarities in documents and recommendation engines – think Netflix or Amazon recommendations.

Cosine similarity ranges from -1 to 1, where 1 indicates perfectly aligned vectors (high similarity), 0 denotes orthogonal vectors (no similarity), and -1 suggests opposite alignment. If you want to read about the whole math that defines it, see this wiki article.

Real-world Application

So, in my case, I was looking for Active Directory/SharePoint groups that had memberships that were identical (cosine similarity of 1) or nearly identical (cosine similarity > .95). With hundreds of groups that range from 1 person to over 1,000 members, this was an analytical challenge. The goal was to look for ways to consolidate groups and reduce the overhead of managing multiple groups and access to applications.

Calculating cosine similarity in Power Query presents a few challenges, however. Power Query lacks built-in vector or matrix operations, which makes performing linear algebra tasks complex. Python/R developers could just import a library and use a function to calculate this, but alas this was not an option in the environment I was working in (just Power BI with no Power BI gateways). If you are in the same boat, here’s how to do it:

  1. Original Data – Let’s assume the data starts with a table where each row identifies a person and a single group they are associated with. Table 1 displays an example of group membership we’ll use during the rest of this article.

Group Name

Person

Red Jaguars

Jane.Doe

Red Jaguars

Barry.Allen

Red Jaguars

Diana.Prince

Blue Barracudas

Jane.Doe

Blue Barracudas

Barry.Allen

Green Monkeys

Jane.Doe

Green Monkeys

John.Doe

Green Monkeys

Barry.Allen

Orange Iguanas

Jane.Doe

Orange Iguanas

John.Doe

Orange Iguanas

Barry.Allen

Purple Parrots

Diana.Prince

Purple Parrots

Bruce.Wayne

Silver Snakes

Jane.Doe

Silver Snakes

Barry.Allen

Silver Snakes

Diana.Prince

Silver Snakes

Bruce.Wayne

Table 1 – Original Table

  1. Set up The Table – Before applying the cosine similarity algorithm you need to create a table where each row represents the group and each column represents each individual member (e.g., universal principal name/email address). Table 2 provides an example. Notice that 1 represents that the individual member is a member of the group and 0 represents that the individual member is NOT a member of the group.

Group Name

Jane.Doe

John.Doe

Barry.Allen

Diana.Prince

Bruce.Wayne

Blue Barracudas

1

0

1

0

0

Red Jaguars

1

0

1

1

0

Green Monkeys

1

1

1

0

0

Orange Iguanas

1

1

1

0

0

Purple Parrots

0

0

0

1

1

Silver Snakes

1

0

1

1

1

Table 2 – Set up The Table

  1. Convert the Columns to Vectors – Using Power Query we get the individual members (column names) and their value (1,0) and convert that to a Power Query list. We are left with a table with two columns Group Name, and Membership List. Table 3 provides an example of the transformed table.

Group Name

Membership as Vectors

Blue Barracudas

{1, 0, 1, 0, 0}

Red Jaguars

{1, 0, 1, 1, 0}

Green Monkeys

{1, 1, 1, 0, 0}

Orange Iguanas

{1, 1, 1, 0, 0}

Purple Parrots

{0, 0, 0, 1, 1}

Silver Snakes

{1, 0, 1, 1, 1}

Table 3 – Example of Table with Membership as Vectors (Membership List column)

  1. Cross Join the Table – Next, we need to Cross Join the Tables. This will create a permutation of each Group to compare. We also remove rows that would compare the same group and remove duplicate combinations; moreover, we don’t need both Red Jaguars compared to Blue Barracudas and Blue Barracudas compared to Red Jaguars. Table 4 provides an example output.

Group Name

Membership as Vectors

Group Name 2

Membership as Vectors 2

Blue Barracudas

{1, 0, 1, 0, 0}

Red Jaguars

{1, 0, 1, 1, 0}

Blue Barracudas

{1, 0, 1, 0, 0}

Green Monkeys

{1, 1, 1, 0, 0}

Blue Barracudas

{1, 0, 1, 0, 0}

Orange Iguanas

{1, 1, 1, 0, 0}

Blue Barracudas

{1, 0, 1, 0, 0}

Purple Parrots

{0, 0, 0, 1, 1}

Blue Barracudas

{1, 0, 1, 0, 0}

Silver Snakes

{1, 0, 1, 1, 1}

Red Jaguars

{1, 0, 1, 1, 0}

Green Monkeys

{1, 1, 1, 0, 0}

Red Jaguars

{1, 0, 1, 1, 0}

Orange Iguanas

{1, 1, 1, 0, 0}

Red Jaguars

{1, 0, 1, 1, 0}

Purple Parrots

{0, 0, 0, 1, 1}

Red Jaguars

{1, 0, 1, 1, 0}

Silver Snakes

{1, 0, 1, 1, 1}

Green Monkeys

{1, 1, 1, 0, 0}

Orange Iguanas

{1, 1, 1, 0, 0}

Green Monkeys

{1, 1, 1, 0, 0}

Purple Parrots

{0, 0, 0, 1, 1}

Green Monkeys

{1, 1, 1, 0, 0}

Silver Snakes

{1, 0, 1, 1, 1}

Orange Iguanas

{1, 1, 1, 0, 0}

Purple Parrots

{0, 0, 0, 1, 1}

Orange Iguanas

{1, 1, 1, 0, 0}

Silver Snakes

{1, 0, 1, 1, 1}

Purple Parrots

{0, 0, 0, 1, 1}

Silver Snakes

{1, 0, 1, 1, 1}

Table 4 – Cross Joined Table

  1. Conduct Cosine Similarity – Now it’s time to apply the algorithm by taking each Membership as Vectors and Membership as Vectors 2 and then calculate the dot product of the two lists (a.k.a. vectors), calculate the magnitude of each vector, calculate the square root for each magnitude and then divide the dot product by the square roots multiplied. This must be done with a custom Power Query function, and Figure 1 provides the current code.

Listing 1 – fnConsineSimilarity function

And you now have a calculation that defines the similarity of the group memberships. You can see that the Green Monkeys and Orange Monkeys have the same membership as the Cosine Similarity is 1. We also see the Red Jaguars and Silver Snakes have very similar membership makeup (the difference being just one person).

Group Name

Membership as Vectors (String)

Group Name 2

Membership as Vectors (String) 2

Cosine Similarity

List Differences by Name

Blue Barracudas

{1, 0, 1, 0, 0}

Red Jaguars

{1, 0, 1, 1, 0}

0.816496581

Diana.Prince

Blue Barracudas

{1, 0, 1, 0, 0}

Green Monkeys

{1, 1, 1, 0, 0}

0.816496581

John.Doe

Blue Barracudas

{1, 0, 1, 0, 0}

Orange Iguanas

{1, 1, 1, 0, 0}

0.816496581

John.Doe

Blue Barracudas

{1, 0, 1, 0, 0}

Purple Parrots

{0, 0, 0, 1, 1}

0

Diana.Prince, Bruce.Wayne, Jane.Doe, Barry.Allen

Blue Barracudas

{1, 0, 1, 0, 0}

Silver Snakes

{1, 0, 1, 1, 1}

0.707106781

Diana.Prince, Bruce.Wayne

Red Jaguars

{1, 0, 1, 1, 0}

Green Monkeys

{1, 1, 1, 0, 0}

0.666666667

John.Doe, Diana.Prince

Red Jaguars

{1, 0, 1, 1, 0}

Orange Iguanas

{1, 1, 1, 0, 0}

0.666666667

John.Doe, Diana.Prince

Red Jaguars

{1, 0, 1, 1, 0}

Purple Parrots

{0, 0, 0, 1, 1}

0.40824829

Bruce.Wayne, Jane.Doe, Barry.Allen

Red Jaguars

{1, 0, 1, 1, 0}

Silver Snakes

{1, 0, 1, 1, 1}

0.866025404

Bruce.Wayne

Green Monkeys

{1, 1, 1, 0, 0}

Orange Iguanas

{1, 1, 1, 0, 0}

1

 

Green Monkeys

{1, 1, 1, 0, 0}

Purple Parrots

{0, 0, 0, 1, 1}

0

Diana.Prince, Bruce.Wayne, Jane.Doe, John.Doe, Barry.Allen

Green Monkeys

{1, 1, 1, 0, 0}

Silver Snakes

{1, 0, 1, 1, 1}

0.577350269

Diana.Prince, Bruce.Wayne, John.Doe

Orange Iguanas

{1, 1, 1, 0, 0}

Purple Parrots

{0, 0, 0, 1, 1}

0

Diana.Prince, Bruce.Wayne, Jane.Doe, John.Doe, Barry.Allen

Orange Iguanas

{1, 1, 1, 0, 0}

Silver Snakes

{1, 0, 1, 1, 1}

0.577350269

Diana.Prince, Bruce.Wayne, John.Doe

Purple Parrots

{0, 0, 0, 1, 1}

Silver Snakes

{1, 0, 1, 1, 1}

0.707106781

Jane.Doe, Barry.Allen

Table 5 – Conduct Cosine Similarity

Conclusion

While the examples above are trivial, running this process across hundreds of groups with thousands of users may take some time; yet you can let Power BI do the hard work of identifying groups with similar memberships. You can also extend this to other comparison needs, like finding similar respondents to survey questions or product recommendations.

If all you have is Power BI, calculating cosine similarity is possible and can help you compare group memberships to find similar groups. If you’re interested in trying this out, we have an example file to share that contains the example tables and code used to calculate cosine similarity.

Article tags

Load comments

About the author

John Kerski has over a decade of experience in technical and government leadership. He specializes in managing Data Analytics projects and implementing DataOps principles to enhance solution delivery and minimize errors. John's expertise is showcased through his ability to offer patterns and templates that streamline the adoption of DataOps with Microsoft Fabric and Power BI. His in-depth knowledge and hands-on approach provide clients with practical tools to achieve efficient and effective data operations. Throughout his career, John has dedicated himself to helping Federal Agencies and private sector clients maximize their use of Microsoft products. His extensive experience and certifications with Microsoft solutions ensure that his clients receive a high rate of return on their investments.