Creating unique datasets with managed code

Figuring out uniqueness in large datasets is somewhat trivial in SQL via the DISTINCT keyword. This technique, however, puts a load on the SQL box, to the point where it can be more beneficial to scale out horizontally in managed code instead. Typically, it is much easier to spin up more web boxes to handle traffic than it is to stand up a beefier database. Moving the work into managed code scales out to increase throughput and reduce SQL pressure.

In this take, I will show you some techniques for working with duplicate data in managed code. I will explore common gotchas and show you what to do about them.

The code will be written in C# on .NET 6, so be sure to install a copy on your local machine. If preferred, LINQPad can be used, and you will be able to copy-paste and execute the code in this tool. Alternatively, feel free to clone the sample code from GitHub.

To begin, be sure to bring in the following using statements. I will use the stopwatch and write to the console often, so it is good to have these available.
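```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;       // Stopwatch
using System.Linq;              // Distinct
using System.Threading.Tasks;   // Task.WhenAll, used later
```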

I will use a timer to take performance snapshots for each technique. The goal here isn’t to be precise but to get a general feel for what is happening under the covers. I am running this code using a Debug build on an Intel 11th gen i7 1.4 GHz with 8 cores and 32GB of RAM. Your exact results may vary on your machine.

A simple example

A typical use case, and one you may already have experience with, is figuring out distinct numbers in a list of arbitrary numbers. C# makes this relatively painless via Distinct.
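For example, with a small list of ints:

```csharp
var numbers = new List<int> { 1, 2, 2, 3, 3, 3, 4 };

// Distinct removes the duplicate 2s and 3s
var distinctNumbers = numbers.Distinct().ToList();

Console.WriteLine(string.Join(", ", distinctNumbers)); // 1, 2, 3, 4
```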

Because value types have good equality support, the built-in functionality works effortlessly. The underlying algorithm can distinguish between different numbers and find unique values in the set. However, a real issue starts to emerge with reference types.

Bare distinct

In the real world, complex data types dominate business applications. Say I have a bunch of lab rats, three million to be exact, and I need to figure out a distinct set without any duplicates.

In object-oriented fashion, a lab rat might look like this:
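```csharp
// a minimal sketch of the type; only the three properties matter here
public class LabRat
{
  public string Name { get; init; } = string.Empty;

  public Guid TrackingId { get; init; }

  public string Color { get; init; } = string.Empty;
}
```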

This lab rat has a name, tracking ID, and color. This is an anemic type because it is just a collection of properties without any encapsulated logic.

To generate three million rats, an extension method like Times comes in handy:
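```csharp
public static class EnumerableExtensions
{
  // extends an integer so code like 3_000_000.Times(...) reads fluently
  public static IEnumerable<T> Times<T>(this int count, Func<int, T> func)
  {
    for (var i = 1; i <= count; i++)
    {
      yield return func(i);
    }
  }
}

const int totalEntries = 3_000_000;
const int uniqueEntries = 300;

// hypothetical seed data: any values that repeat every 300 entries will do
var labRats = totalEntries
  .Times(i => new LabRat
  {
    Name = $"Mickey #{i % uniqueEntries}",
    TrackingId = new Guid(i % uniqueEntries, 0, 0, new byte[8]),
    Color = "White"
  })
  .ToList();
```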

A modulus with uniqueEntries generates duplicate data every three hundred entries. This makes the entries 300/3,000,000 unique, or about 0.01 percent, which is a somewhat realistic scenario when you have a ton of duplicate records.

The extension method makes this a bit more fluent and easier to express in code. This method extends an integer like 3,000,000 and lets the code dot into a lambda expression that generates lab rats.

A way to figure out distinct entries can be done like this:
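```csharp
var stopwatch = Stopwatch.StartNew();

// no comparer: reference equality kicks in for a class
var distinctRats = labRats.Distinct().ToList();

Console.WriteLine($"Count: {distinctRats.Count}, elapsed: {stopwatch.Elapsed}");
// Count: 3000000, because every instance is treated as unique
```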

This, of course, does not work as intended, and performance suffers. The algorithm has no choice but to check each instance, and every lab rat takes a unique reference in memory. There is quite a bit of churn here since every entry is considered unique.

To remedy this issue, the code needs a way to do equality checks on complex types. Remember that without knowing the equality semantics between lab rats, the algorithm is left with no choice but to do reference checks, which is undesirable. Every complex type has a default comparer, and what is shown here is the bare behavior without explicitly defining equality.

Naïve distinct

One way to establish equality is via the IEqualityComparer<T> interface. This requires two methods: Equals and GetHashCode. Figuring out equality is trivial, while computing a proper hash code seems optional at first glance.
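A naïve comparer might look something like this (the class name is just for illustration):

```csharp
public class NaiveEqualityComparer : IEqualityComparer<LabRat>
{
  public bool Equals(LabRat? x, LabRat? y) =>
    x?.Name == y?.Name
    && x?.TrackingId == y?.TrackingId
    && x?.Color == y?.Color;

  // skip computing a real hash by forcing a constant
  public int GetHashCode(LabRat obj) => 0;
}
```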

This implementation skips computing the hash code by forcing a constant. This feels like a nice shortcut, and without knowing what the hash actually does, it seems like the right choice.

To test what this does:
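```csharp
var stopwatch = Stopwatch.StartNew();
var distinctRats = labRats.Distinct(new NaiveEqualityComparer()).ToList();

Console.WriteLine($"Count: {distinctRats.Count}, elapsed: {stopwatch.Elapsed}");
// Count: 300, but the elapsed time balloons
```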

This returns the correct number of distinct lab rats, but the performance is a dismal 4.4 seconds: ten times slower than the simple incorrect implementation, which puts this code in a precarious trade-off. Of course, the opposite priority might apply in some situations, but oftentimes correctness is preferable to better performance. Ideally, you want the code to give you both correctness and good performance.

Proper distinct

There is a way to compute a hash for multiple properties by leaning on a tuple, which comes with a good hash code calculation built in. Say there is a name, tracking ID, and color for a lab rat. A tuple with these three properties has a somewhat unique hash code.
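A comparer along those lines might look like this:

```csharp
public class ProperEqualityComparer : IEqualityComparer<LabRat>
{
  public bool Equals(LabRat? x, LabRat? y) =>
    x?.Name == y?.Name
    && x?.TrackingId == y?.TrackingId
    && x?.Color == y?.Color;

  // a tuple with all three properties computes a somewhat unique hash
  public int GetHashCode(LabRat obj) =>
    (obj.Name, obj.TrackingId, obj.Color).GetHashCode();
}
```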

The Equals method remains intact, and this uses the hash from the tuple instead of a dumb constant. With a properly computed hash code, check to see what this does:
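```csharp
var stopwatch = Stopwatch.StartNew();
var distinctRats = labRats.Distinct(new ProperEqualityComparer()).ToList();

Console.WriteLine($"Count: {distinctRats.Count}, elapsed: {stopwatch.Elapsed}");
// Count: 300, and roughly twice as fast as the bare distinct
```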

This time the performance is even better than the bare distinct, about twice as fast, and this returns a correct result. The reasons for the performance gains are twofold: this no longer churns through millions of records, and there is something very interesting happening with this hash code.

Why is this happening?

To find out what is happening, I will need to F12 into the Distinct method and decompile the code. This is possible in LINQPad and in any modern IDE. The focus is .NET 6, so keep in mind that implementation details are likely to change in future releases.

This is what you might see, it is not the entire code but only a small chunk of it:
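```csharp
// A reconstruction of the decompiled code, trimmed down to the parts that
// matter here. Names like DistinctIterator, Set, _buckets, and _slots mirror
// what the decompiler shows in .NET 6; the resizing and chain-walking logic
// has been stripped out, so treat this as a sketch of the real thing.
public static class DecompiledSketch
{
  public static IEnumerable<T> DistinctSketch<T>(
    this IEnumerable<T> source, IEqualityComparer<T>? comparer = null)
  {
    var set = new Set<T>(comparer ?? EqualityComparer<T>.Default);

    // the DistinctIterator loops through the entire list at least once
    foreach (var element in source)
    {
      if (set.Add(element))
      {
        yield return element;
      }
    }
  }

  // the internal hash set used to nuke duplicate entries
  private sealed class Set<T>
  {
    private struct Slot
    {
      internal int _hashCode;
      internal int _next;
      internal T _value;
    }

    private readonly IEqualityComparer<T> _comparer;
    private readonly int[] _buckets = new int[1024]; // the real code starts small and resizes
    private readonly Slot[] _slots = new Slot[1024];
    private int _count;

    public Set(IEqualityComparer<T> comparer) => _comparer = comparer;

    public bool Add(T value)
    {
      var hashCode = value is null
        ? 0
        : _comparer.GetHashCode(value) & 0x7FFFFFFF;

      // the hash code optimizes lookups via buckets
      var bucket = hashCode % _buckets.Length;
      var i = _buckets[bucket] - 1;

      // when hash codes collide, the real algorithm churns inside this
      // nested loop by following _slots[i]._next through the chain
      while (i >= 0)
      {
        if (_slots[i]._hashCode == hashCode
          && _comparer.Equals(_slots[i]._value, value))
        {
          return false;
        }

        // this trimmed version escapes the loop instead, so any bucket
        // collision is swallowed as a duplicate
        return false;
      }

      _slots[_count] = new Slot
      {
        _hashCode = hashCode,
        _next = _buckets[bucket] - 1,
        _value = value
      };
      _buckets[bucket] = ++_count;

      return true;
    }
  }
}
```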

These are the findings:

  • The DistinctIterator loops through the entire list at least once
  • The algorithm builds an internal hash set to nuke duplicate entries
  • The hash code optimizes the nested while loop via buckets
  • When the hash code causes collisions, the algorithm churns inside the nested loop

Please note that this code is far from complete and only focuses on the hash code. The nested while loop does not actually churn but escapes the loop, which means it can’t find all distinct values. This is mostly for the sake of brevity to avoid smacking you with a wall of code.

What is most interesting is that this explains the poor performance seen in the naïve comparer. When the hash code collides for all entries, the algorithm must work harder. This spikes the complexity to quadratic, or O(n^2). If the hash causes no collisions, expect linear, or O(n), complexity.

In performance-sensitive code, it may make sense to do away with the built-in hash code entirely and switch to one that causes even fewer collisions, like MurmurHash, for example.

Even though this implementation is far from complete, it is still possible to run the code:
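```csharp
var stopwatch = Stopwatch.StartNew();
var sketchCount = labRats.DistinctSketch(new ProperEqualityComparer()).Count();

Console.WriteLine($"Count: {sketchCount}, elapsed: {stopwatch.Elapsed}");
// Count: comes up short of 300 and shifts between executions
```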

The Count does not find all distinct lab rats, and the count changes per execution. This is because the built-in hash code for strings is randomized per process run, which means the hash isn't deterministic between executions.

Although the implementation of Distinct might be updated in the future, it is unlikely that this external hash code contract will change because it is a critical part of determining equality and optimizing internal algorithms in .NET.

A bit of asynchrony

Say there are two lists that need to be turned into a unique set, and they are coming from asynchronous data sources. Unfortunately, the framework has nothing built in to deal with this, but an extension method can come in handy.
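```csharp
// a sketch; the name SelectManyAsync is an assumption, not a framework API
public static async Task<IEnumerable<T>> SelectManyAsync<T, TParam>(
  this IEnumerable<TParam> parameters,
  Func<TParam, Task<IEnumerable<T>>> func)
{
  // kick off every task and await the whole batch
  var results = await Task.WhenAll(parameters.Select(func));

  // flatten the results into one combined list, duplicates included
  return results.SelectMany(x => x);
}
```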

Be sure to put this inside the existing EnumerableExtensions static class.

This grabs a parameter list and feeds it to the lambda function. Then, it runs everything in parallel and returns a combined list with all the duplicate records.

To get a unique set from the combined async list:
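```csharp
// two stand-in async sources; real code might call an API or a database
Task<IEnumerable<LabRat>> GetRatsAsync(int count) =>
  Task.FromResult(count.Times(i => new LabRat
  {
    Name = $"Mickey #{i % uniqueEntries}",
    TrackingId = new Guid(i % uniqueEntries, 0, 0, new byte[8]),
    Color = "White"
  }));

var stopwatch = Stopwatch.StartNew();

var combined = await new[] { 3_000_000, 3_000_000 }
  .SelectManyAsync(GetRatsAsync);

var distinctRats = combined.Distinct(new ProperEqualityComparer()).ToList();

Console.WriteLine($"Count: {distinctRats.Count}, elapsed: {stopwatch.Elapsed}");
```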

This technique remains performant because the elapsed time only grows linearly with the number of records, which is now six million, or double the records. This has linear complexity because the parameter list remains constant. If the parameters become a boundless list, each with millions of records, then the code will spike to quadratic complexity.

Distinct with C# records

If you are on .NET 5 or later, C# 9 introduces records to the mix; these provide built-in functionality for encapsulating data. One of their core features is value equality: two records are equal if they have the same type and store the same values.

To play with records, fire up another lab rat:
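```csharp
// named LabRatRecord here so it can live alongside the class version
public record LabRatRecord(string Name, Guid TrackingId, string Color);
```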

This has the same properties as before but expressed in less code. Because equality is built-in, it is possible to figure out distinct lab rats without implementing a comparer.
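For example:

```csharp
var recordRats = totalEntries
  .Times(i => new LabRatRecord(
    $"Mickey #{i % uniqueEntries}",
    new Guid(i % uniqueEntries, 0, 0, new byte[8]),
    "White"))
  .ToList();

var stopwatch = Stopwatch.StartNew();

// no comparer needed: records come with value equality built in
var distinctRats = recordRats.Distinct().ToList();

Console.WriteLine($"Count: {distinctRats.Count}, elapsed: {stopwatch.Elapsed}");
```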

Notice the performance gets dinged a bit. This is because the built-in implementation does not use the tuple technique, which causes more collisions.

This code is equivalent to the following but using a class instead of a record:
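```csharp
public class EquatableLabRat : IEquatable<EquatableLabRat>
{
  public string Name { get; init; } = string.Empty;

  public Guid TrackingId { get; init; }

  public string Color { get; init; } = string.Empty;

  public bool Equals(EquatableLabRat? other) =>
    other is not null
    && Name == other.Name
    && TrackingId == other.TrackingId
    && Color == other.Color;

  public override bool Equals(object? obj) => Equals(obj as EquatableLabRat);

  // the hash combine helper, much like what a record generates internally
  public override int GetHashCode() => HashCode.Combine(Name, TrackingId, Color);
}
```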

The biggest difference here is the hash combine helper, which is much like what a C# record uses internally, and this can be overridden. Notice it is also possible to avoid defining a comparer by simply implementing IEquatable<T> in the target class. Records implement this equatable interface too, which is what provides their equality functionality.

Now to verify this code works:
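```csharp
var equatableRats = totalEntries
  .Times(i => new EquatableLabRat
  {
    Name = $"Mickey #{i % uniqueEntries}",
    TrackingId = new Guid(i % uniqueEntries, 0, 0, new byte[8]),
    Color = "White"
  })
  .ToList();

var stopwatch = Stopwatch.StartNew();

// Distinct picks up the IEquatable implementation via the default comparer
var distinctRats = equatableRats.Distinct().ToList();

Console.WriteLine($"Count: {distinctRats.Count}, elapsed: {stopwatch.Elapsed}");
```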

The elapsed time is different here because the built-in hash code calculation is dynamic. This is one tidbit to keep in mind: hash codes are a moving target, so don't expect consistency between different types.

Conclusion

Figuring out uniqueness in managed code can be useful in taking the load off the database. The hash code dictates the efficiency of the distinct algorithm, so the best approach is to avoid too many collisions.

If you like this article, you might also like Functional monads in C#

About the author

Camilo Reyes

Software Engineer from Houston, Texas. Passionate about C# and clean code that runs without drama. When not coding, loves to cook and work on random home projects.