Diff-Objects: a PowerShell utility Cmdlet for Discovering Object differences

You can use PowerShell’s Compare-Object to find the differences between objects, but it’s not always useful enough for database objects. Phil Factor explains how his Diff-Objects works in this post.

Introduction

It is important for many development jobs to be able to look at the differences between PowerShell objects.

It might be that you are checking how Windows processes change over time. Perhaps you are monitoring various signs of stress on a server but are at the exploratory stage, where you need to see which metrics seem to be correlated. You might want to make a high-level check of changes to a database by comparing the metadata to see what has changed. I use it to get a ‘narrative of changes’ in databases under development, and roughly when they happened.

One of the most common things you need to be able to do when developing is to run automated unit tests. This means automatically comparing the actual output with the correct output.

Whatever your development methodology, you need to make changes lightning fast, and the easiest way of doing that is to test frequently. If you are driving this work with PowerShell, which works well, you’ll want to compare the actual results of a process with the expected results. You’re keen to see what’s changed but will often have no idea what to look for beforehand. You need the broad view.

Fine. To do this, you need something that can tell you the differences between two objects. Yes, there is already a cmdlet to do that, called Compare-Object. It is useful and ingenious, and works well for what it does, but it doesn’t do enough for our purposes. Let’s run through a few examples, just to explain why I need more for many of the things I do.

Using PowerShell’s Compare-Object.

We first create two simple objects which have differences and similarities. They are strings, which are themselves objects (System.String).

Now we compare these two objects

OK. What that means is that ‘Anon’ and the blank line are equal in both (==), and the rest are either just in the first poem (<=) or else just in the second (=>).
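A minimal sketch of this kind of comparison, using two invented three-line texts as stand-ins (they are not the poems from the original example). The -IncludeEqual switch asks Compare-Object to report the matching lines as well as the differences.

```powershell
# Two invented texts, each held as an array of lines (stand-ins, not the original poems)
$firstPoem  = 'The cat sat on the mat', 'and purred all day', 'Anon'
$secondPoem = 'The cat sat on the rug', 'and purred all day', 'Anon'

# -IncludeEqual reports matching lines (==) as well as the differences (<= and =>)
$comparison = Compare-Object -ReferenceObject $firstPoem -DifferenceObject $secondPoem -IncludeEqual
$comparison | Format-Table -AutoSize
```

Each row has an InputObject and a SideIndicator column. You learn which lines are unmatched, but not which line in one text corresponds to which line in the other.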

I’d much prefer a side-by-side comparison that you can filter to show just the differences.

OK. Let’s give it something more complicated.

Well. We get this…

Which tells us that two objects in the array are the same (==), two are only in the second object (=>) and two are in the first (<=). However, it doesn’t tell us that Anna’s salary is the difference. To get into the detail, I want something like this.
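To illustrate the limitation, here is a small sketch with invented employee records (not the data from the original example). Only one salary differs, yet Compare-Object reports the whole record once as ‘only in the first’ and once as ‘only in the second’.

```powershell
# Hypothetical sample data: only Anna's salary differs between the two lists
$before = @(
    [pscustomobject]@{ Name = 'Anna';  Salary = 56000 }
    [pscustomobject]@{ Name = 'Peter'; Salary = 48000 }
)
$after = @(
    [pscustomobject]@{ Name = 'Anna';  Salary = 58000 }
    [pscustomobject]@{ Name = 'Peter'; Salary = 48000 }
)

# Comparing on both properties flags Anna's record on each side; nothing says which property changed
$rows = Compare-Object -ReferenceObject $before -DifferenceObject $after -Property Name, Salary -IncludeEqual
$rows | Format-Table -AutoSize
```

You have to read the two Anna rows side by side yourself to work out that the salary is what changed.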

I want a ‘diff’ or difference of the entire object to see what values have changed. I’d quite like to specify the depth to compare, and what to exclude or include, especially with larger objects. You’ll notice that this result can be filtered to allow you to list just properties that are different or missing.

Problems with object comparisons.

Comparing arrays

Before we delve too far into how Diff-Objects works, we need to mention a general problem with comparing arrays. When values have unique keys, it is easy to determine differences. Without unique keys, it is much harder: there is a whole branch of computer science devoted to working out how to compare two versions of an array in a way that shows how it has changed. One of the problems is that the insertion, rather than the appending, of an array element makes every subsequent element different from the reference. In terms of text represented by arrays of lines, an inserted line break makes everything after it a difference. Where the elements in an array are ordered, and the order is significant, it is legitimate to do this. Where they are in no particular order, and you allow duplicate values, then you must match them iteratively, one pair at a time, and then remove them as candidates for the subsequent matches.

In JSON, the order of arrays is significant, so I take this as a precedent for taking the easy option.
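The effect of an insertion on a positional comparison can be seen in a small sketch. The arrays are invented, and the loop is only an illustration of comparing by index, not the code used in Diff-Objects.

```powershell
# Hypothetical arrays: 'b2' has been inserted at index 1 of the second array
$reference  = 'a', 'b', 'c', 'd'
$difference = 'a', 'b2', 'b', 'c', 'd'

# Compare element by element, by position, as you would for an ordered JSON array
$length  = [Math]::Max($reference.Count, $difference.Count)
$byIndex = for ($i = 0; $i -lt $length; $i++) {
    $inRef  = $i -lt $reference.Count
    $inDiff = $i -lt $difference.Count
    if (-not $inRef) { $match = '->' }          # only in the difference array
    elseif (-not $inDiff) { $match = '<-' }     # only in the reference array
    elseif ($reference[$i] -eq $difference[$i]) { $match = '==' }
    else { $match = '<>' }
    [pscustomobject]@{
        Index      = $i
        Reference  = if ($inRef)  { $reference[$i] }  else { $null }
        Difference = if ($inDiff) { $difference[$i] } else { $null }
        Match      = $match
    }
}
$byIndex | Format-Table -AutoSize
```

Only one element was inserted, but every position after it is reported as different, and the last one as present only in the second array.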

Comparing nulls

NULLs have a variety of meanings. They can mean ‘Unknown’, which means that the result of comparing an unknown with something is always unknown. There again, a blank string is often used as if it were a NULL string. PowerShell can get confusing because it is very difficult to compare an object that doesn’t exist, and therefore returns NULL when you reference it, with an object that has a null value when you reference it. When comparing objects and their values, you need to know the difference.
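A short sketch of the ambiguity, using an invented object and assuming that strict mode is off (the default). Referencing a property that holds null and referencing one that doesn’t exist both give null, so you have to ask the object whether the property is there.

```powershell
# Hypothetical object: 'resigned' exists but holds null; 'retired' does not exist at all
$employee = [pscustomobject]@{ name = 'Anna'; resigned = $null }

# Both references return $null, so the values alone cannot tell the two cases apart
$null -eq $employee.resigned   # True
$null -eq $employee.retired    # True (with strict mode off)

# Asking the object for its property list does distinguish them
$hasResigned = $null -ne $employee.PSObject.Properties['resigned']   # True: the property exists
$hasRetired  = $null -ne $employee.PSObject.Properties['retired']    # False: no such property
```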

I can’t find a consensus view on whether a blank string is the same as a null value. It isn’t in relational databases: one is known to be a blank string and the other is unknown. I’ve made it configurable. If you need to equate null and blank, just set -NullAndBlankSame to $true. This is useful where you store objects as CSV, because CSV has no consistent concept of a null string.
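One way such a switch might be applied to a single pair of values is sketched below. The helper name Test-ValueEqual is hypothetical and is not part of Diff-Objects; only the parameter name -NullAndBlankSame is taken from the cmdlet.

```powershell
# Hypothetical helper: compares two simple values, optionally treating $null and '' as the same
function Test-ValueEqual {
    param (
        $Reference,
        $Difference,
        [bool]$NullAndBlankSame = $false
    )
    if ($NullAndBlankSame) {
        # Normalise an empty string to $null before comparing
        if ($Reference -is [string] -and $Reference -eq '') { $Reference = $null }
        if ($Difference -is [string] -and $Difference -eq '') { $Difference = $null }
    }
    if ($null -eq $Reference -or $null -eq $Difference) {
        # At least one side is null: equal only when both are
        return ($null -eq $Reference -and $null -eq $Difference)
    }
    # Otherwise PowerShell's -eq applies, with its usual conversion rules
    return ($Reference -eq $Difference)
}

Test-ValueEqual -Reference $null -Difference ''                          # False
Test-ValueEqual -Reference $null -Difference '' -NullAndBlankSame $true  # True
```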

Avoiding stuff

Starting a comparison at some reference point is easy: you just specify the reference point where you start the comparison in the -Parent parameter. Ignoring embedded objects is trickier. A classic example is ignoring comments (#comment) in XML files. A far worse problem is presented by those monster arrays and god-like objects that you tend to find in .NET objects passed to you from the operating system. You can specify a list (array) of strings with the names of the objects or references (those strings in the first column) that you wish to avoid, such as ‘$.employees[2].resigned’ in the last example.

How Diff-Objects works

I’ve done a blog post describing a Cmdlet I’ve called Display-Object. Although I’ve found it to be a very useful Cmdlet in its own right, I felt that it was a useful stepping-stone in understanding how I’ve tackled the problem of ‘diffing’ objects (finding the difference between them). I started writing Display-Object just as an illustration but, as so often in life, I got rather interested in it because it proved so useful to me in finding out what was going on inside some tricky objects.

Basically, a lot of object comparisons just ‘walk’ the hierarchy of a reference object, comparing any ‘comparable object’ (most simple values) with the same reference in the difference object. This is like a left outer join. Because I want to see additions and subtractions, I do the equivalent of a full outer join to find the differences.

It doesn’t report what objects are different, just the values. If an object has differences in the values between the reference object and its equivalent in the difference object then you can be sure there is a difference. By ‘difference’, I include those records that appear only in one of the two objects.
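The ‘walk’ can be sketched as a small recursive function that emits one record for each simple value, keyed by its path. Get-LeafValue is a hypothetical name and a simplification: it assumes JSON-like data made of custom objects, arrays and simple values, and it has no depth limit or handling for hashtables.

```powershell
# Hypothetical helper: walks an object and emits one record per simple value, keyed by its path
function Get-LeafValue {
    param ($InputObject, [string]$Path = '$')

    if ($null -eq $InputObject -or $InputObject -is [string] -or $InputObject -is [ValueType]) {
        # A 'comparable' value: report it and stop descending
        [pscustomobject]@{ Path = $Path; Value = $InputObject }
    } elseif ($InputObject -is [System.Collections.IList]) {
        # Arrays: the index becomes part of the path
        for ($i = 0; $i -lt $InputObject.Count; $i++) {
            Get-LeafValue -InputObject $InputObject[$i] -Path "${Path}[$i]"
        }
    } else {
        # Objects: each property name is appended with a dot
        foreach ($property in $InputObject.PSObject.Properties) {
            Get-LeafValue -InputObject $property.Value -Path "${Path}.$($property.Name)"
        }
    }
}

# Invented sample data
$data   = '{"employees":[{"name":"Anna","salary":56000},{"name":"Peter","salary":48000}]}' | ConvertFrom-Json
$leaves = Get-LeafValue -InputObject $data
$leaves | Format-Table -AutoSize
```

Flattening both objects in this way reduces the comparison to matching two lists of paths, which is where the full outer join comes in.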

Any useful Cmdlet that is designed to participate in a pipeline needs to report the result of a comparison via a collection of psCustomObjects. In this way, one can use Select-Object, Where-Object and all those other useful participants. Although it introduces some redundancies, I use a four-column format.

  1. The first column is the path expression to the object. By this I mean the dot references and array indices. A dollar sign means the name of the object, as with most object paths. Basically, in PowerShell in the ISE, you add the reference except for the dollar sign to the variable referring to the object, execute it, and you’ll see the value.
  2. The second column is the value in the reference object.
  3. The third column is the value in the difference object.
  4. The fourth column contains a symbol that gives the result of the comparison. This can be
    1. ‘Both there and equal’ (‘==’)
    2. ‘Both there and different’ (‘<>’)
    3. ‘Only in the difference object’ (‘->’)
    4. ‘Only in the reference object’ (‘<-’)
    5. ‘Could not be compared’ (e.g. write-only values) (‘–’)
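The four-column result and the full outer join can be sketched with two hand-built tables of path and value. The data is invented, and the column names Path, Reference, Difference and Match are assumptions made for this illustration; they may not match the names that Diff-Objects uses.

```powershell
# Hypothetical flattened objects: path -> value (ordered so that output order is predictable)
$reference = [ordered]@{
    '$.employees[0].name'   = 'Anna'
    '$.employees[0].salary' = 56000
    '$.employees[0].grade'  = 'B'
}
$difference = [ordered]@{
    '$.employees[0].name'     = 'Anna'
    '$.employees[0].salary'   = 58000
    '$.employees[0].resigned' = $false
}

# Full outer join: every path from either side, reference paths first
$paths = @($reference.Keys) + @($difference.Keys | Where-Object { -not $reference.Contains($_) })

$diff = foreach ($path in $paths) {
    $inRef  = $reference.Contains($path)
    $inDiff = $difference.Contains($path)
    if ($inRef -and $inDiff) {
        if ($reference[$path] -eq $difference[$path]) { $match = '==' } else { $match = '<>' }
    } elseif ($inRef) { $match = '<-' } else { $match = '->' }
    [pscustomobject]@{
        Path       = $path
        Reference  = $reference[$path]
        Difference = $difference[$path]
        Match      = $match
    }
}

# Because the result is a collection of objects, the usual pipeline cmdlets apply
$diff | Where-Object { $_.Match -ne '==' } | Format-Table -AutoSize
```

The final line filters the result down to the three paths that are different or missing.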

Uses for an ‘Object Diff’

Checking test results

Most Cmdlets produce objects. These are often lists of PS Custom Objects. If you can look at the output in Format-Table, that’s probably the case. Unless they are huge, these objects can be rendered as a document that can be saved. The ConvertTo … series of cmdlets are good for this. If the data is essentially tabular you can save it in its most economical form as CSV, but JSON is OK. This often gives you the opportunity to test your cmdlets as you develop them. You work out, and get general agreement about, what the result should look like for a particular set of parameters. If your cmdlet produces the same result for the same parameters, then you have a degree of confidence that you haven’t broken anything.
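As a rough sketch of the idea, the agreed result can be saved as JSON and compared with a later run. The data and the file name are invented, and the cmdlet under test is simulated by a literal. Comparing the serialised text gives only a pass or fail, which is where an object diff earns its keep by telling you what changed.

```powershell
# Hypothetical cmdlet output standing in for the result you have agreed is correct
$expected = @(
    [pscustomobject]@{ Name = 'Anna';  Salary = 56000 }
    [pscustomobject]@{ Name = 'Peter'; Salary = 48000 }
)

# Save the agreed result as a JSON document (the file name is an arbitrary choice)
$file = Join-Path ([System.IO.Path]::GetTempPath()) 'expected-result.json'
$expected | ConvertTo-Json | Set-Content -Path $file -Encoding UTF8

# Later: rerun the code under test (simulated here) and compare it with the saved result
$actual = @(
    [pscustomobject]@{ Name = 'Anna';  Salary = 56000 }
    [pscustomobject]@{ Name = 'Peter'; Salary = 48000 }
)
$saved  = Get-Content -Path $file -Raw | ConvertFrom-Json
$passed = ($actual | ConvertTo-Json) -ceq ($saved | ConvertTo-Json)
```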

Normally, you’d want to keep your test materials on-disk and iterate through them. Just to illustrate how it works, though, I’ve done it in code. I’ve created a file-based dataset that represents what the Display-Object Cmdlet actually should be producing, together with the object that we’re displaying. We want to make sure that Display-Object still works after we alter it.

You can run this, but instead of changing the code in Diff-Objects, we can take the easier route and simply change the data to see whether this is picked up by the tester.

Seeing how data in large objects changes.

The important point here with a large object is to look only at what you are interested in. Even a conceptually simple object like a data table can end up with a lot of nooks and crannies full of data. A process object can be severely overweight. You can start by just surveying the branch you’re interested in by specifying the ‘dot’ address of the data that you need to see, and avoid all the data-carbohydrates. Here, in the first example, we are checking a process where you aren’t at all interested in those arrays, so you filter them out by listing them in the ‘avoid’ parameter.
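The same principle can be tried without Diff-Objects by picking out a few simple properties of a process and comparing two snapshots. The choice of properties is arbitrary, and the column names are assumptions for this sketch; the bulky collections such as Modules and Threads are simply never selected.

```powershell
# Properties of interest (an arbitrary choice); bulky ones such as Modules and Threads are left out
$interesting = 'Id', 'ProcessName', 'HandleCount', 'WorkingSet64'

# Two snapshots of the current PowerShell process
$then = Get-Process -Id $PID | Select-Object -Property $interesting
$junk = 1..50000 | ForEach-Object { "item $_" }   # do some work so that memory use may change
$now  = Get-Process -Id $PID | Select-Object -Property $interesting

# One row per property, in the same spirit as the four-column format
$changes = foreach ($name in $interesting) {
    [pscustomobject]@{
        Path  = "`$.$name"
        Then  = $then.$name
        Now   = $now.$name
        Match = if ($then.$name -eq $now.$name) { '==' } else { '<>' }
    }
}
$changes | Format-Table -AutoSize
```

Which rows show as different will vary from run to run, since it depends on what the process happened to be doing between the two snapshots.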

You can start ‘some way from the trunk’ by presenting the cmdlet with the same reference, and providing the parentage to the ‘parent’ parameter so that the reference is correct if you subsequently want to get an individual value. Notice that we’ve not only carved off the branch of the data we’re interested in but we’ve specified this address as ‘parent’ so that the address is correct too.

The Code

The code for this utility is on GitHub. It is a bit bulky, and I feel that the code will change over time, so it might be best to get it from GitHub, as it is always trickier to update a published article.

 

Conclusion

Almost every PowerShell task that involves comparing objects seems to come up with another requirement. The built-in Compare-Object cmdlet is a good start and can be persuaded to do a lot of tasks, but nothing beats a cmdlet with the source that one can alter to suit. I don’t like to develop anything I don’t immediately need for my work, so I’m happy to leave something that does the job. One day I may come up with the need for, say, a cmdlet that lists EVERY key/value pair in the target that doesn’t appear in the source, rather than stopping at the level of the first difference. I may need something that will compare arrays where the order is not significant. No worries, because now I can just add it and test it!