The Phantom Menace in Unit Testing

Testing is a challenging yet crucial part of software development, but how do you know that a test is telling you what you need to know? In this article, Michael Sorens explores the concept of phantom tests that return correct results but don’t actually prove anything.

Let me state up front that this is not a rant about unit testing; unit tests are critically important elements of a robust and healthy software implementation. Instead, it is a cautionary tale about a small class of unit tests that may deceive you by seeming to provide test coverage but failing to do so. I call this class of unit tests phantom tests because they return what are, in fact, correct results but not necessarily because the system-under-test (SUT) is doing the right thing or, indeed, doing anything.

In these cases, the SUT “naturally” returns the expected value, so doing (a) the correct thing, (b) something unrelated, or even (c) nothing, would still yield a passing test. If the SUT is doing (b) or (c), then it follows that the test is adding no value. Moreover, I submit that the presence of such tests is often deleterious, making you worse off than not having them, because you think you have coverage when you do not. When you then go to make a change to the SUT supposedly covered by that test, and the test still passes, you might blissfully conclude that your change did not introduce any bugs to the code, so you go on your merry way to your next task. In actuality, you simply do not know whether you introduced any bugs, because your tests are not reporting valid information.

Invoking Some Spirits

What exactly is a phantom test? Consider this example. Say you have a function AccumulateWhenGreen(value, condition). The parameters are:

  • Value — the number to add to the accumulation
  • Condition — red, yellow, or green indicating some status

The name implies that it should accumulate the given value when the condition is green and skip the value when the condition is red or yellow. To evaluate the function, write a unit test. (The code examples that follow are sketches written in Python for concreteness rather than any particular production language; in particular, the Accumulator class that holds the accumulated total is one hypothetical shape for the SUT, not a prescribed API.)
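Here is what that first test might look like:

```python
from enum import Enum

class Condition(Enum):
    Red = 1
    Yellow = 2
    Green = 3

def test_skips_value_when_red():
    accumulator = Accumulator()    # Accumulator is defined in the next listing
    accumulator.AccumulateWhenGreen(23, Condition.Red)
    assert accumulator.total == 0  # red: the value must be skipped
```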

If that test passes, the function has successfully fulfilled that clause in the contract, right? (By contract I mean the software requirements to be implemented.) Not so fast. Look at the code for AccumulateWhenGreen itself, continuing the sketch:
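```python
class Accumulator:
    def __init__(self):
        self.total = 0

    def AccumulateWhenGreen(self, value, condition):
        # Accumulate only when the condition is green;
        # red and yellow leave the total untouched.
        if condition == Condition.Green:
            self.total += value
```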

That code is correctly written to implement the relevant requirement. However, do not take my word for it; prove it. Change the test so that the second argument passed to AccumulateWhenGreen is Condition.Green instead of Condition.Red. What happens to the test? Now the test fails, because the value gets added to the accumulator and thus the accumulator is non-zero. Finally, change the parameter to Condition.Yellow and the test again passes. Q.E.D.

So far so good. Now consider this alternate implementation of AccumulateWhenGreen:
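Sketched out, it might look something like this (the particular arithmetic is just one contrived possibility):

```python
class Accumulator:
    def __init__(self):
        self.total = 0

    def AccumulateWhenGreen(self, value, condition):
        # Compute an "adjustment" from the input value and apply it.
        adjustment = (value * 2 + 6) - (value + 3) * 2
        self.total += adjustment
```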

At first glance, it looks like this version does some convoluted computation to accomplish what the code is supposed to do. Moreover, the “AccumulateWhenGreen skips value when red” unit test will pass: the accumulator will not be changed. Is it because that computation somehow takes the input value “23” into account? No; the unit test will pass for any integer you care to provide to the function. That’s good, right? So why does the code work? The answer is that it does not. Sure, the test passes for Condition.Red. It also passes for Condition.Yellow. Fine. However, for Condition.Green, the test still passes when it should fail, because the accumulator is supposed to change for Condition.Green.

At the very beginning I mentioned three things the code could do:

(a) the correct thing

(b) something unrelated

(c) nothing

In this case, the code is doing something unrelated. Notice that the Condition argument is suspiciously absent from the calculation. Paraphrasing a professor from my university days, the code is providing an answer to some question, just not the correct one! Consider the third alternative, doing nothing. With this code…
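```python
class Accumulator:
    def __init__(self):
        self.total = 0

    def AccumulateWhenGreen(self, value, condition):
        pass   # both arguments ignored; the total is never touched
```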

…you get the same results: the unit test passes for Condition.Red and for Condition.Yellow, both of which are good news, and for Condition.Green, which is bad news.

How to solve this? Recall that, with the correct code in place, the test (still asserting an unchanged accumulator) passed when given Condition.Red or Condition.Yellow but failed when given Condition.Green. One way to avoid the phantom menace is to add more tests:
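```python
def test_skips_value_when_red():
    accumulator = Accumulator()
    accumulator.AccumulateWhenGreen(23, Condition.Red)
    assert accumulator.total == 0

def test_skips_value_when_yellow():
    accumulator = Accumulator()
    accumulator.AccumulateWhenGreen(23, Condition.Yellow)
    assert accumulator.total == 0

def test_adds_value_when_green():
    # The crucial non-phantom test: something must happen here.
    accumulator = Accumulator()
    accumulator.AccumulateWhenGreen(23, Condition.Green)
    assert accumulator.total == 23
```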

With this suite of tests in place, the correct code, case (a), passes all three tests, but the incorrect code, cases (b) or (c), fails on the third test.

Maxim #1:
A test checking that nothing happened must be accompanied by a test checking that something happened.

Notice that the tests for Condition.Red and for Condition.Yellow passed with the correct code, the unrelated code, or no code at all under test. That is, they always passed. Do they actually serve a purpose, then? Yes! At this moment those tests always pass, so their passing provides no useful information about the correctness of the SUT. However, if down the road they start failing after you make changes to the SUT, then those changes did introduce a real problem.

Maxim #2:
A phantom test proves nothing if it passes. It indicates a real problem if it fails.

Can you make a phantom test more solid (pun intended)? That is, can you make a test that is supposed to confirm nothing happened actually prove that nothing happened for the right reason? (Again, that means (a) the SUT has correct code rather than (b) unrelated code or (c) no code.)

Yes! Bring on our guest practice for this segment: test-driven development (TDD). Whether or not you use TDD, and whether you write your tests first or last, the following can help you do (apologies to non-native English speakers for being cute here!) well, nothing. More formally, it can help you create purportedly phantom tests (tests that confirm nothing happened) in a way that gives you confidence that the code performed that no-op correctly.

TDD principles state that when you want to add new functionality, you first add a new test, and that the new test must fail. If it does not fail, then either you have added a test for something your system already does (and presumably already have a test for), or you have added a phantom test, and the test will always pass.

Here is one way the story might have unfolded with our sample code and tests:

Create the first, happy path test:
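```python
def test_adds_value_when_green():
    accumulator = Accumulator()
    accumulator.AccumulateWhenGreen(23, Condition.Green)
    assert accumulator.total == 23   # green: the value must be accumulated
```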

Write some code to make it pass:
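```python
class Accumulator:
    def __init__(self):
        self.total = 0

    def AccumulateWhenGreen(self, value, condition):
        # Simplest thing that makes the happy-path test pass:
        # accumulate unconditionally (condition is not consulted yet).
        self.total += value
```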

Add the next two tests together:
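```python
def test_skips_value_when_red():
    accumulator = Accumulator()
    accumulator.AccumulateWhenGreen(23, Condition.Red)
    assert accumulator.total == 0

def test_skips_value_when_yellow():
    accumulator = Accumulator()
    accumulator.AccumulateWhenGreen(23, Condition.Yellow)
    assert accumulator.total == 0
```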

Both of those tests will fail. By Maxim #2, that says there is a problem to fix, as there should be. Write some more code that makes those tests now pass, adding the conditional in this case:
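```python
class Accumulator:
    def __init__(self):
        self.total = 0

    def AccumulateWhenGreen(self, value, condition):
        # The conditional makes the two new tests pass
        # while keeping the original green test passing.
        if condition == Condition.Green:
            self.total += value
```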

With that in place, all tests pass. Moreover, those two new tests are now purportedly phantom tests. However, they began as failing tests and turned into passing tests as the code evolved, so they have provided value.

Maxim #3:
How you arrive at a phantom test matters.

I illustrated the above with TDD because it is vital that a newly introduced test first fails. You can sometimes meet this requirement in a non-TDD fashion, but it takes more work: either you need to add some more logic to your test to get it to a state where it will fail, or you need to break your working SUT so that the test fails. Once you confirm that the test is failing for the right reason, make it pass by backing out those artificial tweaks.

A Real-World Example: The Authorisation Problem

Sometimes you have to live with the presence of phantom tests, but often you can convert them to real, non-phantom tests. To illustrate this point, turn from the above academic example to consider a real-world example. Say you are designing an authorisation system to regulate access rights to resources in your enterprise system. One typical foundation of such an authorisation system might succinctly be:

User U is authorised to perform an action A if there is a policy that allows U to perform A and there is no policy that denies U from performing A.
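That rule translates almost directly into code. Here is a minimal sketch (the Policy shape and the is_authorised function are illustrative, not taken from any particular framework):

```python
from dataclasses import dataclass
from enum import Enum

class Effect(Enum):
    Allow = 1
    Deny = 2

@dataclass(frozen=True)
class Policy:
    user: str
    action: str
    effect: Effect

def is_authorised(policies, user, action):
    # Allowed only if some policy allows the action
    # AND no policy denies it.
    matches = [p for p in policies if p.user == user and p.action == action]
    allows = any(p.effect == Effect.Allow for p in matches)
    denies = any(p.effect == Effect.Deny for p in matches)
    return allows and not denies
```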

Here are the tests you might come up with.

T0 – With no policies, an action is denied

That certainly is part of the contract because, per the stated requirement, there is no policy allowing the action; therefore, the result should be denied.

T1 – With a policy allowing an action, the action is allowed

Clearly, from the requirement, the presence of such a policy should result in the action being allowed. Designate this policy P1, as you will use it again shortly.

Next comes the test to confirm that when there is a policy that denies U from performing A, the authorisation decision is, in fact, “denied”.

T2 – With a policy denying an action, the action is denied

Here, create a single policy P2 that denies U from performing A and check the result. The result should be “denied,” but what does that tell us? If you remove P2, so that you now have no policies at all, and check the result again, it will still come back with “denied”. Why? Because there was no policy allowing A. This test is a classic phantom test: the requirement is to verify that the presence of P2 caused the outcome to be denied, yet removing P2 yielded the same result, so the fact that the test passes does not prove anything.

You could turn this phantom test into a solid test, though, by bringing in policy P1 that was created earlier. It allows U to perform A. Thus, instead of using just P2 in this test, use P1 + P2. If the result is “denied,” it is due to the presence of P2. Can you prove that? Certainly: if you remove P2, the result will be “allowed” because there exists a policy (P1) that allows U to perform A, and therefore the test will fail. This test, with P1 + P2, is now a solid test!
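In the sketch from above, that solid version of the test might read:

```python
def test_deny_policy_overrides_allow_policy():
    P1 = Policy(user="U", action="A", effect=Effect.Allow)  # allows U to perform A
    P2 = Policy(user="U", action="A", effect=Effect.Deny)   # denies U from performing A
    # With P1 present, a "denied" outcome can only be P2's doing...
    assert is_authorised([P1, P2], "U", "A") is False
    # ...and removing P2 flips the decision, so this test cannot pass vacuously.
    assert is_authorised([P1], "U", "A") is True
```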

Maxim #4:
Whenever possible, convert phantom tests to real tests.

Conclusion

Phantom tests are sneaky. They can be hard to spot, and they provide a false sense of security. You have taken the first step to combat phantom tests just by being aware of their existence. When you do uncover a phantom test, look for ways to turn it into a solid test so that it does not always pass. You should be able to make the test fail and then, by adding in the key behaviour you want to verify, make it pass. If, however, you must have a test that checks that nothing happened, make sure it is at least accompanied by a test that checks that something happened, too.