Many computer systems nowadays have their ‘correctness’ checked using sample testing: an engineer programs a scenario and runs the test system against it. The system’s actual output is captured and measured against the expected output, and if these match up, then the test is passed. But for some systems, you need to test all scenarios.
This is the case for highly critical (and complex) systems such as satellites, rockets, railway signals, on-board flight controllers and flood-control barriers, where simply sample testing is just not enough. They are not much noticed, but have a huge impact on daily life, especially when they start failing. Imagine space rockets crashing from the sky or trains running into each other because of small bugs in the system. Nobody wants that; failures are too costly in terms of lives and money. Lots of engineers specialising in quality assurance, safety, dependability, reliability and performability put every effort into prevent them from happening, but it’s not an easy task. The kind of systems they analyse so thoroughly are generally highly multi-threaded or event-driven, and it is fiendishly difficult to test them exhaustively for correctness because of the huge number of ways they can respond.
Imagine space rockets
crashing from the
sky, or trains running
into each other
because of small bugs
in the system.
Nobody wants that.
In order to test all the possible responses of a given system, the scientific community has for the last few decades been investigating model checking.
Model checking tools explore all scenarios and verify whether responses meet expectations in each instance. These techniques have been successfully applied to verify NASA’s Mars Rover, Lucent Technologies’ PathStar telephone switch and the storm surge barrier control system in the Dutch Delta Works dikes. So far, so good. But past model checking techniques required engineers to create and verify a model of the system, when it is the system itself which ultimately needs to be functioning correctly. So that is why, in the past decade, people from NASA started to apply model checking techniques at a system-level.
With that in mind, a team comprising of myself, my predecessor and our supervisor – Theo Ruys – developed the MoonWalker model checker at the University of Twente. MoonWalker verifies multithreaded .NET programs by inspecting the CIL bytecode instructions and exploring all possible interleavings of the threads. It is also the only publicly available .NET model checker to date that employs a set of efficient algorithms and is, in terms of performance, on par with other state-of-the-art tools. You can read all about it in Software Model Checking for Mono, which contains an introductory chapter on model checking and an in-depth explanation of the algorithms out there.
We’ve seen many
case studies where
clusters with dozens
of GB’s of memory
and dozens of
Efficient model checking is all about employing the best algorithms, or else you easily end up hitting the limits of your computer’s processing power. We’ve seen many case studies where model checkers completely consume clusters with dozens of GB’s of memory and dozens of processing cores. It’s a generally known problem in the model checking community, and it drives us and our colleagues around the world to eagerly develop better algorithms.
So recently we spent months working out three new algorithms for MoonWalker; Theoretically they looked much better, but we wanted to see whether they would really speed up MoonWalker in practice. Naturally, we needed to implement them as efficiently as possible to squeeze the most out of them, but this is not easily done in a tool as complex as a model checker, which is constructed with some of the most advanced algorithms around. So we made heavy use of a profiler to understand bottlenecks and aggressively optimise the algorithms.
Measure First, Optimise Later
When developing MoonWalker we read some interesting papers on hashing in model checking, and recalled from our experiences that the hash function is called up to billions of times during an average verification. Here was an opportunity to devise a faster hash function! Based on cyclic polynomials and Knuth hashing with a constant time-complexity, this would be much faster than the linear time-complexity of traditional algorithms. And so with growing excitement, this new algorithm was worked out on paper and called incremental hashing. When we proved that the theory behind it was sound, we enthusiastically implemented it in MoonWalker.
The results of the first verifications runs were not what we expected. We used small .NET models, and simply used the cygwin time command to measure the execution times. There was no decrease and we initially thought this was because of the simplicity of our models. We crafted a few more complex ones and reran the verifications – also with no decrease in execution time. This was puzzling and disappointing because, in theory, our incremental hashing should have been much faster.
We started wondering whether the .NET virtual machine might be able to optimise the old algorithm to the point where it was as efficient as the new one. So the old hashing algorithm was blown up by adding some unoptimisable loops, but this also didn’t give us decreases in execution time.
Completely stumped, we started to google for answers, specifically trying to understand how the compiler and the CPU optimised the instruction code. We even went through the optimised CIL bytecode step-by-step using the debugger built in to Visual Studio.
In a flash of insight, we realised we should measure the cumulative time spent in the old hash function and the incremental hash function, and look for an absolute difference.
We immediately began googling for .NET profilers and found several. We finally settled on ANTS profiler because it shows the profiling results in a way that gives a quick overview and is fast to comprehend. Using the profiler in detailed mode, we found that the absolute amount of time spent in the old hash function was much higher than the incremental hash function. Multiple times higher, in fact. So the big question was:
Why wasn’t this difference reflected in the total running time?
When in detailed mode, ANTS can show a piece of the source code with indications of the amount of time spent on each line. We analysed both the old hashing function and the incremental hashing function in this incredibly fine-grained way… and still couldn’t explain why there wasn’t a difference!
Then old optimisation wisdom came to mind – stuff that we’d read years back in an old ‘optimising Java code’ article:
We should only optimise the bottlenecks, which are measured by their relative stake in the total execution time.
A quick look at the table showed that the old hashing algorithm had a relative stake of near zero, so plugging in a better hashing algorithm would have no effect on the total execution time. We also realised that the absolute time difference, while still big, was in milliseconds… So what we had spent all that time and energy doing was optimising a part of MoonWalker that didn’t need it.
Fortunately, the merits of our incremental hashing did not have to be limited to a neat theoretic result. We profiled another model checker developed at NASA – the SPIN model checker – and observed that their hashing was a bottleneck. We implemented our incremental algorithm there and immediately saw up to a two-fold performance increase, depending on the model. Our algorithm and extensive experimental results have also recently been published at SPIN’08 (a well-known model checking conference)
Also, while profiling and measuring the effects of hashing in MoonWalker, we did detect that its garbage collector algorithm (not to be confused with the garbage collector in the virtual machine) had a far more significant stake – on average, 55%. We saw this as a great optimisation opportunity and also worked out a new algorithm to reduce this stake.
Measure, Interpret and Optimise
There are a lot of opportunities to optimise a model checker. But it’s important is that one optimises the things that matter: the algorithms that take the biggest relative stake during execution. We invented a slick algorithm that worked well on paper, but didn’t make a difference in MoonWalker, and we optimised prematurely.
Along with measuring bottlenecks, interpretation of the results is just as important – when we initially used ANTS profiler, we looked at the absolute times when we should have looked at the percentages. Proper use of a profiler is important here! The real lesson to be learned here is:
Measure first, interpret carefully and optimise sensibly.