January 27, 2014
Coverage!!!
A sleepless night spent thinking about test coverage, and these demons need to be exorcised (in blog form) so I can move on and think about other things, like breakfast and shoes and stuff.
Here's the thinking: I met up with the jolly nice Stuart Gale (@bishboria) last week and we talked about how the bar could be raised in mainstream software development as far as reliability of software is concerned.
A very sad story I became aware of today served as a timely reminder of the growing need for us to re-assess what we think of as a "critical system", and reminded me that current mainstream practices are just not sufficient.
Now, back to the theory. Bugs tend to hide in the places we rarely look for them. Hence, the more we test our software, the less buggy it tends to be. I fully appreciate that the industry solution to getting better test coverage to date has largely been to release the software, bugs and all, and let the users do the testing.
But, for those of us who think that approach is a tad - let's be generous - impolite, we might want to look for ways to really get into those nooks and crannies before the software ships.
If you've come into contact with code coverage tools like Emma and NCover, you'll be familiar with the concept of code coverage. This is a brute force measure that tells us which lines of code are executed when our tests run, and which aren't. While code coverage has little to say about whether our tests are effective enough, it does at least tell us with little ambiguity if code is not tested at all. This is our starter for ten.
If we're practicing TDD fairly rigorously, and following the Golden Rule that we don't write any production code unless we have a failing test that requires it, we'd expect code coverage to be high. Hooray for our side.
Of course, we could write meaningless tests (assertTrue(true) appears in more production code bases than you might imagine) and game the metric to get 100%-ish code coverage. Boo to the nasties.
A more meaningful measure might be to see how many lines of code are actually tested - that is to say, if we broke that line, a test would fail. Regular - but not necessarily frequent, because it's a slow process - mutation testing might reveal holes in our test safety net. That's a practice - automated by tools like Jester - in which we deliberately introduce programming errors into our code, run the tests, and if a test fails we say the test suite "killed the mutant". I highly recommend mutation testing, which is why no teams do it, obviously.
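To make the difference between a line being executed and a line being tested a bit more concrete, here's a minimal hand-rolled sketch in Python. The function `can_vote`, its mutant and the two toy test suites are all made up for illustration. Real tools generate and run the mutants for you, so treat this as a picture of the idea rather than how any particular tool works.

```python
def can_vote(age):
    """Original code: anyone aged 18 or over can vote."""
    return age >= 18


def can_vote_mutant(age):
    """Mutant: the boundary operator >= has been changed to >."""
    return age > 18


def weak_suite(fn):
    """Executes every line of fn, but never probes the boundary at 18."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    """Same checks as the weak suite, plus the boundary case."""
    return weak_suite(fn) and fn(18) is True


def kills(suite, original, mutant):
    """A suite kills a mutant if it passes on the original and fails on the mutant."""
    return suite(original) and not suite(mutant)


if __name__ == "__main__":
    print("weak suite kills mutant:", kills(weak_suite, can_vote, can_vote_mutant))
    print("strong suite kills mutant:", kills(strong_suite, can_vote, can_vote_mutant))
```

Both suites would score full line coverage on `can_vote`, but only the one that checks the boundary notices when the line is broken.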
Talking of things that no teams do, let's dig even deeper into the test coverage rabbit hole.
So, we now know that every line of code - or as near as dammit - is executed when we run the tests, and we know that we can't break a line of code without the tests picking that up. But I know from experience that even code that's that well-tested will still likely have bugs lurking in it in the deep, dark corners of unusual, once-in-a-blue-moon combinations of inputs and conditions. We may think we've done enough and that the occasional bug popping up like the fictional town of Brigadoon is no big deal. Turns out, we're wrong. It all depends on the potential consequences of that bug. Winning the lottery may be a 1-in-14,000,000 long-shot, but people still win every week. If your software has 800,000,000 users, performing billions of interactions every day, that 1-in-a-million bug is going to appear somewhere and potentially ruin someone's life.
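The arithmetic behind that is worth a quick look. This is a rough sketch, assuming each interaction independently triggers the bug with a probability of one in a million, and reading "billions" as just one billion interactions a day. Both numbers are my own round-figure assumptions, not measurements of any real system.

```python
import math

p = 1e-6            # assumed chance that a single interaction hits the bug
n = 1_000_000_000   # assumed number of interactions per day

# Expected number of times the bug shows up in a day.
expected_hits = p * n

# P(at least one hit) = 1 - (1 - p)^n, computed with log1p/expm1
# so that the tiny value of p is not lost to rounding.
p_at_least_one = -math.expm1(n * math.log1p(-p))

if __name__ == "__main__":
    print(f"expected occurrences per day: {expected_hits:.0f}")
    print(f"chance of at least one per day: {p_at_least_one:.6f}")
```

Under those assumptions the "rare" bug is expected about a thousand times a day, so the question is who it happens to and what it costs them, not whether it happens.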
So we dig deeper. Beyond code coverage is the possible combinations that software allows.
For example, a business rule may be driven by several conditional expressions, and there may be dozens of possible combinations of these conditionals (e.g., A AND B AND C, A AND NOT B AND NOT C, etc.). Did we test all the combinations we needed to test? Let's talk about how we get better conditional coverage.
How do we visualise these conditions and search for hidden combinations? Programmers have several useful techniques we can draw on. One of the most used is the decision table. A decision table is a lot like a logic truth table: it enables us to methodically enumerate all the possible logical combinations, and can highlight interesting ones we might have missed.
If only for this, a decision table at the very least gives you a rough visual guide as to the actual logical complexity of the code.
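Here's a small sketch of a decision table built in code, assuming a made-up business rule with three conditions. The rule `discount_applies`, the condition names and the set of "already tested" combinations are all hypothetical, invented purely to show the enumeration.

```python
from itertools import product


def discount_applies(is_member, has_coupon, over_threshold):
    """Hypothetical rule: members with a coupon or a large order get a discount."""
    return is_member and (has_coupon or over_threshold)


CONDITIONS = ("is_member", "has_coupon", "over_threshold")

# Combinations that our (imaginary) existing tests already exercise.
tested = {
    (True, True, True),
    (True, False, False),
    (False, False, False),
}


def decision_table():
    """Pair every True/False combination of the conditions with the rule's outcome."""
    combos = product((True, False), repeat=len(CONDITIONS))
    return [(combo, discount_applies(*combo)) for combo in combos]


if __name__ == "__main__":
    print(" | ".join(CONDITIONS), "| discount | tested?")
    for combo, outcome in decision_table():
        cells = ["Y" if value else "N" for value in combo]
        flag = "yes" if combo in tested else "NO"
        print(" | ".join(cells), "|", "Y" if outcome else "N", "|", flag)
```

Three conditions give eight rows, and with only three of them tested the gaps are easy to see. Each extra condition doubles the table, which is also a decent hint about how complex the rule has become.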
Another kind of combination we might like to consider is the set of possible unique paths through the code. This can be especially useful when considering event-driven code. We may capture a user journey through a UI, for example, as a sequence of user actions to which the system responds. There may be many, many possible user journeys. (The same applies to the many possible lifecycles of objects, depicted as a sequence of interactions with that object.) Did we consider all the interesting ones? Are there any sequences or paths through the code we missed in which a bug may be lurking, waiting to be sprung? If you've ever found yourself saying "I don't seem to be able to get off this screen", that's a path-based bug that was sprung on you.
For brevity, we call this path coverage (or sometimes sequence coverage). Because "set of possible unique paths through the code" won't fit on a t-shirt.
A popular tool for exploring path coverage is the finite state machine, often written down as a state transition table. These look a teeny bit like truth tables and decision tables, but with the emphasis on event-driven logic and state transitions. As with those other techniques, a state transition table can help to draw out gaps in our understanding and find combinations we didn't think of. Another very powerful technique that almost nobody uses.
Finally, let's dip our toe into the London School of TDD, where the testing emphasis is not only on object state and all that malarkey, but also on the interactions between objects. Bertrand Meyer teaches us in his book Object-Oriented Software Construction that software is only correct if all the interactions between its internal components are correct. In layman's terms, it doesn't matter if all our objects do what they're supposed to in isolation. They must also play nice with each other.
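As a rough illustration, here's a state transition table for a tiny, entirely made-up application with three states and four events. The state names, event names and transitions are all hypothetical. The sketch lists every state and event pair the table says nothing about, and every event sequence of a given length that it does allow.

```python
STATES = ("logged_out", "logged_in", "editing")
EVENTS = ("login", "edit", "save", "logout")

# Hypothetical transition table: (state, event) -> next state.
TRANSITIONS = {
    ("logged_out", "login"): "logged_in",
    ("logged_in", "edit"): "editing",
    ("logged_in", "logout"): "logged_out",
    ("editing", "save"): "logged_in",
}


def undefined_pairs():
    """Every (state, event) pair that the table does not define."""
    return [(s, e) for s in STATES for e in EVENTS if (s, e) not in TRANSITIONS]


def paths(start, length):
    """All event sequences of the given length that the table allows from start."""
    if length == 0:
        return [[]]
    result = []
    for (state, event), target in TRANSITIONS.items():
        if state == start:
            result.extend([event] + rest for rest in paths(target, length - 1))
    return result


if __name__ == "__main__":
    for pair in undefined_pairs():
        print("no transition defined for", pair)
    for path in paths("logged_out", 3):
        print("path:", " -> ".join(path))
```

The undefined pairs are where the interesting questions live. What should happen if the user logs out while editing, for instance? The table in this sketch doesn't say, and that's exactly the kind of gap it's meant to draw out.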
So we might concern ourselves with how thoroughly we've tested the interactions between our objects. The concept of interaction coverage is less widely known than code, conditional and path coverage, but arguably it's a concept whose time has come.
If you examine your code, you can count all the examples of one type of object interacting with another. In interaction-based testing, we're interested in whether those interactions are observing the rules we believe should govern them. Does the client satisfy the pre-condition before the interaction? Does the supplier give the client what they were expecting? Do any invariant rules still hold before and after the interaction? (If these are multi-threaded interactions, there will be even more rules, potentially about what must hold true throughout.)
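One way to picture those rules is to write them into the code as assertions. This is a minimal sketch, assuming a made-up `Account` supplier and two made-up clients. None of it comes from a real library, and plain `assert` statements are just the simplest stand-in for proper contract checking.

```python
class Account:
    """Supplier: a hypothetical bank account."""

    def __init__(self, balance):
        self.balance = balance

    def withdraw(self, amount):
        # Pre-condition: it's the client's job to ask for a sensible amount.
        assert 0 < amount <= self.balance, "pre-condition broken by client"
        before = self.balance
        self.balance -= amount
        # Post-condition: what the supplier promises in return.
        assert self.balance == before - amount, "post-condition broken by supplier"
        # Invariant: must hold before and after every interaction.
        assert self.balance >= 0, "invariant broken"
        return amount


def pay_bill(account, amount):
    """Client that checks before it asks, so it honours the pre-condition."""
    if amount <= 0 or amount > account.balance:
        return False
    account.withdraw(amount)
    return True


def careless_pay_bill(account, amount):
    """Client that just asks, and breaks the contract on large amounts."""
    account.withdraw(amount)
    return True


if __name__ == "__main__":
    account = Account(100)
    print("careful client paid 30:", pay_bill(account, 30))
    print("careful client paid 500:", pay_bill(account, 500))
```

A test that drives `careless_pay_bill` with an amount larger than the balance fails on the pre-condition, which tells us the fault is in the client, not the supplier. Note that Python skips `assert` statements when run with `-O`, so this only works as a testing aid.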
Visualising the set of all possible interactions defined in our code is not as trivial as writing a decision table or drawing a state chart. There may be millions of them. Object oriented software complicates things further, by allowing us to dynamically define which concrete type of object is actually being interacted with at runtime through polymorphism. The App may interact with the Command interface, but at runtime, is it the Edit command, or Undo, or Save?
Likely as not, it would require special tools like the ones used to measure code coverage and comprehensive mutation testing.
And all of these kinds of coverage overlap to some degree, too. So much of what a state chart would tell you about test cases might be found from looking at a decision table, for example. But not everything, potentially.
I, for what it's worth, have found these techniques to be powerful, useful and practical on real commercial projects. Applied wisely, focusing the effort where the higher risks are, they can make bugs rare enough that I might even consider getting into a taxi driven by your software. Well, almost. But certainly, I would feel much safer and more confident using the everyday software that we all think of as non-critical that, when the blue moon shines above Brigadoon, turns out to be anything but.
UPDATE: Couple of things since I posted this ramble.
1. Chris Vest (@chvest) recommends pitest for mutation testing in Java. As for other languages, not sure what's out there these days. It has to be said, Java often tends to be the target platform for this sort of thing. That could be a university thing. Maybe in 2020 they'll all be in Python.
2. Another kind of coverage that I meant to mention but forgot in my sleep-deprived stupor: concurrent coverage. This is a biiiiig topic. Think of all the possible sequences of actions in a single-threaded state machine. Now add a second state machine that accesses the same data at the same time. We now have to consider how the sequences of instructions in both processes might now be interleaved, and hence we get a potential explosion of possible paths to consider. Visualising the interactions of two or more threads on the same data can be helped with UML sequence diagrams, and this has often been the informal way I've thought about multithreaded logic. But, as with state charts, their informal nature can easily hide combinations we didn't think of, so on critical code, we may find value in techniques like model checking. But, I suspect, much work needs to be done to bring these kinds of tools into the mainstream.
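To get a feel for that explosion, here's a toy sketch that enumerates every interleaving of two "threads", each doing a non-atomic increment (a read followed by a write) on a shared counter. It simulates the schedules in a single thread. The thread names and the read/write model are my own simplifying assumptions, and it's nowhere near a real model checker.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute read/write steps against a shared counter and return its final value."""
    shared = 0
    local = {}
    for thread, op in schedule:
        if op == "read":
            local[thread] = shared
        else:  # "write": store the value this thread read earlier, plus one
            shared = local[thread] + 1
    return shared


# Two hypothetical threads, each incrementing the counter in two steps.
T1 = [("T1", "read"), ("T1", "write")]
T2 = [("T2", "read"), ("T2", "write")]

if __name__ == "__main__":
    schedules = list(interleavings(T1, T2))
    lost = [s for s in schedules if run(s) != 2]
    print(len(schedules), "interleavings,", len(lost), "lose an update")
```

Two steps per thread already gives six schedules, and four of them lose an update. Add more steps or more threads and the count grows fast enough that picking paths by eye stops being realistic.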
Finally, before I forget, one extremely useful - and much overlooked - technique for improving the coverage of our tests is inspections. There's much to be gained by carefully reading our code and asking of each line "what could go wrong here?" Code inspections, as they exist in the messy, higgledy-piggledy world today, are not much cop, though. More rigorous techniques, like guided inspection, where we pick test cases and step through the code in a kind of mock execution, can really help focus our minds and draw out issues we didn't spot.
UPDATE TO THE UPDATE:
Oh yeah, almost forgot. Another great technique for improving test coverage is to make your code simpler. Simpler code tends to need less testing, because there are fewer things that could go wrong. It's important to be especially aware of this when agreeing customer requirements and writing acceptance tests. Once we've committed to a complex feature, our software must handle every potential input meaningfully, which generates a lot of test cases. You see, cost and time aren't the only variables to trade off. Reliability should be a key factor to consider.