November 5, 2010


Consider Mutation Testing. No, Seriously.

Prompted by a tweet from @pkirkham linking to what's probably the only book on mutation testing (an eye-watering 122 quid on Amazon), I thought it might help to give a little explanation of this relatively unknown, but potentially extremely powerful, practice.

When we write tests for our code, there is often a gap between the rules our tests will enforce and what the code actually does (and is, hopefully, supposed to do).

Let me use a simple example. Here's the implementation code for a small class that generates sequences of Fibonacci numbers of a specified length, which must be no shorter than 8 numbers and no longer than 50.
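A minimal sketch of such a class might look like this. The class name, the method name, the long[] return type, the choice to start the sequence at 0, 1 and the exception message are all assumptions for illustration, not necessarily what the real code looks like.

```java
import java.util.Arrays;

// Hypothetical sketch: class and method names are assumptions.
class FibonacciSketch {

    // Returns the first `length` Fibonacci numbers, starting 0, 1.
    // long is used because the later values of a 50-number sequence overflow an int.
    static long[] generate(int length) {
        if (length < 8 || length > 50) {
            throw new IllegalArgumentException("length must be between 8 and 50");
        }
        long[] sequence = new long[length];
        sequence[0] = 0;
        sequence[1] = 1;
        for (int i = 2; i < length; i++) {
            sequence[i] = sequence[i - 1] + sequence[i - 2];
        }
        return sequence;
    }

    public static void main(String[] args) {
        // prints [0, 1, 1, 2, 3, 5, 8, 13]
        System.out.println(Arrays.toString(generate(8)));
    }
}
```

The part that matters for the rest of this post is the guard at the top of generate, with its two conditions joined by ||.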



I ran EclEmma to get coverage stats, and every line of this code is covered by tests. But, as we all probably know from bitter experience, that doesn't mean this code is being 100% tested.

Mutation testing is a technique for testing our tests. If our goal is to have tests that will catch bugs, then the best test of our tests is whether or not they actually do catch bugs.

By deliberately introducing a bug - a basic programming error - and then running our tests, we can quickly see if the tests catch the bug. If they do, then great. But if we introduce a bug and no tests fail then there is obviously a hole in our automated testing mosquito net that we need to close.

Mutation testing was first proposed by Richard Lipton in a research paper in 1971. His idea is simplicity itself. We introduce a "mutation" into our code, resulting in a new, very slightly different "mutant" version of it. This is an instance of one of a pre-defined set of "mutation operators" which mimic common coding errors.

An example of a "mutation operator" might be to replace a conditional operator, so that || becomes &&, or to replace a relational operator, so that == becomes <=.

In the Fibonacci example, I replaced the boolean expression "length > 50" with "false".
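Sketched in code, that mutant might look something like this. The class and method names are assumptions for illustration, and the single boundary check shown is a stand-in for the existing tests rather than a copy of them.

```java
// Hypothetical mutant of the Fibonacci generator; names are assumptions.
class FibonacciMutantSketch {

    static long[] generate(int length) {
        // Mutation: the right-hand condition "length > 50" has been replaced with "false".
        if (length < 8 || false) {
            throw new IllegalArgumentException("length must be between 8 and 50");
        }
        long[] sequence = new long[length];
        sequence[0] = 0;
        sequence[1] = 1;
        for (int i = 2; i < length; i++) {
            sequence[i] = sequence[i - 1] + sequence[i - 2];
        }
        return sequence;
    }

    // Stand-in for the existing boundary test: too-short requests must be rejected.
    static boolean rejectsTooShort() {
        try {
            generate(7);
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        // The existing check still passes against the mutant...
        System.out.println("too-short test passes: " + rejectsTooShort());
        // ...even though a 51-number sequence is now happily generated.
        System.out.println("length of generate(51): " + generate(51).length);
    }
}
```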



Once a single "mutant" has been created, we run our tests. In this case, all the tests passed because I'm only testing that an exception is thrown if sequences shorter than 8 numbers are requested, not longer than 50. So I need to add another test for the right-hand side of the || to plug the gap.
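The missing test could be as simple as the following sketch. It's written as a plain Java check rather than with a test framework, just to keep it self-contained. The names are hypothetical, and the two stand-in generators only reproduce the guard condition, since that's the only part this test exercises.

```java
import java.util.function.IntFunction;

// Hypothetical sketch of the extra test for the right-hand side of the ||.
class UpperBoundTestSketch {

    // True when the generator rejects the requested length with IllegalArgumentException.
    static boolean rejects(IntFunction<long[]> generator, int length) {
        try {
            generator.apply(length);
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    // Stand-in for the original code: only the guard is reproduced here.
    static long[] original(int length) {
        if (length < 8 || length > 50) {
            throw new IllegalArgumentException();
        }
        return new long[length];
    }

    // Stand-in for the mutant: "length > 50" replaced with "false".
    static long[] mutant(int length) {
        if (length < 8 || false) {
            throw new IllegalArgumentException();
        }
        return new long[length];
    }

    public static void main(String[] args) {
        // The new test: a request for 51 numbers must be rejected.
        System.out.println("passes on original: " + rejects(UpperBoundTestSketch::original, 51));
        System.out.println("passes on mutant:   " + rejects(UpperBoundTestSketch::mutant, 51));
    }
}
```

The new test passes against the original guard and fails against the mutant, which is exactly what we want from it.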

On the other hand, if I replace the || with &&, a test fails. My tests successfully kill this mutant.



By randomly mutating individual lines of code, and individual expressions, we can quite thoroughly check that our code is being properly tested and that our tests will be likely to detect bugs.

Of course, if you have 1,000,000 lines of tested code, then you will never find the time or the resources to do this by hand. Luckily, automated mutation testing tools do exist. For Java, there is Jester, written by London's very own Ivan Moore. There's a .NET version called Nester, which is in need of updating, sadly.

There are degrees of mutation testing we can employ, depending on how reliable we need our tests to be vs. how much time and resources we have for mutation testing. Traditional, or "strong", mutation testing requires that entire programs be checked for each and every mutant to ensure its outputs are different to the unmutated version. In layman's terms, after each mutation we run all of our tests. For even a small program (say, 1000 LOC), this will obviously take some considerable time. For 1,000,000 LOC+ programs, we're talking server farms! (Though these days renting cloud-based computing muscle by the hour makes such things surprisingly economical.)

Needless to say, strong mutation testing is not something you would do as part of your TDD cycle, or even as part of an integration build and test process. When I've used Jester and equivalents before, it's been to run mutation tests overnight or over a weekend, producing daily/weekly reports of where the gaps in our tests might be.
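To make the "run all of our tests after each mutation" idea concrete, here's a toy sketch of what such a run boils down to. It is not how Jester works internally. Everything in it is hypothetical: the mutants are hand-written variants of the guard "length < 8 || length > 50", and the "tests" are two simple checks on that guard.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntPredicate;
import java.util.function.Predicate;

// Hypothetical sketch: a guard returns true when a length should be rejected.
class MutationRunSketch {

    // The test suite: each test returns true when it passes for the given guard.
    static List<Predicate<IntPredicate>> tests() {
        List<Predicate<IntPredicate>> tests = new ArrayList<>();
        tests.add(guard -> guard.test(7));   // a length of 7 must be rejected
        tests.add(guard -> !guard.test(8));  // a length of 8 must be accepted
        return tests;
    }

    // One mutation per mutant, each mimicking a simple coding error.
    static Map<String, IntPredicate> mutants() {
        Map<String, IntPredicate> mutants = new LinkedHashMap<>();
        mutants.put("|| replaced with &&", length -> length < 8 && length > 50);
        mutants.put("length > 50 replaced with false", length -> length < 8 || false);
        mutants.put("< replaced with <=", length -> length <= 8 || length > 50);
        return mutants;
    }

    // A mutant is killed when at least one test fails against it.
    static boolean killed(IntPredicate mutant) {
        for (Predicate<IntPredicate> test : tests()) {
            if (!test.test(mutant)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The suite is run once per mutant, which is why the cost grows so quickly.
        int killedCount = 0;
        for (Map.Entry<String, IntPredicate> entry : mutants().entrySet()) {
            boolean dead = killed(entry.getValue());
            if (dead) {
                killedCount++;
            }
            System.out.println(entry.getKey() + ": " + (dead ? "killed" : "survived"));
        }
        System.out.println(killedCount + " of " + mutants().size() + " mutants killed");
    }
}
```

With only these two tests, the && and <= mutants are killed and the "false" mutant survives, pointing at the missing upper-bound test. A report of survivors like this is the kind of thing an overnight run hands you in the morning.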

And if this all seems a bit technically beyond your average Java or C# coding shop, then don't despair. Many benefits of mutation testing can be achieved by much more pragmatic means. Adversarial pair programming can produce quite good results in this respect. Let's say I was pairing with someone on my Fibonacci code above. And I wrote this test:
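As a sketch of the kind of test in question, assume it only checks that something is thrown when a too-short sequence is requested. The names here are hypothetical, and it's a plain Java check rather than a framework test so that it stands on its own.

```java
import java.util.function.IntFunction;

// Hypothetical sketch of a test that is looser than its author intended.
class LooseTestSketch {

    // Passes when asking for 7 numbers throws any runtime exception at all.
    static boolean tooShortIsRejected(IntFunction<long[]> generator) {
        try {
            generator.apply(7);
            return false;                  // nothing was thrown: test fails
        } catch (RuntimeException any) {
            return true;                   // anything thrown counts as a pass
        }
    }
}
```

The gap is in the catch clause: the test never says which exception it expects.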



Now, I intend that a specific kind of exception be thrown, in this case an IllegalArgumentException. I hand the keyboard to my pairing partner, and she plays devil's advocate by implementing it thus:
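One way she might do that, purely as an illustration, is to throw some other exception entirely. The class name and the choice of UnsupportedOperationException are assumptions; any exception that isn't an IllegalArgumentException would make the same point.

```java
// Hypothetical devil's-advocate implementation; names and exception type are assumptions.
class DevilsAdvocateSketch {

    static long[] generate(int length) {
        if (length < 8 || length > 50) {
            // Satisfies a test that only checks "an exception was thrown",
            // but it is not the IllegalArgumentException the test's author had in mind.
            throw new UnsupportedOperationException("length must be between 8 and 50");
        }
        long[] sequence = new long[length];
        sequence[0] = 0;
        sequence[1] = 1;
        for (int i = 2; i < length; i++) {
            sequence[i] = sequence[i - 1] + sequence[i - 2];
        }
        return sequence;
    }
}
```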



She runs the tests and they pass, even though that was definitely not what I'd intended. She hands the keyboard back to me and I fix the test to close the gap:
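A tightened version of the test might look like the sketch below, again with hypothetical names and as a plain Java check. The difference is that only an IllegalArgumentException counts as a pass.

```java
import java.util.function.IntFunction;

// Hypothetical sketch of the test after the gap has been closed.
class StrictTestSketch {

    // Passes only when asking for 7 numbers throws IllegalArgumentException.
    static boolean tooShortThrowsIllegalArgument(IntFunction<long[]> generator) {
        try {
            generator.apply(7);
            return false;                              // nothing was thrown
        } catch (IllegalArgumentException expected) {
            return true;                               // the intended exception
        } catch (RuntimeException other) {
            return false;                              // wrong kind of exception
        }
    }
}
```

An implementation that throws anything else now fails the test, so the devil's advocate has to find another hole.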



I know this all sounds like a lot of fuss and bother, but I will end by stressing two very important things I've learned about mutation testing:

1. It's not THAT complicated, once you get the hang of it. And it's not THAT much extra work.
2. It can have a profound effect on the effectiveness of your tests and the resulting reliability of your code. This is "TDD++" in terms of the results it can achieve, using fairly ordinary work-a-day tools and skills. In my own experience, the benefits often vastly outweigh the costs.

Especially when you realise that these techniques can be applied to acceptance tests and therefore to defining the requirements. It's only a small conceptual jump from one programmer deliberately gaming another's unit tests to developers deliberately gaming acceptance tests with system-level mutations, and asking the customer "is that what you really intended?"


In summary, then, mutation testing - and the principles behind it - can be a very powerful tool at multiple levels of abstraction from low-level code up to requirements (in an ATDD/BDD style of development especially). You should check it out, you crazy cats!



