April 25, 2016

Mutation Testing & "Debuggability"

More and more teams are waking up to the benefit of checking the levels of assurance their automated tests give them.

Assurance, as opposed to coverage, answers a more meaningful question about our regression tests: if the code was broken, how likely is it that our tests would catch that?

To answer that question, you need to test your tests. Think of bugs as crimes in your code, and your tests as police officers. How good are your code police at detecting code crimes? One way to check would be to deliberately commit code crimes - deliberately break the code - and see if any tests fail.

This is a practice called mutation testing. We can do it manually, while we pair - I'm a big fan of that - or we can do it using one of the increasingly diverse (and rapidly improving) mutation testing tools available.

For Java, for example, there are tools like Jester and PIT. What they do is take a copy of your code (with unit tests), and "mutate" it - that is, make a single change to a line of code that (theoretically) should break it. Examples of automated mutations include turning a + into a -, or a < into <=, or ++ into --, and so on.
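For example, given a trivial method like this (a made-up example, not code from my Combiner spike), a mutation tool might flip the arithmetic operator:

    public class Calculator {
        public int add(int a, int b) {
            return a + b;   // a mutation tool might change this to: return a - b;
        }
    }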

After it's created a "mutant" version of the code, it runs the tests. If one or more tests fail, then they are said to have "killed the mutant". If no test fails, then the mutant survives, and we may need to have a think about whether that line of code that was mutated is being properly tested. (Of course, it's complicated, and there will be some false positives where the mutation tool changed something we don't really care about. But the results tend to be about 90% useful, which is a boon, IMHO.)
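Sticking with that made-up Calculator example, a test with a meaningful assertion kills the + to - mutant, while a weaker test lets it survive:

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class CalculatorTest {

        // Kills the mutant: the mutated add(2, 3) returns -1 instead of 5
        @Test
        public void addsTwoNumbers() {
            assertEquals(5, new Calculator().add(2, 3));
        }

        // Does NOT kill it: 0 + 0 and 0 - 0 are both 0, so the mutant survives,
        // flagging that this line isn't tested as thoroughly as we thought
        @Test
        public void addingZerosGivesZero() {
            assertEquals(0, new Calculator().add(0, 0));
        }
    }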

Here's a mutation testing report generated by PIT for my Combiner spike:

[Screenshot: PIT mutation testing report for the Combiner spike]

Now, a lot of this may not be news for many of you. And this isn't really what this blog post is about.

What I wanted to draw your attention to is that - once I've identified the false positives in the report - the actual level of assurance looks pretty high (about 95% of the mutations I cared about got killed). Code coverage is also pretty high (97%).

While my tests appear to be giving me quite high assurance, I'm worried that may be misleading. When I write spikes - intended as a proof of concept and not to be used in anger - I tend to write a handful of tests that work at a high level.

This means that when a test fails, it may take me some time to pinpoint the cause of the problem, as it may be buried deep in the call stack, far removed from the test that failed.

For a variety of good reasons, I believe that tests should stick close to the behaviour being tested, and have only one reason to fail. So when they do fail, it's immediately obvious where the problem is and what it might be.

Along with a picture of the level of assurance my tests give me, I'd also find it useful to know how far removed from the problem they are. Mutation testing could give me an answer.

When tests "kill" a mutant version of the code, we know:

1. which tests failed, and
2. where the bug was introduced

Using that information, we can calculate the depth of the call stack between the two. If multiple tests catch the bug, then we take the shallowest depth out of those tests.
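As far as I'm aware, none of the current tools report this, but the calculation itself is straightforward. Here's a rough sketch - illustrative names only - assuming we could extract the number of stack frames between each killing test's failing assertion and the mutated line:

    import java.util.List;
    import java.util.Map;

    public class DebuggabilityCalculator {

        // For one mutant: given the stack depth between each killing test's
        // failing assertion and the mutated line, take the shallowest.
        public static int shallowestDepth(List<Integer> depthsOfKillingTests) {
            return depthsOfKillingTests.stream()
                    .mapToInt(Integer::intValue)
                    .min()
                    .orElseThrow(() -> new IllegalStateException("Mutant was not killed"));
        }

        // Across a whole run: average the shallowest depths of all killed mutants.
        // Lower is better - failing tests sit closer to the code being broken.
        public static double averageDepth(Map<String, List<Integer>> killedMutants) {
            return killedMutants.values().stream()
                    .mapToInt(DebuggabilityCalculator::shallowestDepth)
                    .average()
                    .orElse(0.0);
        }
    }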

This would give me an idea of - for want of a real word - the debuggability of my tests (or rather, the lack of it). The shallower the depth between bugs and failing tests, the higher the debuggability.

I also note a relationship between debuggability and assurance. In examining mutation testing reports, I often find that the problem is that my tests are too high-level, and that if I wrote more focused tests closer to the code doing the work, they would catch edge cases I didn't think about at that higher level.



