July 15, 2017

Learn TDD with Codemanship

Finding Load-Bearing Code - Thoughts On Implementation

I've been unable to shake this idea about identifying the load-bearing code in our software.

My very rough idea was to instrument the code and then run all our system or customer tests and record how many times each method is executed. The more times a method gets used (reused), the more critical it may be, and therefore the more of our attention it may need to make sure it isn't wrong.

This could be weighted by estimates for each test scenario of how big the impact of failure could be. But in my first pass at a tool, I'm thinking method call counts would be a simple start.

So, the plan is to inject this code into the beginning of the body of every method in the code under test (C# example), using something like Roslyn or Reflection.Emit:
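Something like this, roughly. To be clear, this is just a sketch of the kind of injected call I have in mind - the OrderProcessor class is invented for illustration, and MethodCallCounter is the counter class sketched below:

using System.Reflection;

public class OrderProcessor
{
    public decimal CalculateTotal(decimal[] lineItems)
    {
        // Injected at the start of the method body by the instrumentation step
        MethodCallCounter.Increment(
            MethodBase.GetCurrentMethod().DeclaringType.Name + "." +
            MethodBase.GetCurrentMethod().Name);

        // The original method body carries on as before
        decimal total = 0;
        foreach (var lineItem in lineItems)
        {
            total += lineItem;
        }
        return total;
    }
}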



The MethodCallCounter could be something as simple as a wrapper to a dictionary:
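Here's a rough sketch of what I mean - nothing clever, just a dictionary of method names to call counts behind a lock:

using System.Collections.Generic;

public static class MethodCallCounter
{
    private static readonly Dictionary<string, int> counts = new Dictionary<string, int>();
    private static readonly object sync = new object();

    public static void Increment(string methodName)
    {
        lock (sync)
        {
            if (!counts.ContainsKey(methodName))
            {
                counts[methodName] = 0;
            }
            counts[methodName]++;
        }
    }

    // Snapshot of the tallies so far, for reporting
    public static IDictionary<string, int> Counts
    {
        get
        {
            lock (sync)
            {
                return new Dictionary<string, int>(counts);
            }
        }
    }
}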



And this code, too, could be injected into the assembly we're instrumenting, or a reference added to a teeny tiny Codemanship.LoadBearing DLL.

Then a smidgen of code to write the results to a file (e.g., a spreadsheet) for further analysis.
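Building on the counter sketched above, that could be as basic as a CSV dump (again, just a sketch):

using System.IO;
using System.Linq;

public static class LoadBearingReport
{
    // Writes one "method,count" row per instrumented method,
    // sorted so the most heavily used methods come first
    public static void WriteCsv(string path)
    {
        var rows = MethodCallCounter.Counts
            .OrderByDescending(entry => entry.Value)
            .Select(entry => entry.Key + "," + entry.Value);
        File.WriteAllLines(path, rows);
    }
}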

The next step would be to create a test context that knows how critical the scenario is, using the customer's estimate of potential impact of failure, and instead of just incrementing the method call count, actually adds this number. So methods that get called in high-risk scenarios are shown as bearing a bigger load.
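Sketching that out, the counter might grow into something like this (names invented, as before):

using System.Collections.Generic;

// Holds the customer's estimate of the potential impact of failure
// for the scenario currently being executed - set by the test runner
public static class LoadBearingContext
{
    public static int FailureImpact = 1;
}

public static class MethodLoadCounter
{
    private static readonly Dictionary<string, int> load = new Dictionary<string, int>();
    private static readonly object sync = new object();

    public static void Record(string methodName)
    {
        lock (sync)
        {
            if (!load.ContainsKey(methodName))
            {
                load[methodName] = 0;
            }
            // Instead of incrementing by one, add the current scenario's impact,
            // so methods hit in high-risk scenarios bear a bigger load
            load[methodName] += LoadBearingContext.FailureImpact;
        }
    }
}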



External to this would be a specific kind of test runner (e.g., for NUnit, FitNesse, SpecFlow, etc.) that executes the tests while setting the FailureImpact value using information tagged in each customer test somehow.
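With NUnit, for example, one possibility - and it's only that - would be to tag each customer test with a property and pick it up in the set-up, something like this (the scenario and impact value are made up):

using NUnit.Framework;

[TestFixture]
public class ProcessRefundScenario
{
    [SetUp]
    public void SetFailureImpact()
    {
        // Read the customer's impact-of-failure estimate tagged on the test
        var impact = TestContext.CurrentContext.Test.Properties.Get("FailureImpact");
        LoadBearingContext.FailureImpact = impact != null ? (int)impact : 1;
    }

    [Test]
    [Property("FailureImpact", 10)]
    public void RefundIsPaidBackToOriginalCard()
    {
        // ... customer test steps would go here ...
        Assert.Pass();
    }
}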

Thoughts?



(PS. This is also kind of how I'd add logging to a system, in case you were wondering.)




July 13, 2017

Learn TDD with Codemanship

Do You Know Where Your Load-Bearing Code Is?

Do you know where your load-bearing code is?

In any system, there's some code that - if it were to fail - would be a big deal. Identifying that code helps us target our testing effort to where it's really needed.

But how do we find our load-bearing code? I'm going to propose a technique for measuring the "load-beariness" of individual methods. Let's call it criticality.

Working with your customer, identify the potential impact of failure for specific usage scenarios. It's a bit like estimating the relative value of features, only this time we're not asking "what's it worth?", we're asking "what's the potential cost of failure?" For example, applying the brakes in an ABS system would have a relatively very high cost of failure, while changing the font on a business report would have a relatively low one. Bear in mind, though, that a feature that's low-risk in itself might be used millions of times every day, greatly amplifying the risk.

Execute a system test case. See which methods were invoked end-to-end to pass the test. For each of those methods, assign the estimated cost of failure.

Now rinse and repeat with other key system test cases, adding the cost of failure to every method each scenario hits.

A method that's heavily reused in many low-risk scenarios could turn out to be cumulatively very critical. A method that's only executed once in a single very high-risk scenario could also be very critical.
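To illustrate how the numbers might add up - the scenarios, methods and impact estimates here are entirely invented - consider this little calculation:

using System;
using System.Linq;

public static class CriticalityExample
{
    public static void Main()
    {
        // Each scenario: the customer's estimated cost of failure,
        // and the methods exercised end-to-end when its test passes
        var scenarios = new[]
        {
            new { Impact = 1,  Methods = new[] { "FormatReport", "ValidateInput" } },
            new { Impact = 1,  Methods = new[] { "FormatReport" } },
            new { Impact = 50, Methods = new[] { "ApplyBrakes", "ValidateInput" } }
        };

        // Sum the impact of every scenario each method appears in
        var criticality = scenarios
            .SelectMany(s => s.Methods.Select(m => new { Method = m, s.Impact }))
            .GroupBy(x => x.Method)
            .ToDictionary(g => g.Key, g => g.Sum(x => x.Impact));

        foreach (var entry in criticality.OrderByDescending(e => e.Value))
        {
            Console.WriteLine(entry.Key + ": " + entry.Value);
        }
        // Prints: ValidateInput: 51, ApplyBrakes: 50, FormatReport: 2
        // Reuse across scenarios adds up, and one high-risk scenario dominates
    }
}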

As you play through each test case, you'll build a "heat map" of criticality in your code. Some areas will be safe and blue, some areas will be risky and red, and a few little patches of code may be white hot.

That is your load-bearing code. Target more exhaustive testing at it: random, data-driven, combinatorial, whatever makes sense. Test it more frequently. Inspect it carefully, many times with many pairs of eyes. Mathematically prove it's correct if you really need to. And, of course, do whatever you can to simplify it. Because simpler code is less likely to fail.

And you don't need code to make a start. You could calculate method criticality from, say, sequence diagrams, or from CRC cards, to give you a heads-up on how rigorous you may need to be when implementing the design.





May 30, 2017

Learn TDD with Codemanship

Do You Write Automated Tests When You Spike?

So, I've been running this little poll on Twitter asking devs if they write automated tests when they're knocking up a prototype (or a "spike", as Extreme Programmers call it).




The responses so far have been interesting, if not entirely unexpected. About two thirds of respondents rarely or never write automated tests for a spike.

Behind this is the ongoing debate about the limits of usefulness of such tests (and of TDD, if we take that a step further). Some devs believe that when a problem is small, or when they expect to throw away the code afterwards, automated tests add no value and just slow us down.

My own experience has been a slow but sure transition from not bothering with unit tests for spikes 15 years ago, to almost always writing some unit tests even on small experiments. Why? Because I've found - and I've measured myself doing it, so it's not just a feeling - I get my spike done faster when I have a bit of test scaffolding holding it up.

For sure, I'm not as rigorous about it as when I'm working on production code. The tests tend to be at a higher level, and there are fewer of them. I may break a few of my own TDD rules and have tests that ask more than one question, or I may not refactor the test code quite as fastidiously. But the tests are there, nevertheless. And I'm usually really grateful that I wrote some, as the experiment grows and maybe takes some unexpected twists and turns.

And if - as can happen - the experiment becomes part of the production code, I'm confident that what I've produced is just about good enough to be released and maintained. I'm not in the business of producing legacy code... not even by accident.

An example of one of my spikes, for a utility that combines arrays of test data for use with parameterised tests, gives you an idea of the level of discipline I might usually apply. Not quite production quality, but not that far off.
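To give a flavour of the kind of utility and the level of test I'm talking about, here's a simplified illustration (not the actual spike):

using System.Collections.Generic;
using System.Linq;
using NUnit.Framework;

// The kind of utility I mean: combines two arrays of test data into
// every possible pairing, for feeding into parameterised tests
public static class TestDataCombiner
{
    public static IEnumerable<object[]> Combine(object[] first, object[] second)
    {
        return first.SelectMany(f => second.Select(s => new[] { f, s }));
    }
}

[TestFixture]
public class TestDataCombinerTests
{
    // Spike-level test: higher level than I'd write for production code,
    // and it asks more than one question
    [Test]
    public void CombinesEveryValueInFirstArrayWithEveryValueInSecond()
    {
        var combinations = TestDataCombiner.Combine(
            new object[] { 1, 2 },
            new object[] { "a", "b" }).ToList();

        Assert.AreEqual(4, combinations.Count);
        Assert.IsTrue(combinations.Any(c => c[0].Equals(1) && c[1].Equals("a")));
        Assert.IsTrue(combinations.Any(c => c[0].Equals(2) && c[1].Equals("b")));
    }
}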

The spike took - in total - maybe a couple of days, and I was really grateful for the tests by the second day. In timed experiments, I've seen myself tackle much smaller problems faster when I wrote automated tests for them as I went along. Which is why, for me, that seems to be the way to go. I get done sooner, with something that could potentially be released. It leaves the door open.

Other developers may find that they get done sooner without writing automated tests. With TDD, I'm very much in my comfort zone. They may be outside it. In those instances, they probably need to be especially disciplined about throwing that code away to remove the temptation of releasing unreliable, unmaintainable code.

They could rehabilitate it, writing tests after the fact and refactoring the code to give it a production sparkle. Some people refer to this process as "spike & stabilise". But, to me, it does rather sound like "code and fix". Because, technically, that's exactly what it is. And experience - not just mine, but a mountain of hard data going back decades - strongly suggests that code and fix is the slow route to delivery.

So I'm a little skeptical, to say the least.




May 16, 2017

Learn TDD with Codemanship

My Obligatory Annual Rant About People Who Warn That You Can Take Quality Too Far Like It's An Actual Thing That Happens To Dev Teams

If you teach developers TDD, you can guarantee to bump into people who'll warn you of the dangers of taking quality too far (dun-dun-duuuuuuun!)

"We don't write the tests first because it protects us from over-testing our code", said one person recently. Ah, yes. Over-testing. A common problem in software.

"You need to be careful not to refactor your code too much", said another. And many's the time I've looked at code and thought "This program is just too easy to understand!"

I can't help recalling the time a UK software company, whose main product had literally thousands of open bugs, hired a VP of Quality and sent him around the dev teams warning them that "perfection is the enemy of good enough". Because that was their problem; the software was just too good.

It seems to still pervade our industry's culture, this idea that quality is the enemy of getting things done, despite mountains of very credible evidence that - in the vast majority of cases - the reverse is true. Most dev teams would deliver sooner if they delivered better software. "Not aiming for perfection is the enemy of getting shit done" more accurately sums up the relationship between quality and productivity in our line of work.

That's not to say that there aren't any teams who have ever taken it too far. In safety-critical software, the costs ramp up very quickly for very modest improvements in reliability. But the fact is that 99.9% of teams are so far away from this asymptote that, from where they're standing, good enough and too good are essentially the same destination.

Worry about wasting time on silly misunderstandings about the requirements. Worry about wasting time fixing avoidable coding errors. Worry about wasting time trying to hack your way through incomprehensible spaghetti code to make changes. Worry about wasting your time doing the same repeatable tasks manually over and over again.

But you very probably needn't worry about over-testing your code. Or about doing too much refactoring. Or about making the software too good. You're almost certainly not in any immediate danger of that.







April 27, 2017

Learn TDD with Codemanship

20 Dev Metrics - The Story So Far

So, we're halfway through my series 20 Dev Metrics, and this would seem like a good point to pause and reflect on the picture we're building so far.

I should probably stress - before we get into a heated debate about all of the 1,000,001 other things we could be measuring while we do software development - that my focus here is primarily on two things: sustaining the pace of innovation, and delivering reliable software.

For me, sustaining innovation is the ultimate requirements discipline. Whatever the customer asks for initially, we know from decades of experience that most of the value in what we deliver will be added as a result of feedback from using it. So the ability to change the code is critical to achieving higher value.



Increasing lead times are a symptom, and our metrics help us to pinpoint the cause. As code gets harder and harder to change, delivery slows down. We can identify key factors that increase the cost of change, and teams I've coached have reduced lead times dramatically by addressing them.

Also, when we deliver buggy code, more and more time is spent fixing bugs and less time spent adding value. It's not uncommon to find products that have multiple teams doing nothing but fixing bugs on multiple versions, with little or no new features being added. At that point, when the cost of changing a line of code is in the $hundreds, and the majority of dev effort is spent on bug fixes, your product is effectively dead.

And maybe I'm in a minority on this (sadly, I do seem to be), but I think - as software becomes more and more a part of our daily lives, and we rely on it more and more - what we deliver needs to be reliable enough to depend on. This, of course, is relative to the risks using the software presents. But I think it's fair to say that most software we use is still far from being reliable enough. So we need to up our game.



And there's considerable overlap between factors that make code reliable, and factors that impede change. In particular, code complexity increases the risks of errors as well as making the code harder to understand and change safely. And frequency and effectiveness of testing - e.g., automating manual tests - can have a dramatic impact, too.

My first 10 metrics will either reveal a virtuous circle of high reliability and a sustainable pace, or a vicious cycle of low quality and an exponentially increasing cost of change.

The interplay between the metrics is more complex than my simplistic diagrams can really communicate. In future posts, I'll be highlighting some other little networks of interacting metrics that can tell us more useful stuff about our code and the way we're creating it.







April 25, 2017

Learn TDD with Codemanship

20 Dev Metrics - 8. Test Assurance

Number 8 in my Twitter series 20 Dev Metrics is a very important one, when it comes to minimising the cost of changing code - Test Assurance.

If there's one established fact about software development, it's that the sooner we catch bugs, the cheaper they are to fix. When we're changing code, the sooner we find out if we broke it, the better. This can have such a profound effect on the cost of changing software that Michael Feathers, in his book 'Working Effectively with Legacy Code', defines 'legacy code' as code for which we have no automated tests.

The previous metric, Changes per Test Run, gives us a feel for how frequently we're running our tests - an often-forgotten factor in test assurance - but it can't tell us how effective our tests would be at catching bugs.

We want an answer to the question: "If this code was broken, would we know?" Would one or more tests fail, alerting us to the problem? The probability that our tests will catch bugs is known as "test assurance".

Opinions differ on how best to measure test assurance. Probably because it's easier to measure, a lot of teams track code coverage of tests.



For sure, code that isn't covered by tests isn't being tested at all. If only half your code's covered, then the probability of bugs being caught in the half that isn't is definitely zero.

But just because code is executed in a test, that doesn't necessarily mean it's being tested. Arguably a more meaningful measure of assurance can be calculated by doing mutation testing - deliberately introducing errors into lines of code and then seeing if any tests fail.
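For example - a hand-rolled mutation, just to illustrate the principle:

public static class AccountRules
{
    // Original code under test
    public static bool CanWithdraw(decimal balance, decimal amount)
    {
        return amount <= balance;
    }

    // The same logic with a deliberately introduced error: the comparison is flipped.
    // If we swapped this in and no test failed, our tests aren't really checking
    // this behaviour, and assurance here is low.
    public static bool CanWithdrawMutant(decimal balance, decimal amount)
    {
        return amount >= balance;
    }
}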



Mutation testing's a bit like committing burglaries to measure how effective the local police are at detecting crimes. Be careful with automated mutation testing tools, though; they can throw up false positives (e.g., when a convergent iterative loop just takes a bit longer to find the right answer after changing its seed value). But most of these tools are configurable, so you can learn which mutations work and which tend to throw up false positives, and adapt your configuration accordingly.






April 24, 2017

Learn TDD with Codemanship

20 Dev Metrics - 7. Changes per Test Run

Day seven in my Twitter series 20 Dev Metrics, and we turn our attention to frequency of testing. The economics of software development are very clear that the sooner we catch problems, the cheaper they're fixed. Therefore, the sooner we test the software after making a change, the better.

A good metric would be "Mean Time to Re-testing", but this is a difficult one to collect (you'd essentially need to write custom plug-ins for your IDE and test runner and be constantly monitoring).

A decent compromise is Changes per Test Run, which gives us a rough idea of how much testing we do in a given period vs. how much the code changes. A team attempting to do Waterfall delivery would have a very high ratio, because they leave testing until very late and do it manually, maybe just a handful of times. A team that runs automated tests overnight would have a lower ratio. And a team doing TDD would have the lowest ratio.



To calculate this metric, we need to take the measure for code churn we used in a previous dev metric (tools are available for Git, Subversion, CVS etc. to collect this), and tally each time we execute regression tests - be it manual testing or automated. If it's automated, and we're using an open source testing tool, we can usually adapt the test runner to keep score. A custom JUnit runner, for example, could ping a metrics data server on the network asynchronously, which just keeps a running total for that day/week/month.
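The arithmetic itself is trivial. A sketch, assuming the churn figure comes from your version control tooling and the test run tally from your adapted runner (both inputs are whatever your own tooling provides):

public static class ChangesPerTestRun
{
    // e.g. 1,200 lines churned in a week with 400 test runs = 3 changes per test run
    public static double Calculate(int linesOfCodeChurned, int testRunsInSamePeriod)
    {
        if (testRunsInSamePeriod == 0)
        {
            // No test runs at all in the period: the ratio is effectively unbounded
            return double.PositiveInfinity;
        }
        return (double)linesOfCodeChurned / testRunsInSamePeriod;
    }
}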



That passes no comment at all, of course, on how effective those tests are at catching bugs, which we'll leave for another day.




April 23, 2017

Learn TDD with Codemanship

The Win-Win-Win of Clean Code

A conversation I had with a development team last week has inspired me to write a short post about the Win-Win-Win that Clean Code can offer us.

Code that is easier to understand, made of simpler parts, low in duplication and supported by good, fast-running automated tests tends to be easier to change and cheaper to evolve.

Code that is easier to understand, made of simpler parts, low in duplication and supported by good, fast-running automated tests also tends to be more reliable.

And code that is easier to understand, made of simpler parts, low in duplication and supported by good, fast-running automated tests - it turns out - tends to require less effort to get working.

It's a tried-and-tested marketing tagline for many products in software development - better, faster, cheaper. But in the case of Clean Code, it's actually true.

It's politically expedient to talk in terms of "trade-offs" when discussing code quality. But, in reality, show me the team who made their code too good. With very few niche exceptions - e.g., safety-critical code - teams discover that when they take more care over code quality, they don't pay a penalty for it in terms of productivity.

Unless, of course, they're new to the practices and techniques that improve code quality, like unit testing, TDD, refactoring, and all that lovely stuff. Then they have to sacrifice some productivity to the learning curve.

Good. I'm glad we had this little chat.



Learn TDD with Codemanship

20 Dev Metrics - 6. Complexity (Likelihood of Failure)

The sixth in my Twitter series 20 Dev Metrics has proven, over the decades, to be a reasonably good predictor of which code is most likely to have bugs - Complexity.

Code, like all machines, has a greater probability of going wrong when there are more things in it that could be wrong - more ways of being wrong. It really is as straightforward as that.

There are different ways of measuring code complexity, and they all have their merits. Size is an obvious one. Two lines of code have twice as many ways of being wrong as one. It's not a linear relationship, though. Four lines of code aren't just twice as likely as two lines of code to be buggy; they could be twice as likely again, depending on the extent to which the lines of code interact with each other. To illustrate with a metaphor: four people are much more than twice as likely as two people to have a disagreement. The likelihood of failure grows exponentially with code size.

Another popular measure is cyclomatic complexity. This tells us how many linearly independent paths through a body of code exist, and therefore how many test cases it might take to cover all of its branches.
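For example - a made-up method, annotated with how the counting works:

public static class ShippingCalculator
{
    // Cyclomatic complexity = 3: one for the method itself,
    // plus one for each decision point (the two 'if' statements).
    // So it takes at least three test cases to cover all the branches.
    public static decimal ShippingCost(decimal orderTotal, bool isPriority)
    {
        decimal cost = 4.99m;
        if (orderTotal > 50m)
        {
            cost = 0m;
        }
        if (isPriority)
        {
            cost += 10m;
        }
        return cost;
    }
}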



Less popular, but still useful, is Halstead Volume, which was used back in the CMMi days as a predictor of the maintenance cost of code. It's a bit more sophisticated than measuring lines of code, calculating the size of 'vocabulary' that the code uses, so a line of code that employs more variables - and therefore more ways of being wrong - would have a greater Halstead Volume.

All of these metrics are easily calculated, and many tools are available to collect them. They make most sense at the method level, since that's the smallest unit of potential test automation. But complexity aggregated at the class level can indicate classes that do too much and need splitting up.





April 22, 2017

Learn TDD with Codemanship

20 Dev Metrics - 5. Impact of Failure

The fifth in my Twitter series 20 Dev Metrics builds on a previous metric, Feature Usage, to estimate the potential impact on end users when a piece of code fails.

Impact of Failure can be calculated by determining the critical path through the code (the call stack, basically) in a usage scenario and then assigning the value of Feature Usage to each method (or block of code, if you want to get down to that level of detail).



We do this for each usage scenario or use case, and add Feature Usage to methods every time they're in the critical path for the current scenario.

So, if a method is invoked in many high-traffic scenarios, our score for Impact of Failure will be very high.

What you'll get at the end is a kind of "heat map" of your code, showing which parts of your code could do the most damage if they broke. This can help you to target testing and other quality assurance activities at the highest-value code more effectively.

This is, of course, a non-trivial metric to collect. You'll need a way to record what methods get invoked when, say, you run your customer tests. And each customer test will need an associated value for Feature Usage. Ideally, this would be a feature of customer testing tools. But for now, you'll have some DIY tooling to do, using whatever instrumentation and/or meta-programming facilities your language has available. You could also use a code coverage reporting tool, generating reports one customer test at a time to see which code was executed in that scenario.
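To sketch the aggregation involved - assuming you've already gathered, by whatever means, the set of methods executed in each scenario and a Feature Usage figure for it (the types here are invented for illustration):

using System.Collections.Generic;

public class ScenarioRun
{
    public string Name;
    public int FeatureUsage;                 // e.g. uses per day, from the Feature Usage metric
    public HashSet<string> MethodsExecuted;  // from instrumentation or per-test coverage reports
}

public static class ImpactOfFailure
{
    // Every method in a scenario's critical path accumulates that scenario's Feature Usage
    public static Dictionary<string, long> Calculate(IEnumerable<ScenarioRun> runs)
    {
        var impact = new Dictionary<string, long>();
        foreach (var run in runs)
        {
            foreach (var method in run.MethodsExecuted)
            {
                if (!impact.ContainsKey(method))
                {
                    impact[method] = 0;
                }
                impact[method] += run.FeatureUsage;
            }
        }
        return impact;
    }
}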

In the next metric, we'll look at another factor in code risk that we can use to help us really pinpoint QA efforts.