April 27, 2017

Learn TDD with Codemanship

20 Dev Metrics - The Story So Far

So, we're half way into my series 20 Dev Metrics, and this would seem like a good point to pause and reflect on the picture we're building so far.

I should probably stress - before we get into a heated debate about all of the 1,000,001 other things we could be measuring while we do software development - that my focus here is primarily on two things: sustaining the pace of innovation, and delivering reliable software.

For me, sustaining innovation is the ultimate requirements discipline. Whatever the customer asks for initially, we know from decades of experience that most of the value in what we deliver will be added as a result of feedback from using it. So the ability to change the code is critical to achieving higher value.



Increasing lead times is a symptom, and our metrics help us to pinpoint the cause. As code gets harder and harder to change, delivery slows down. We can identify key factors that increase the cost of change, and teams I've coached have reduced lead times dramatically by addressing them.

Also, when we deliver buggy code, more and more time is spent fixing bugs and less time spent adding value. It's not uncommon to find products that have multiple teams doing nothing but fixing bugs on multiple versions, with little or no new features being added. At that point, when the cost of changing a line of code is in the $hundreds, and the majority of dev effort is spent on bug fixes, your product is effectively dead.

And maybe I'm in a minority on this (sadly, I do seem to be), but I think - as software becomes more and more a part of our daily lives, and we rely on it more and more - what we deliver needs to be reliable enough to depend on. This, of course, is relative to the risks using the software presents. But I think it's fair to say that most software we use is still far from being reliable enough. So we need to up our game.



And there's considerable overlap between factors that make code reliable, and factors that impede change. In particular, code complexity increases the risks of errors as well as making the code harder to understand and change safely. And frequency and effectiveness of testing - e.g., automating manual tests - can have a dramatic impact, too.

My first 10 metrics will either reveal a virtuous circle of high reliability and a sustainable pace, or a vicious cycle of low quality and an exponentially increasing cost of change.

The interplay between the metrics is more complex than my simplistic diagrams can really communicate. In future posts, I'll be highlighting some other little networks of interacting metrics that can tell us more useful stuff about our code and the way we're creating it.







April 25, 2017

Learn TDD with Codemanship

20 Dev Metrics - 8. Test Assurance

Number 8 in my Twitter series 20 Dev Metrics is a very important one, when it comes to minimising the cost of changing code - Test Assurance.

If there's one established fact about software development, it's that the sooner we catch bugs, the cheaper they are to fix. When we're changing code, the sooner we find out if we broke it, the better. This can have such a profound effect on the cost of changing software that Michael Feathers, in his book 'Working Effectively with Legacy Code', defines 'legacy code' as code for which we have no automated tests.

The previous metric, Changes per Test Run, gives us a feel for how frequently we're running our tests - an often-forgotten factor in test assurance - but it can't tell us how effective our tests would be at catching bugs.

We want an answer to the question: "If this code was broken, would we know?" Would one or more tests fail, alerting us to the problem? The probability that our tests will catch bugs is known as "test assurance".

Opinions differ on how best to measure test assurance. Probably because it's easier to measure, a lot of teams track code coverage of tests.



For sure, code that isn't covered by tests isn't being tested at all. If only half your code's covered, then the probability of catching bugs in the half that isn't covered is definitely zero.

But just because code is executed in a test, that doesn't necessarily mean it's being tested. Arguably a more meaningful measure of assurance can be calculated by doing mutation testing - deliberately introducing errors into lines of code and then seeing if any tests fail.
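To make that concrete, here's a minimal, invented Java/JUnit example. The Account class and its tests are hypothetical; the point is simply that a mutant like changing ">=" to ">" should cause at least one test to fail - and if it doesn't, that line isn't really being tested.

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;

    import org.junit.Test;

    // Hypothetical production code, kept in the same file for brevity.
    class Account {
        private final double balance;

        Account(double balance) {
            this.balance = balance;
        }

        boolean canWithdraw(double amount) {
            // A mutation tool (or a mischievous colleague) might change ">=" to ">".
            return balance >= amount;
        }
    }

    public class AccountTest {
        @Test
        public void canWithdrawExactBalance() {
            // This test "kills" the ">= to >" mutant: withdrawing the exact
            // balance must be allowed, and the mutated code would say it isn't.
            assertTrue(new Account(100.0).canWithdraw(100.0));
        }

        @Test
        public void cannotWithdrawMoreThanBalance() {
            assertFalse(new Account(100.0).canWithdraw(100.01));
        }
    }

If every test still passes with the mutant in place, you've learned something your coverage report couldn't tell you.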



Mutation testing's a bit like committing burglaries to measure how effective the local police are at detecting crimes. Be careful with automated mutation testing tools, though; they can throw up false positives (e.g., when a convergent iterative loop just takes a bit longer to find the right answer after changing its seed value). But most of these tools are configurable, so you can learn which mutations work, which tend to throw up false positives, and adapt your configuration accordingly.






April 24, 2017

Learn TDD with Codemanship

20 Dev Metrics - 7. Changes per Test Run

Day seven in my Twitter series 20 Dev Metrics, and we turn our attention to frequency of testing. The economics of software development are very clear that the sooner we catch problems, the cheaper they're fixed. Therefore, the sooner we test the software after we change it, the better.

A good metric would be "Mean Time to Re-testing", but this is a difficult one to collect (you'd essentially need to write custom plug-ins for your IDE and test runner and be constantly monitoring).

A decent compromise is Changes per Test Run, which gives us a rough idea of how much code changes in a given period vs. how much testing we do. A team attempting to do Waterfall delivery would have a very high ratio, because they leave testing until very late and do it manually, maybe just a handful of times. A team that runs automated tests overnight would have a lower ratio. And a team doing TDD would have the lowest ratio.



To calculate this metric, we need to take the measure for code churn we used in a previous dev metric (tools are available for Git, Subversion, CVS, etc. to collect this), and tally each time we execute our regression tests - be it manual testing or automated. If it's automated, using an open source testing tool, we can usually adapt the test runner to keep score. A custom JUnit runner, for example, could ping a metrics data server on the network asynchronously, which just keeps a running total for that day/week/month.
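For illustration, here's a rough sketch of what that might look like with a JUnit 4 RunListener. The metrics endpoint (metrics.example.local) and the wiring are invented for the example; the real version would depend on whatever collects your numbers.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    import org.junit.runner.JUnitCore;
    import org.junit.runner.Result;
    import org.junit.runner.notification.RunListener;

    // Sketch: tally every execution of the test suite by pinging a
    // (hypothetical) metrics server, so test runs can be compared against
    // code churn for the same day/week/month.
    public class TestRunCounter extends RunListener {

        // Invented endpoint - substitute whatever keeps your running totals.
        private static final URI METRICS_URI =
                URI.create("http://metrics.example.local/test-runs");

        private final HttpClient client = HttpClient.newHttpClient();

        @Override
        public void testRunFinished(Result result) {
            HttpRequest ping = HttpRequest.newBuilder(METRICS_URI)
                    .POST(HttpRequest.BodyPublishers.ofString(
                            "tests=" + result.getRunCount()))
                    .build();
            // Fire and forget, so the test run isn't slowed down.
            client.sendAsync(ping, HttpResponse.BodyHandlers.discarding());
        }

        public static void main(String[] args) throws ClassNotFoundException {
            // Pass test class names on the command line to run them with
            // the listener attached.
            JUnitCore core = new JUnitCore();
            core.addListener(new TestRunCounter());
            for (String className : args) {
                core.run(Class.forName(className));
            }
        }
    }

Divide your code churn figure for the period by the total this collects, and you've got Changes per Test Run.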



That passes no comment at all, of course, on how effective those tests are at catching bugs, which we'll leave for another day.




April 23, 2017

Learn TDD with Codemanship

The Win-Win-Win of Clean Code

A conversation I had with a development team last week has inspired me to write a short post about the Win-Win-Win that Clean Code can offer us.

Code that is easier to understand, made of simpler parts, low in duplication and supported by good, fast-running automated tests tends to be easier to change and cheaper to evolve.

Code that is easier to understand, made of simpler parts, low in duplication and supported by good, fast-running automated tests also tends to be more reliable.

And code that is easier to understand, made of simpler parts, low in duplication and supported by good, fast-running automated tests - it turns out - tends to require less effort to get working.

It's a tried-and-tested marketing tagline for many products in software development - better, faster, cheaper. But in the case of Clean Code, it's actually true.

It's politically expedient to talk in terms of "trade-offs" when discussing code quality. But, in reality, show me the team who made their code too good. With very few niche exceptions - e.g., safety-critical code - teams discover that when they take more care over code quality, they don't pay a penalty for it in terms of productivity.

Unless, of course, they're new to the practices and techniques that improve code quality, like unit testing, TDD, refactoring, and all that lovely stuff. Then they have to sacrifice some productivity to the learning curve.

Good. I'm glad we had this little chat.



Learn TDD with Codemanship

20 Dev Metrics - 6. Complexity (Likelihood of Failure)

The sixth in my Twitter series 20 Dev Metrics has proven, over the decades, to be a reasonably good predictor of which code is most likely to have bugs - Complexity.

Code, like all machines, has a greater probability of going wrong when there are more things in it that could be wrong - more ways of being wrong. It really is as straightforward as that.

There are different ways of measuring code complexity, and they all have their merits. Size is an obvious one. Two lines of code tend to have twice as many ways of being wrong as one. It's not a linear relationship, though. Four lines of code isn't just twice as likely as two lines to be buggy; it could be more than that, depending on the extent to which the lines of code interact with each other. To illustrate with a metaphor: four people are much more than twice as likely as two people to have a disagreement. The likelihood of failure grows exponentially with code size.

Another popular measure is cyclomatic complexity. This tells us how many linearly independent paths exist through a body of code, and therefore roughly how many test cases it might take to exercise them all.
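As a quick illustration - using an invented method - here's how the number is arrived at: count the decision points and add one.

    // Invented example: three decision points (an if, an else if, and a
    // second if), so cyclomatic complexity = 3 + 1 = 4 - four linearly
    // independent paths, each of which really wants its own test case.
    public class ShippingCalculator {

        public double shippingCost(double orderValue, boolean isPriority) {
            double cost;
            if (orderValue >= 100.0) {          // decision 1
                cost = 0.0;
            } else if (orderValue >= 50.0) {    // decision 2
                cost = 2.50;
            } else {
                cost = 4.99;
            }
            if (isPriority) {                   // decision 3
                cost += 5.00;
            }
            return cost;
        }
    }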



Less popular, but still useful, is Halstead Volume, which was used back in the CMMI days as a predictor of the maintenance cost of code. It's a bit more sophisticated than counting lines of code, being calculated from the size of the 'vocabulary' the code uses, so a line of code that employs more variables - and therefore has more ways of being wrong - would have a greater Halstead Volume.
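For the curious, Halstead Volume is length x log2(vocabulary), where length counts every operator and operand and vocabulary counts the distinct ones. Here's a back-of-the-envelope sketch; exact counting conventions vary from tool to tool, so treat the numbers as illustrative.

    // Rough illustration for the statement "cost = price * quantity + shipping":
    //   distinct operators n1 = 3  (=, *, +)
    //   distinct operands  n2 = 4  (cost, price, quantity, shipping)
    //   total operators    N1 = 3, total operands N2 = 4
    //   vocabulary n = n1 + n2 = 7, length N = N1 + N2 = 7
    //   Volume = N * log2(n) = 7 * log2(7), roughly 19.65
    public class HalsteadVolume {

        public static double volume(int distinctOperators, int distinctOperands,
                                    int totalOperators, int totalOperands) {
            int vocabulary = distinctOperators + distinctOperands;
            int length = totalOperators + totalOperands;
            return length * (Math.log(vocabulary) / Math.log(2));
        }

        public static void main(String[] args) {
            System.out.println(volume(3, 4, 3, 4)); // roughly 19.65
        }
    }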

All of these metrics are easily calculated, and many tools are available to collect them. They make most sense at the method level, since that's the smallest unit of potential test automation. But complexity aggregated at the class level can indicate classes that do too much and need splitting up.





April 22, 2017

Learn TDD with Codemanship

20 Dev Metrics - 5. Impact of Failure

The fifth in my Twitter series 20 Dev Metrics builds on a previous metric, Feature Usage, to estimate the potential impact on end users when a piece of code fails.

Impact of Failure can be calculated by determining the critical path through the code (the call stack, basically) in a usage scenario and then assigning the value of Feature Usage to each method (or block of code, if you want to get down to that level of detail).



We do this for each usage scenario or use case, and add Feature Usage to methods every time they're in the critical path for the current scenario.

So, if a method is invoked in many high-traffic scenarios, our score for Impact of Failure will be very high.

What you'll get at the end is a kind of "heat map" of your code, showing which parts of your code could do the most damage if they broke. This can help you to target testing and other quality assurance activities at the highest-value code more effectively.

This is, of course, a non-trivial metric to collect. You'll need a way to record what methods get invoked when, say, you run your customer tests. And each customer test will need an associated value for Feature Usage. Ideally, this would be a feature of customer testing tools. But for now, you'll have some DIY tooling to do, using whatever instrumentation and/or meta-programming facilities your language has available. You could also use a code coverage reporting tool, generating reports one customer test at a time to see which code was executed in that scenario.
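To make the aggregation step a bit more concrete, here's a sketch in Java. Everything in it - the scenario figures, the method names - is invented; the assumption is that you already have a Feature Usage number per scenario and, from your coverage tooling, the list of methods each scenario executes.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the Impact of Failure aggregation: every method on a
    // scenario's critical path picks up that scenario's Feature Usage score.
    public class ImpactOfFailure {

        // method signature -> accumulated impact score
        private final Map<String, Double> impact = new HashMap<>();

        public void recordScenario(double featureUsage, List<String> methodsOnCriticalPath) {
            for (String method : methodsOnCriticalPath) {
                impact.merge(method, featureUsage, Double::sum);
            }
        }

        public Map<String, Double> heatMap() {
            return impact;
        }

        public static void main(String[] args) {
            ImpactOfFailure metric = new ImpactOfFailure();

            // Hypothetical usage figures: "place order" happens 5,000 times
            // a week, "cancel order" 200 times a week.
            metric.recordScenario(5000, List.of("OrderService.place", "PaymentGateway.charge"));
            metric.recordScenario(200, List.of("OrderService.cancel", "PaymentGateway.charge"));

            // PaymentGateway.charge is on both critical paths, so it scores
            // 5,200 - the code that would do the most damage if it broke.
            metric.heatMap().forEach((method, score) ->
                    System.out.println(method + " -> " + score));
        }
    }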

In the next metric, we'll look at another factor in code risk that we can use to help us really pinpoint QA efforts.




January 18, 2017

Learn TDD with Codemanship

How Long Would Your Organisation Last Without Programmers?

A little straw poll I did recently on Twitter has proved to be especially timely after a visit to the accident & emergency ward at my local hospital (don't worry - I'm fine).



It struck me just how reliant hospitals have become on largely bespoke IT systems (that's "software" to me and you). From the moment you walk in to see the triage nurse, there's software - you're surrounded by it. The workflow of A&E is carefully controlled via a system they all access. There are computerised machines that take your blood pressure, monitor your heart, peer into your brain and build detailed 3D models, and access your patient records so they don't accidentally cut off the wrong leg.

From printing out prescriptions to writing notes to your family doctor, it all involves computers now.

What happens if all the software developers mysteriously disappeared overnight? There'd be nobody to fix urgent bugs. Would the show-stoppers literally stop the show?

I can certainly see how that would happen in, say, a bank. And I've worked in places where - without 24/7 bug-fixing support - they'd be completely incapable of processing their overnight orders, creating a massive and potentially un-shiftable backlog that could crush their business in a few weeks.

And I'm aware that big organisations have "disaster recovery" plans. I've been privy to quite a few in my lofty position as Chief Technical Arguer in some very large businesses. But none of the DR plans I've seen has ever asked "what happens if there's nobody to fix it?"

Ultimately, DR is all about coping in the short term, and getting business-as-usual (or some semblance of it) up and running as quickly as possible. It can delay, but not avoid, the effects of having nobody who can write or fix code.

Smart deployers, of course, can just roll back a bad release to the last one that worked, ensuring continuity... for a while. But I know business code: even when it's working, it's often riddled with unknown bugs, waiting to go off, like little business-killing landmines. I've fixed bugs in COBOL that were written in the 1960s.

Realistically, rolling back your systems by potentially decades is not an option. You probably don't even have access to that old code. Or, if you do, someone will have to re-type it all in from code listings kept in cupboards.

And even if you could revert your way back to a reliable system with some longevity left in it, the inability to change and adapt those systems to meet future needs would soon start to eat away at your organisation from the inside.

It's food for thought.


October 15, 2016

Learn TDD with Codemanship

If Your Code Was Broken, Would You Know?

I've been running a little straw poll among friends and clients, as well as on social media, to get a feel for what percentage of development teams routinely (or continuously) measure the level of assurance their automated regression tests give them.

For me, it's a fundamental question: if my code was broken, would I know?

The straw poll suggests that about 90% of teams don't ask that question often, and 80% don't ask it at all.

The whole point of automated tests is to give us early, cheap detection of new bugs that we might have introduced as we change the code. So profound is their impact, potentially, that Michael Feathers - in his book Working Effectively With Legacy Code - defines "legacy code" as code for which we have no automated tests.

I've witnessed first-hand the impact automating regression tests can have on delivery schedules and development costs. Which is why that question is often on my mind.

The best techniques I know for "testing your tests" are:

1. A combination of the "golden rule" of Test-Driven Development (only write source code if a failing test requires it, so all the code is executed by tests), and running tests to make sure their assertions fail when the result is wrong.

2. Mutation testing - deliberately introducing programming errors to see if the tests catch them

I put considerable emphasis on the first practice. As far as I'm concerned, it's fundamental to TDD, and a habit test-driven developers need to get into. Before you write the simplest code to pass a test, make sure it's a good test. If the answer was wrong, would this test fail?

The second practice, mutation testing, is rarely applied by teams. Which is a shame, because it's a very powerful technique. Code coverage tools can only tell us what code definitely isn't being executed by tests. Mutation testing tells us what code isn't being meaningfully tested, even if it is being executed by tests. It specifically asks "If I broke this line of code, would any tests fail?"
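Here's a small, invented JUnit example of the difference. Both tests execute the production code, so a coverage tool is satisfied either way - but only one of them would fail if the calculation was broken.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class DiscountTest {

        @Test
        public void executesButAssertsNothing() {
            // Gives 100% coverage of applyDiscount(), zero assurance. If this
            // were the only test, every mutant in applyDiscount() would survive.
            Discount.applyDiscount(100.0, 0.2);
        }

        @Test
        public void checksTheAnswer() {
            // 20% off 100.0 must be 80.0 - break the calculation and this fails.
            assertEquals(80.0, Discount.applyDiscount(100.0, 0.2), 0.001);
        }
    }

    // Hypothetical production code under test.
    class Discount {
        static double applyDiscount(double price, double rate) {
            return price * (1.0 - rate);
        }
    }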

The tools for automated mutation testing have improved greatly in recent years, and support across programming languages is growing. If you genuinely want to know how much assurance your tests can give you - i.e., how much confidence you can have that the code really works - then you need to give mutation testing a proper look.

Here are some mutation testing tools that might be worth having a play with:

Java - PIT

C# - VisualMutator

C/C++ - Plextest

Ruby - Mutant

Python - Cosmic Ray

PHP - Humbug





September 13, 2016

Learn TDD with Codemanship

4 Things You SHOULDN'T Do When The Schedule's Slipping

It takes real nerve to do the right thing when your delivery date's looming and you're behind on your plan.

Here are four things you should really probably avoid when the schedule's slipping:

1. Hire more developers

It's been over 40 years since the publication of Fred Brooks' 'The Mythical Man-Month'. This means that our industry has known for almost my entire life that adding developers to a late project makes it later.

Not only is this borne out by data on team size vs. productivity, but we also have a pretty good idea what the causal mechanism is.

Like climate change, people who reject this advice should not be called "skeptics" any more. In the face of the overwhelming evidence, they're Small Team Deniers.

Hiring more devs when the schedule's slipping is like prescribing cigarettes, boxed sets and bacon for a patient with high blood pressure.

2. Cut corners

It's still counterintuitive for most software managers, but the relationship between software quality and the time and cost of delivery is not what most of us think it is.

Common sense might lead us to believe that more reliable software takes longer, but the mountain of industry data on this clearly shows the opposite in the vast majority of cases.

To a point - and it's a point 99% of teams are in no danger of crossing - it actually takes less effort to deliver more reliable software.

Again, the causal mechanism for this is well understood. And, again, anyone who rejects the evidence is not a "skeptic"; they're a Defect Prevention Denier.

The way to go faster on 99% of projects is to slow down, and take more care.

3. Work longer hours

Another management myth that's been roundly debunked by the evidence is that, when a software delivery schedule's slipping significantly, teams can get back on track by working longer hours.

The data very clearly shows that - for most kinds of work - longer hours is a false economy. But it's especially true for writing software, which requires a level of concentration and focus that most jobs don't.

Short spurts of extra effort - maybe the odd weekend or late night - can make a small difference in the short term, but day-after-day, week-after-week overtime will burn your developers out faster than you can say "get a life". They'll make stupid, easily avoidable mistakes. And, as we've seen, mistakes cost exponentially more to fix than to avoid. This is why teams who routinely work overtime tend to have lower overall productivity: they're too busy fighting their own self-inflicted fires.

You can't "cram" software development. Like your physics final exams, if you're nowhere near ready a week before, then you're not gong to be ready, and no amount of midnight oil and caffeine is going to fix that.

You'll get more done with teams who are rested, energised, feeling positive, and focused.

4. Bribe the team to hit the deadline

Given the first three points we've covered here, promising to shower the team with money and other rewards to hit a deadline is just going to encourage them to make those mistakes for you.

Rewarding teams for hitting deadlines fosters a very 1-dimensional view of software development success. It places extra pressure on developers to do the wrong things: to grow the size of their teams, to cut corners, and to work silly hours. It therefore has a tendency to make things worse.

The standard wheeze, of course, is for teams to pretend that they hit the deadline by delivering something that looks like finished software. The rot under the bonnet quickly becomes apparent when the business then expects a second release. Now the team are bogged down in all the technical debt they took on for the first release, often to the extent that new features and change requests are out of the question.

Yes, we hit the deadline. No, we can't make it any better. You want changes? Then you'll have to pay us to do it all over again.


Granted, it takes real nerve, when the schedule's slipping and the customer is baying for blood, to keep the team small, to slow down and take more care, and to leave the office at 5pm.

Ultimately, the fate of teams rests with the company cultures that encourage and reward doing the wrong thing. Managers get rewarded for managing bigger teams. Developers get rewarded for being at their desk after everyone else has gone home, and appearing to hit deadlines. Perversely, as an industry, it's easier to rise to the top by doing the wrong thing in these situations. Until we stop rewarding that behaviour, little will change.








May 25, 2016

Learn TDD with Codemanship

How Many Bugs In Your Code *Really*?

A late night thought, courtesy of the very noisy foxes outside my window.

How many bugs are lurking in your code that you don't know about?

Your bug tracking database may suggest you have 1 defect per thousand lines of code (KLOC), but maybe that's because your tests aren't very thorough. Or maybe it's because you deter users from reporting bugs. I've seen it all, over the years.

But if you want to get a rough idea of how many bugs there are really, you can use a kind of mutation testing.

Create a branch of your code and deliberately introduce 10 bugs. Do your usual testing (manual, automated, whatever it entails), and keep an eye on bugs that get reported. Stop the clock at the point you'd normally be ready to ship it. (But if shipping it *is* your usual way of testing, then *start* the clock there and wait a while for users to report bugs.)

How many of those deliberate bugs get reported? If all 10 do, then the bug count in your database is probably an accurate reflection of the actual number of bugs in the code.

If 5 get reported, then double the bug count in your database. If your tracking says 1 bug/KLOC, you probably have about 2/KLOC.

If none get reported, then your code is probably riddled with bugs you don't know about (or have chosen to ignore.)
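If you want to put a number on it, the scaling rule above is simply your known defect density multiplied by seeded bugs over seeded bugs caught. A trivial sketch, with hypothetical numbers:

    // Hypothetical numbers for the bug-seeding experiment described above.
    public class DefectEstimate {

        public static void main(String[] args) {
            int seededBugs = 10;         // bugs deliberately introduced
            int seededCaught = 5;        // how many of them testing reported
            double knownDensity = 1.0;   // defects per KLOC in your tracker

            double estimatedDensity = knownDensity * seededBugs / seededCaught;
            System.out.println("Estimated actual defect density: "
                    + estimatedDensity + " per KLOC"); // 2.0 per KLOC
        }
    }

And if seededCaught is zero, the estimate is effectively unbounded - which is the "riddled with bugs you don't know about" case.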