April 27, 2017

Learn TDD with Codemanship

20 Dev Metrics - The Story So Far

So, we're halfway through my series 20 Dev Metrics, and this seems like a good point to pause and reflect on the picture we've built so far.

I should probably stress - before we get into a heated debate about all of the 1,000,001 other things we could be measuring while we do software development - that my focus here is primarily on two things: sustaining the pace of innovation, and delivering reliable software.

For me, sustaining innovation is the ultimate requirements discipline. Whatever the customer asks for initially, we know from decades of experience that most of the value in what we deliver will be added as a result of feedback from using it. So the ability to change the code is critical to achieving higher value.



Increasing lead times are a symptom, and our metrics help us to pinpoint the cause. As code gets harder and harder to change, delivery slows down. We can identify key factors that increase the cost of change, and teams I've coached have reduced lead times dramatically by addressing them.

Also, when we deliver buggy code, more and more time is spent fixing bugs and less time spent adding value. It's not uncommon to find products that have multiple teams doing nothing but fixing bugs on multiple versions, with little or no new features being added. At that point, when the cost of changing a line of code is in the $hundreds, and the majority of dev effort is spent on bug fixes, your product is effectively dead.

And maybe I'm in a minority on this (sadly, I do seem to be), but I think - as software becomes more and more a part of our daily lives, and we rely on it more and more - what we deliver needs to be reliable enough to depend on. This, of course, is relative to the risks using the software presents. But I think it's fair to say that most software we use is still far from being reliable enough. So we need to up our game.



And there's considerable overlap between factors that make code reliable, and factors that impede change. In particular, code complexity increases the risks of errors as well as making the code harder to understand and change safely. And frequency and effectiveness of testing - e.g., automating manual tests - can have a dramatic impact, too.

My first 10 metrics will either reveal a virtuous circle of high reliability and a sustainable pace, or a vicious cycle of low quality and an exponentially increasing cost of change.

The interplay between the metrics is more complex than my simplistic diagrams can really communicate. In future posts, I'll be highlighting some other little networks of interacting metrics that can tell us more useful stuff about our code and the way we're creating it.







April 26, 2017

Learn TDD with Codemanship

20 Dev Metrics - 10. Duplication

Number 10 in my series 20 Dev Metrics is another contributor to the rising cost of changing code: Duplication.

When we repeat code, we multiply the cost of changing any of that common logic. We also potentially multiply instances of the same bug. (It's not like in your school exams, where repeating an error you already made doesn't cost you any more marks. Repeated bugs hurt just as much.)

For this reason, we seek to minimise code duplication - except when it makes the code easier to understand.



Thankfully, we don't have to go trawling through our projects, comparing every block of code to every other block of code. There are tools that can do this for us, like ConQAT and Simian.
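If you're curious what such tools are doing under the hood, here's a very naive sketch of the idea in Java: hash every run of consecutive, whitespace-stripped lines and flag any block that appears more than once. (Real tools like Simian also normalise identifiers and literals; this is just to illustrate the principle, not their actual algorithms.)

    import java.util.*;

    // Naive duplicate-block detection: treat every run of blockSize consecutive,
    // whitespace-stripped lines as a candidate block, and report any block that
    // occurs more than once, keyed by the line numbers where it starts.
    public class DuplicateFinder {

        public static Map<String, List<Integer>> findDuplicates(List<String> lines, int blockSize) {
            Map<String, List<Integer>> blockStartLines = new HashMap<>();
            for (int i = 0; i <= lines.size() - blockSize; i++) {
                StringBuilder block = new StringBuilder();
                for (int j = i; j < i + blockSize; j++) {
                    block.append(lines.get(j).trim()).append('\n');
                }
                blockStartLines.computeIfAbsent(block.toString(), k -> new ArrayList<>()).add(i + 1);
            }
            blockStartLines.values().removeIf(starts -> starts.size() < 2); // keep repeats only
            return blockStartLines;
        }
    }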



The most interesting thing about duplicate code is the hints it can give us about generalisations and abstractions that would improve our designs, so when we're refactoring, duplication is a good thread to pull on.




Learn TDD with Codemanship

20 Dev Metrics - 9. Readability

The 9th metric in my Twitter series 20 Dev Metrics is as important as it is contentious. Readability of code is a key factor in how easy it will be to change. Changing code we don't understand is like tinkering with car engines when we don't know how they work. You're very likely to end up with something broken.

But, although there's pretty much universal acknowledgement of how critical readability is, opinions vary widely on what makes code readable and how we can assess its comprehensibility. Students may be told by their programming lecturers to write lots of helpful comments - the more comments, the easier the code is to understand.

Practitioners, on the other hand, may view over-reliance on comments as a sign that the code is not as self-explanatory as it could be, and seek to refactor the code to eliminate the need for them.

And how can we know if code is more or less readable?

By far the best indicator of how easy code is to understand is whether other programmers can actually understand it. Well, obviously. My favoured method is the "code pop quiz": before code can be committed, other developers have to answer some basic questions about it to show they understand what it does and how it works. As with usability testing, when they can't answer a question, that indicates a failing in the code's ability to explain itself.



This is manual testing, though. And manual testing doesn't scale well. (Although many would argue that, as with inspections, making time for it would pay dividends later.)

Are there any automated - or automatable - ways we could at least use to target code that's more likely to be difficult to understand? Yes, and here are a few suggestions:

1. Complexity - more complex code tends to be harder to understand (tools available)

2. Flesch-Kincaid Reading Ease - is a metric often applied to teaching materials (tools available)

3. Keyword Density - is another metric applied to all kinds of texts (tools available)

4. Requirements-Code Keyword Overlap - can be calculated to show how closely the language of your code follows the language of the requirements specs (build your own tool)

5. Density of Comments - because, despite what your lecturer may have told you, comments really are an indication of unreadable code, and have their own maintenance overhead. (tools available)

6. Formatting - yes, really. The way we format code can have a pronounced effect on its readability. Don't believe me? Remove all the whitespace from some Java code and then show it to a colleague. (Automatable)

7. Conformance to Naming Conventions - can help, too (except when a dysfunctional convention forces us to mangle names) (tools available)

You'll need a way to parse names in code so that, say, "checkAvailability" becomes "check availability", and so on.
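As a rough first-pass sketch of that parsing in Java - split on camelCase boundaries and lower-case the result:

    // A first-pass sketch: split camelCase identifiers into words, as needed
    // for keyword density or requirements-code overlap calculations.
    public class NameSplitter {

        public static String toWords(String identifier) {
            // "checkAvailability" -> "check availability"
            return identifier.replaceAll("([a-z0-9])([A-Z])", "$1 $2").toLowerCase();
        }

        public static void main(String[] args) {
            System.out.println(toWords("checkAvailability")); // prints: check availability
        }
    }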

Code that's more complex, has a low reading ease score, high keyword density and a poor alignment with the language of the problem domain may warrant more of your attention.




April 25, 2017

Learn TDD with Codemanship

20 Dev Metrics - 8. Test Assurance

Number 8 in my Twitter series 20 Dev Metrics is a very important one, when it comes to minimising the cost of changing code - Test Assurance.

If there's one established fact about software development, it's that the sooner we catch bugs, the cheaper they are to fix. When we're changing code, the sooner we find out if we broke it, the better. This can have such a profound effect on the cost of changing software that Michael Feathers, in his book 'Working Effectively with Legacy Code', defines 'legacy code' as code for which we have no automated tests.

The previous metric, Changes per Test Run, gives us a feel for how frequently we're running our tests - an often-forgotten factor in test assurance - but it can't tell us how effective our tests would be at catching bugs.

We want an answer to the question: "If this code was broken, would we know?" Would one or more tests fail, alerting us to the problem? The probability that our tests will catch bugs is known as "test assurance".

Opinions differ on how best to measure test assurance. Probably because it's easier to measure, a lot of teams track code coverage of tests.



For sure, code that isn't covered by tests isn't being tested at all. If only half your code's covered, then the probability of bugs being caught in the half that isn't is definitely zero.

But just because code is executed in a test, that doesn't necessarily mean it's being tested. Arguably a more meaningful measure of assurance can be calculated by doing mutation testing - deliberately introducing errors into lines of code and then seeing if any tests fail.



Mutation testing's a bit like committing burglaries to measure how effective the local police are at detecting crimes. Be careful with automated mutation testing tools, though; they can throw up false positives (e.g., when a convergent iterative loop just takes a bit longer to find the right answer after changing its seed value). But most of these tools are configurable, so you can learn which mutations work and which tend to throw up false positives, and adapt your configuration.
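To make the idea concrete, here's a hand-rolled example in Java of the kind of mutation a tool would generate automatically (the method and test are invented purely for illustration):

    import static org.junit.Assert.assertTrue;
    import org.junit.Test;

    public class WithdrawalTest {

        // Original production code (invented for illustration)
        static boolean canWithdraw(double balance, double amount) {
            return amount <= balance;
        }

        // The same logic after a typical mutation: <= weakened to <.
        // A mutation testing tool would generate this change automatically.
        static boolean canWithdrawMutant(double balance, double amount) {
            return amount < balance;
        }

        // This boundary test passes against the original but would fail against
        // the mutant, "killing" it. A suite with no boundary test lets the mutant
        // survive - even though line coverage is 100%.
        @Test
        public void canWithdrawExactBalance() {
            assertTrue(canWithdraw(100.0, 100.0));
            // assertTrue(canWithdrawMutant(100.0, 100.0)); // would fail
        }
    }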






April 24, 2017

Learn TDD with Codemanship

20 Dev Metrics - 7. Changes per Test Run

Day seven in my Twitter series 20 Dev Metrics, and we turn our attention to frequency of testing. The economics of software development are very clear that the sooner we catch problems, the cheaper they are to fix. Therefore, the sooner we re-test the software after making a change, the better.

A good metric would be "Mean Time to Re-testing", but this is a difficult one to collect (you'd essentially need to write custom plug-ins for your IDE and test runner and be constantly monitoring).

A decent compromise is Changes per Test Run, which gives us a rough idea of how much code changes vs. how much testing we do in a given period. A team attempting to do Waterfall delivery would have a very high ratio, because they leave testing until very late and do it manually, perhaps only a handful of times. A team that runs automated tests overnight would have a lower ratio. And a team doing TDD would have the lowest ratio.
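To put some made-up numbers on it: a team churning 10,000 lines of code in a month and running its full regression tests twice in that month scores 5,000 changes per test run; a TDD team churning the same amount but running its automated suite 1,000 times scores 10.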



To calculate this metric, we need to take the measure of code churn we used in a previous dev metric (tools are available for Git, Subversion, CVS etc. to collect this), and tally each time we execute our regression tests - be it manual testing or automated. If it's automated and we're using an open source testing tool, we can usually adapt the test runner to keep score. A custom JUnit runner, for example, could ping a metrics data server on the network asynchronously, which just keeps a running total for that day/week/month.
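As a rough sketch of that idea, a JUnit 4 RunListener can be registered with the test runner to notify a metrics server whenever the suite finishes. (The endpoint URL here is invented - substitute whatever service keeps your running tally.)

    import java.net.HttpURLConnection;
    import java.net.URL;
    import org.junit.runner.Result;
    import org.junit.runner.notification.RunListener;

    // Counts test runs by pinging a metrics server whenever the JUnit 4 suite
    // finishes. The endpoint is made up; point it at whatever keeps your tally.
    public class TestRunTally extends RunListener {

        @Override
        public void testRunFinished(Result result) {
            new Thread(() -> {  // fire-and-forget so we don't slow the build down
                try {
                    HttpURLConnection conn = (HttpURLConnection)
                            new URL("http://metrics.example.local/test-runs").openConnection();
                    conn.setRequestMethod("POST");
                    conn.getResponseCode(); // we only care that the run was counted
                    conn.disconnect();
                } catch (Exception ignored) {
                    // metrics collection should never break the build
                }
            }).start();
        }
    }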



That passes no comment at all, of course, on how effective those tests are at catching bugs, which we'll leave for another day.




April 23, 2017

Learn TDD with Codemanship

20 Dev Metrics - 6. Complexity (Likelihood of Failure)

The sixth in my Twitter series 20 Dev Metrics has proven, over the decades, to be a reasonably good predictor of which code is most likely to have bugs - Complexity.

Code, like all machines, has a greater probability of going wrong when there are more things in it that could be wrong - more ways of being wrong. It really is as straightforward as that.

There are different ways of measuring code complexity, and they all have their merits. Size is an obvious one. Two lines of code tend to have twice as many ways of being wrong as one. It's not a linear relationship, though: four lines of code can be more than twice as likely as two lines to be buggy, depending on the extent to which the lines interact with each other. To illustrate with a metaphor: four people are much more than twice as likely as two people to have a disagreement. The likelihood of failure grows exponentially with code size.

Another popular measure is cyclomatic complexity. This tells us how many unique paths through a body of code exist, and therefore how many tests it might take to cover every path.
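As a rough, invented illustration of the counting: cyclomatic complexity starts at 1 for the method itself, plus 1 for each decision point (if, else if, loop condition, case, and so on).

    public class ComplexityExample {

        // This invented method has a loop condition, an if and an else-if,
        // so its cyclomatic complexity is 1 + 3 = 4 - and it would take
        // around four tests to cover every path through it.
        public static double totalWithDiscount(double[] prices, boolean loyaltyMember) {
            double total = 0;
            for (double price : prices) {   // +1
                total += price;
            }
            if (loyaltyMember) {            // +1
                total *= 0.9;
            } else if (total > 100) {       // +1
                total *= 0.95;
            }
            return total;
        }
    }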



Less popular, but still useful, is Halstead Volume, which was used back in the CMMI days as a predictor of the maintenance cost of code. It's a bit more sophisticated than measuring lines of code, calculating the size of 'vocabulary' that the code uses, so a line of code that employs more variables - and therefore has more ways of being wrong - would have a greater Halstead Volume.

All of these metrics are easily calculated, and many tools are available to collect them. They make most sense at the method level, since that's the smallest unit of potential test automation. But complexity aggregated at the class level can indicate classes that do too much and need splitting up.





April 22, 2017

Learn TDD with Codemanship

20 Dev Metrics - 5. Impact of Failure

The fifth in my Twitter series 20 Dev Metrics builds on a previous metric, Feature Usage, to estimate the potential impact on end users when a piece of code fails.

Impact of Failure can be calculated by determining the critical path through the code (the call stack, basically) in a usage scenario and then assigning the value of Feature Usage to each method (or block of code, if you want to get down to that level of detail).



We do this for each usage scenario or use case, and add Feature Usage to methods every time they're in the critical path for the current scenario.

So, if a method is invoked in many high-traffic scenarios, our score for Impact of Failure will be very high.
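Here's a minimal sketch of that aggregation in Java, with invented scenario names, usage figures and call paths, purely to show the sums:

    import java.util.*;

    // Each method on a scenario's critical path gets that scenario's
    // Feature Usage added to its Impact of Failure score.
    public class ImpactOfFailure {

        public static void main(String[] args) {
            Map<String, Integer> featureUsage = Map.of(
                    "checkout", 1000,   // uses per day (hypothetical)
                    "reorder", 200);

            Map<String, List<String>> criticalPaths = Map.of(
                    "checkout", List.of("OrderController.place", "PricingService.total", "PaymentGateway.charge"),
                    "reorder", List.of("OrderController.place", "PricingService.total"));

            Map<String, Integer> impact = new HashMap<>();
            criticalPaths.forEach((scenario, methods) ->
                    methods.forEach(method ->
                            impact.merge(method, featureUsage.get(scenario), Integer::sum)));

            // OrderController.place and PricingService.total score 1200,
            // PaymentGateway.charge scores 1000
            impact.forEach((method, score) -> System.out.println(method + " = " + score));
        }
    }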

What you'll get at the end is a kind of "heat map" of your code, showing which parts of your code could do the most damage if they broke. This can help you to target testing and other quality assurance activities at the highest-value code more effectively.

This is, of course, a non-trivial metric to collect. You'll need a way to record what methods get invoked when, say, you run your customer tests. And each customer test will need an associated value for Feature Usage. Ideally, this would be a feature of customer testing tools. But for now, you'll have some DIY tooling to do, using whatever instrumentation and/or meta-programming facilities your language has available. You could also use a code coverage reporting tool, generating reports one customer test at a time to see which code was executed in that scenario.

In the next metric, we'll look at another factor in code risk that we can use to help us really pinpoint QA efforts.




April 21, 2017

Learn TDD with Codemanship

20 Dev Metrics - 4. Feature Usage

The next few metrics in my 20 Dev Metrics Twitter series are going to help us understand which parts of our code present the greatest risk of failure.

One metric that I'm always amazed to learn teams aren't collecting is Feature Usage. For a variety of reasons, it's useful to know which features of your software are used all the time and which features are used rarely (if ever).



When it comes to the quality of our code, feature usage is important because it guides our efforts towards things that matter more. It's one component of risk of failure that we need to know in order to effectively gauge how reliable any line of code (or method, or module, or component/service) needs to be. (We'll look at other components soon.)

Keeping logs of which features are being used is relatively straightforward; for a web application, your web server's logs might reveal that information (if the action being performed is revealed in the URL somehow). If you can't get them for free like that, though, adding usage logging needs some thought.

What you definitely don't want to do is add logging code willy-nilly all across your code. That way lies madness. Look for ways to encapsulate usage logging in a single part of the code. For example, for a desktop application, using the Command pattern offers an opportunity to start with a Command base class or template class that logs an action, then invokes the helper method that does the actual work. Logs can be stored in memory during the user's session and occasionally sent as a small batch to a shared data store if performance is a problem.
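A bare-bones sketch of that Command base class might look like this (the names are invented, and the in-memory buffer stands in for whatever batching and shared data store you'd actually use):

    import java.util.ArrayList;
    import java.util.List;

    // The abstract command records the usage, then hands off to the
    // subclass for the real work - so usage logging lives in one place.
    public abstract class LoggedCommand {

        // In-memory buffer of feature names used this session; a real
        // implementation would flush this to a shared store in small batches.
        private static final List<String> usageLog = new ArrayList<>();

        public final void execute() {
            usageLog.add(getClass().getSimpleName()); // one log entry per feature use
            doExecute();                              // the actual work
        }

        protected abstract void doExecute();
    }

Each concrete command (say, a hypothetical CheckAvailability class) then only has to implement doExecute(), and usage logging never leaks into the rest of the code.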

For a web application or web service, the Front Controller pattern can be used to implement logging before any work is done. (e.g., for an ASP.NET application, a custom logging HttpModule.) Whatever the architecture, there's usually a way.

And feature usage is good to know for other reasons, of course. Marketing and product management would be interested, for a start. We're all guessing at what will have value. As soon as we deliver working software that people are using, it's time to test our theories.



April 20, 2017

Learn TDD with Codemanship

20 Dev Metrics - 3. Dev Effort Breakdown

The third in my Twitter series 20 Dev Metrics is Development Effort Breakdown.



I've come across many, many organisations who have software and systems that, with each release, require more and more time fixing bugs instead of adding or changing features. Things can get so bad that some products end up with almost all the available time being spent on fixes, and users having to wait months or years for requested changes - if they ever come at all.

This can get very expensive, too. A company who made a call centre management product found themselves needing multiple teams maintaining multiple versions of the product that were in use, all devoting most of their time just to bug fixes.

I'm suggesting this metric instead of the classic "defects per thousand lines of code", because what matters most about bugs - after the trouble they cause end users - is how they soak up developer time, leaving less time to add value to the product.

To track this metric, teams need to record roughly how much time they've spent completing work (with the usual caveat about definitions of "done"). If you're using story cards, a simple system is for the developers to take a moment to record how many person-days/hours were spent on it on the card itself. Your project admin can then tot up the numbers in a spreadsheet.

Remember to include bugs reported in testing as well as production. If someone reported it, and someone had to fix it, then it counts.
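If a spreadsheet feels too manual, the tallying itself is trivial. Here's a throwaway sketch with invented card data, just to show the calculation - bug-fix effort as a share of total completed effort:

    import java.util.List;

    public class EffortBreakdown {

        record Card(String title, double personDays, boolean bugFix) {}

        public static void main(String[] args) {
            List<Card> done = List.of(
                    new Card("Add export to CSV", 3.0, false),
                    new Card("Fix crash on login", 1.5, true),
                    new Card("Fix rounding error in invoices", 2.0, true));

            double total = done.stream().mapToDouble(Card::personDays).sum();
            double fixes = done.stream().filter(Card::bugFix).mapToDouble(Card::personDays).sum();

            System.out.printf("%.0f%% of dev effort spent on bug fixes%n", 100 * fixes / total);
            // prints: 54% of dev effort spent on bug fixes
        }
    }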



April 19, 2017

Learn TDD with Codemanship

20 Dev Metrics - 2. Cost of Changing Code

The second in my Twitter series 20 Dev Metrics is the Cost of Changing Code. This is a brute force metric - anything involving measuring lines of code tends to be - so handle with care. The temptation will be for teams to game this, and it's very easily gamed. As with all metrics, cost of changing code should be used as a diagnostic tool, not weaponised as a target or wielded as a stick to beat dev teams with.

NB: My advice is to collect these metrics under the radar, and use them to provide hindsight when establishing a link between the way you write code and the results the business gets.

The aim is to establish the trend for cost of change, so we can link it to our first metric, Lead Time to Delivery. There are two things we need to track in order to calculate this simple metric.

1. How much code is changing (lines added, modified and deleted) - often called "code churn". There are tools available for various VCS solutions that can do this for you.



2. How much that change cost (in developer time or wages). Your project admin should have this information available in some form. If there's one thing teams are measuring, it's usually cost.



Divide developer cost by code churn, and - hey presto - you have the cost of changing a line of code.
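For example, with made-up numbers: if the team's wages and overheads for a month come to £50,000, and the code churn for that month is 10,000 lines added, modified or deleted, then the cost of change for that month is £5 per line. Plot that month by month and watch the trend.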

All teams experience an increase in the cost of changing code, but better teams tend to experience a flatter gradient. The worst teams have a "hockey stick" trend, where change becomes prohibitively expensive alarmingly soon. e.g., one software company, after 8 years, had a cost of change 40x higher than at the start. At the same time, they tend to see lead times growing exponentially, as delivery gets slower and slower.

I dare your boss not to care about these first two metrics!