January 18, 2017

Learn TDD with Codemanship

How Long Would Your Organisation Last Without Programmers?

A little straw poll I did recently on Twitter has proved to be especially timely after a visit to the accident & emergency ward at my local hospital (don't worry - I'm fine).

It struck me just how reliant hospitals have become on largely bespoke IT systems (that's "software" to me and you). From the moment you walk in to see the triage nurse, there's software - you're surrounded by it. The workflow of A&E is carefully controlled via a system they all access. There are computerised machines that take your blood pressure, monitor your heart, peer into your brain and build detailed 3D models, and access your patient records so they don't accidentally cut off the wrong leg.

From printing out prescriptions to writing notes to your family doctor, it all involves computers now.

What happens if all the software developers mysteriously disappeared overnight? There'd be nobody to fix urgent bugs. Would the show-stoppers literally stop the show?

I can certainly see how that would happen in, say, a bank. And I've worked in places where - without 24/7 bug-fixing support - they'd be completely incapable of processing their overnight orders, creating a massive and potentially un-shiftable backlog that could crush their business in a few weeks.

Ultimately, DR is all about coping in the short term, and getting business-as-usual (or some semblance of it) up and running as quickly as possible. It can delay, but not avoid, the effects of having nobody who can write or fix code.

And I'm aware that big organisations have "disaster recovery" plans. I've been privy to quite a few in my lofty position as Chief Technical Arguer in some very large businesses. But all the DR plans I've seen have never asked "what happens if there's nobody to fix it?"

Smart deployers, of course, can just roll back a bad release to the last one that worked, ensuring continuity... for a while. But I know business code: even when it's working, it's often riddled with unknown bugs, waiting to go off, like little business-killing landmines. I've fixed bugs in COBOL that were written in the 1960s.

Realistically, rolling back your systems by potentially decades is not an option. You probably don't even have access to that old code. Or, if you do, someone will have to re-type it all in from code listings kept in cupboards.

And even if you could revert your way back to a reliable system with potential longevity, without the ability to change and adapt those systems to meet future needs will soon start to eat away at your organisation from the inside.

It's food for thought.

October 15, 2016

Learn TDD with Codemanship

If Your Code Was Broken, Would You Know?

I've been running a little straw poll among friends and clients, as well as on social media, to get a feel for what percentage of development teams routinely (or continuously) measure the level of assurance their automated regression tests give them.

For me, it's a fundamental question: if my code was broken, would I know?

The straw poll suggests that about 90% of teams don't ask that question often, and 80% don't ask it at all.

The whole point of automated tests is to give us early, cheap detection of new bugs that we might have introduced as we change the code. So profound is their impact, potentially, that Michael Feathers - in his book Working Effectively With Legacy Code - defines "legacy code" as code for which we have no automated tests.

I've witnessed first-hand the impact automating regression tests can have on delivery schedules and development costs. Which is why that question is often on my mind.

The best techniques I know for "testing your tests" are:

1. A combination of the "golden rule" of Test-Driven Development (only write source code if a failing test requires it, so all the code is executed by tests), and running tests to make sure their assertions fail when the result is wrong.

2. Mutation testing - deliberately introducing programming errors to see if the tests catch them

I put considerable emphasis on the first practice. As far as I'm concerned, it's fundamental to TDD, and a habit test-driven developers need to get into. Before you write the simplest code to pass a test, make sure it's a good test. If the answer was wrong, would this test fail?

The second practice, mutation testing, is rarely applied by teams. Which is a shame, because it's a very powerful technique. Code coverage tools only tell what code definitely isn't being executed in tests. Mutation testing tells us what code isn't being meaningfully tested, even if it is being executed by tests. It specifically asks "If I broke this line of code, would any tests fail?"

The tools for automated mutation testing have improved greatly in recent years, and support across programming languages is growing. If you genuinely want to know how much assurance your tests can give you - i.e., how much confidence you can have that the code really works - then you need to give mutation testing a proper look.

Here are some mutation testing tools that might be worth having a play with:

Java - PIT

C# - VisualMutator

C/C++ - Plextest

Ruby - Mutant

Python - Cosmic Ray

PHP - Humbug

September 13, 2016

Learn TDD with Codemanship

4 Things You SHOULDN'T Do When The Schedule's Slipping

It takes real nerve to do the right thing when your delivery date's looming and you're behind on your plan.

Here are four things you should really probably avoid when the schedule's slipping:

1. Hire more developers

It's been over 40 years since the publication of Fred L. Brooks' 'The Mythical Man-Month'. This means that our industry has known for almost my entire life that adding developers to a late project makes it later.

Not only is this born out by data on team size vs. productivity, but we also have a pretty good idea what the causal mechanism is.

Like climate change, people who reject this advice should not be called "skeptics" any more. In the face of the overwhelming evidence, they're Small Team Deniers.

Hiring more devs when the schedule's slipping is like prescribing cigarettes, boxed sets and bacon for a patient with high blood pressure.

2. Cut corners

Still counterintuitively, for most software managers, the relationship between software quality and the time and cost of delivery is not what most of us think it is.

Common sense might lead us to believe that more reliable software takes longer, but the mountain of industry data on this clearly shows the opposite in the vast majority of cases.

To a point - and it's a point 99% of teams are in no danger of crossing - it actually takes less effort to deliver more reliable software.

Again, the causal mechanism for this is well understood. And, again, anyone who rejects the evidence is not a "skeptic"; they're a Defect Prevention Denier.

The way to go faster on 99% of projects is to slow down, and take more care.

3. Work longer hours

Another management myth that's been roundly debunked by the evidence is that, when a software delivery schedule's slipping significantly, teams can get back on track by working longer hours.

The data very clearly shows that - for most kinds of work - longer hours is a false economy. But it's especially true for writing software, which requires a level of concentration and focus that most jobs don't.

Short spurts of extra effort - maybe the odd weekend or late night - can make a small difference in the short term, but day after day, week after week overtime will burn your developers out faster than you can say "get a life". They'll make stupid, easily avoidable mistakes. And, as we've seen, mistakes cost exponentially more to fix than to avoid. This is why teams who routinely work overtime tend to have lower overall productivity: they're too busy fighting their own self-inflicted fires.

You can't "cram" software development. Like your physics final exams, if you're nowhere near ready a week before, then you're not gong to be ready, and no amount of midnight oil and caffeine is going to fix that.

You'll get more done with teams who are rested, energised, feeling positive, and focused.

4. Bribe the team to hit the deadline

Given the first three points we've covered here, promising to shower the team with money and other rewards to hit a deadline is just going to encourage them to make those mistakes for you.

Rewarding teams for hitting deadlines fosters a very 1-dimensional view of software development success. It places extra pressure on developers to do the wrong things: to grow the size of their teams, to cut corners, and to work silly hours. It therefore has a tendency to make things worse.

The standard wheeze, of course, is for teams to pretend that they hit the deadline by delivering something that looks like finished software. The rot under the bonnet quickly becomes apparent when the business then expects a second release. Now the team are bogged down in all the technical debt they took on for the first release, often to the extent that new features and change requests become out of the question.

Yes, we hit the deadline. No, we can't make it any better. You want changes? Then you'll have to pay us to do it all over again.

Granted, it takes real nerve, when the schedule's slipping and the customer is baying for blood, to keep the team small, to slow down and take more care, and to leave the office at 5pm.

Ultimately, the fate of teams rests with the company cultures that encourage and reward doing the wrong thing. Managers get rewarded for managing bigger teams. Developers get rewarded for being at their desk after everyone else has gone home, and appearing to hit deadlines. Perversely, as an industry, it's easier to rise to the top by doing the wrong thing in these situations. Until we stop rewarding that behaviour, little will change.

May 25, 2016

Learn TDD with Codemanship

How Many Bugs In Your Code *Really*?

A late night thought, courtesy of the very noisy foxes outside my window.

How many bugs are lurking in your code that you don't know about?

Your bug tracking database may suggest you have 1 defect per thousand lines of code (KLOC), but maybe that's because your tests aren't very thorough. Or maybe it's because you deter users from reporting bugs. I've seen it all, over the years.

But if you want to get a rough idea of how many bugs there are really, you can use a kind of mutation testing.

Create a branch of your code and deliberately introduce 10 bugs. Do your usual testing (manual, automated, whatever it entails), and keep an eye on bugs that get reported. Stop the clock at the point you'd normally be ready to ship it. (But if shipping it *is* your usual way of testing, then *start* the clock there and wait a while for users to report bugs.)

How many of those deliberate bugs get reported? If all 10 do, then the bug count in your database is probably an accurate reflection of the actual number of bugs in the code.

If 5 get reported, then double the bug count in your database. If your tracking says 1 bug/KLOC, you probably have about 2/KLOC.

If none get reported, then your code is probably riddled with bugs you don't know about (or have chosen to ignore.)

April 25, 2016

Learn TDD with Codemanship

Mutation Testing & "Debuggability"

More and more teams are waking up to the benefit of checking the levels of assurance their automated tests give them.

Assurance, as opposed to coverage, answers a more meaningful question about our regression tests: if the code was broken, how likely is it that our tests would catch that?

To answer that question, you need to test your tests. Think of bugs as crimes in your code, and your tests as police officers. How good are your code police at detecting code crimes? One way to check would be to deliberately commit code crimes - deliberately break the code - and see if any tests fail.

This is a practice called mutation testing. We can do it manually, while we pair - I'm a big fan of that - and we can do it using one of the increasingly diverse (and rapidly improving) mutation testing tools available.

For Java, for example, there are tools like Jester and PIT. What they do is take a copy of your code (with unit tests), and "mutate" it - that is, make a single change to a line of code that (theoretically) should break it. Examples of automated mutations include turning a + into a -, or a < into <=, or ++ into --, and so on.

After it's created a "mutant" version of the code, it runs the tests. If one or more tests fail, then they are said to have "killed the mutant". If no test fails, then the mutant survives, and we may need to have a think about whether that line of code that was mutated is being properly tested. (Of course, it's complicated, and there will be some false positives where the mutation tool changed something we don't really care about. But the results tend to be about 90% useful, which is a boon, IMHO.)

Here's a mutation testing report generated by PIT for my Combiner spike:

Now, a lot of this may not be news for many of you. And this isn't really what this blog post is about.

What I wanted to draw your attention to is that - once I've identified the false positives in the report - the actual level of assurance looks pretty high (about 95% of mutations I cared about got killed.) Code coverage is also pretty high (97%).

While my tests appear to be giving me quite high assurance, I'm worried that may be misleading. When I write spikes - intended to as proof of concept and not to be used in anger - I tend to write a handful of tests that work at a high level.

This means that when a test fails, it may take me some time to pinpoint the cause of the problem, as it may be buried deep in the call stack, far removed from the test that failed.

For a variety of good reasons, I believe that tests should stick close to the behaviour being tested, and have only one reason to fail. So when they do fail, it's immediately obvious where and what the problem might be.

Along with a picture of the level of assurance my tests give me, I'd also find it useful to know how far removed from the problem they are. Mutation testing could give me an answer.

When tests "kill" a mutant version of the code, we know:

1. which tests failed, and
2. where the bug was introduced

Using that information, we can calculate the depth of the call stack between the two. If multiple tests catch the bug, then we take the shallowest depth out of those tests.

This would give me an idea of - for want of a real word - the debuggability of my tests (or rather, the lack of it). The shallower the depth between bugs and failing tests, the higher the debuggability.

I also note a relationship between debuggability and assurance. In examining mutation testing reports, I often find that the problem is that my tests are too high-level, and if I wrote more focused tests closer to the code doing that work, they would catch edge cases I didn't think about at that higher level.

April 23, 2016

Learn TDD with Codemanship

Does Your Tech Idea Pass The Future Dystopia Test?

One thing that at times fascinates and at times appals me is the social effect that web applications can have on us.

Human beings learn fast, but evolve slowly. Hence we can learn to program a video recorder, but living a life that revolves around video recorders can be toxic to us. For all our high-tech savvy, we are still basically hominids, adapted to run from predators and pick fleas off of each other, but not adapted for Facebook or Instagram or Soundcloud.

But the effects of online socialisation are now felt in the Real World - you know, the one we used to live in? People who, just 3-4 years ago, were confined to expressing their opinions on YouTube are now expressing them on my television and making pots of real money.

Tweets are building (and ending) careers. Soundcloud tracks are selling out tours. Facebook viral posts are winning elections. MySpace users are... well, okay, maybe not MySpace users.

For decades, architects and planners obsessed over the design of the physical spaces we live and work in. The design of a school building, they theorise, can make a difference to the life chances of the students who learn in it. The design of a public park can increase or decrease the chances of being attacked in it. Pedestrianisation of a high street can breath new life into local shops, and an out-of-town shopping mall can suck the life out of a town centre.

Architects must actively consider the impact of buildings on residents, on surrounding communities, on businesses, on the environment, when they create and test their designs. Be it for a 1-bed starter home, or for a giant office complex, they have to think about these things. It's the law.

What thought, then, do software developers give to the social, economic and environmental impact of their application designs?

With a billion users, a site like Facebook can impact so many lives just by adding a new button or changing their privacy policy.

Having worked on "Web 2.0" sites of all shapes and sizes, I have yet to see teams and management go out of their way to consider such things. Indeed, I've seen many occasions when management have proposed features of such breath-taking insensitivity to wider issues, that it's easy to believe that we don't really think much about it at all. That is, until it all goes wrong, and the media are baying for our blood, and we're forced to change to keep our share price from crashing.

This is about more than reliability (though reliability would be a start).

Half-jokingly, I've suggested that teams put feature requests through a Future Dystopia Test; can we imagine a dark, dystopian, Philip K Dick-style future in which our feature has caused immense harm to society? Indeed, whole start-up premises fail this test sometimes. Just hearing some elevator pitches conjures up Blade Runner-esque and Logan's Run-ish images.

I do think, though, that we might all benefit from devoting a little time to considering the potential negative effects of what we're creating before we create it, as well as closely monitoring those effects once it's out there. Don't wait for that hysterical headline "AcmeChat Ate My Hamster" to appear before asking yourself if the fun hamster-swallowing feature the product owner suggested might not be such a good thing after all.

This blog post is gluten free and was not tested on animals

April 18, 2016

Learn TDD with Codemanship

SC2016 Mini-Project: Code Risk Heat Map

Software Craftsmanship 2016 - Mini-Project

Code Risk Heat Map

Estimated Duration: 2-4 hours

Author: Jason Gorman, Codemanship

Language(s)/stacks: Any


Create a tool that produces a "heat map" of your code, highlighting the parts (methods, classes, packages) that present the highest risk of failure.

Risk should be classified along 3 variables:

1. Potential cost of failure

2. Potential risk of failure

3. Potential system impact of failure (based on dependencies)

Use colour coding/gradation to visually highlight the hotspots (e.g., red = highest risk, green = lowest risk)

Also, produce a prioritised list of individual methods/functions in the code with the same information.


April 15, 2016

Learn TDD with Codemanship

Compositional Coverage

A while back, I blogged about how the real goal of OO design principles is composability of software - the ability to wire together different implementations of the same abstractions to make our code do different stuff (or the same stuff, differently).

I threw in an example of an Application that could be composed of different combinations of database, external information service, GUI and reporting output.

This example design offers us 81 unique possible combinations of Database, Stock Data, View and Output for our application. e.g., A Web GUI with an Oracle database, getting stock data from Reuters and writing reports to Excel files.

A few people who discussed the post with me had concerns, though. Typically, in software, more combinations means more ways for our software to be wrong. And they're quite right. How do we assure ourselves that every one of the possible combinations of components will work as a complete whole?

A way to get that assurance would be to test all of the combinations. Laborious, potentially. Who wants to write 81 integration tests? Not me, that's for sure.

Thankfully, parameterised testing, with an extra combinatorial twist, can come to the rescue. Here's a simple "smoke test" for our theoretical design above:

This parameterised test accepts each of the different kind of component as a parameter, which it plugs into the Application through the constructor. I then use a testing utility I knocked up to generate the 81 possible combinations (the code for which can be found here - provided with no warranty, as it was just a spike).

When I run the test, it checks the trade price calculation using every combination of components. Think of it like that final test we might do for a car after we've checked all the individual components work correctly - when we bolt them all together, and turn the key in the ignition, does it go?.

The term I'm using for how many possible combinations of components we've tested is compositional coverage. In this example, I've achieved 100% compositional coverage, as every possible combination is tested.

Of course, this is a dummy example. The components don't really do anything. But I've simulated the possible cost of integration tests by building in a time delay, to illustrate that these ain't your usual fast-running unit tests. In our testing pyramid, these kinds of tests would be near the top, just below acceptance and system tests. We wouldn't run them after, say, every refactoring, because they'd be too slow. But we might run them a few times a day.

More complex architectures may generate thousands of possible combinations of components, and lead to integration tests (or "composition tests") that take hours to run. In these situations, we could probably buy ourselves pretty decent compositional coverage by doing pairwise combinations (and, yes, the testing utility can do that, too).

Changing that test to use pairwise combinations reduces the number of tests run to just 9.

April 9, 2016

Learn TDD with Codemanship

Seeking Guinea Pigs For New Training Workshop on Property-Based Testing

Currently in development at Codemanship HQ is a new 1-day training workshop on property-based testing, designed to take intermediate-to-advanced TDD-ers to the next level in writing highly reliable software economically.

To help iron out any kinks, I'm keen to test-drive the workshop this summer. An exact date hasn't been confirmed yet, but some time in July or August on a Saturday in South West London. Tickets will be free in exchange for giving it a fair hearing, throwing yourself into the exercises with gusto, and providing feedback to help debug the workshop.

If it goes well, I may also ask very politely of you'd like to record a wee testimonial to help market the course when it officially launches to paying customers, for which I'll be very grateful.

If this sounds interesting, and you'd like to be contacted when registration for the trial workshop opens, drop me a line.

March 20, 2016

Learn TDD with Codemanship

Property-based Testing: Return of the Son of Formal Methods

For many years now, I've been biding my time, waiting for the day when Formal Methods go mainstream.

If you're new to the concept, Formal Methods are essentially just the mathematically precise specification and testing of software and systems.

While you can take the verification side of Formal Methods to very expensive extremes, depending on how critical the integrity of your code is, they all start with a specification.

This needs to be written in an unambiguous - and therefore testable - specification language. In the 80s and 90s, formal specification languages like Z (and its object oriented nephew, Object Z), VDM and B were invented for this purpose.

These languages were somewhat unapproachable for mainstream programmers, often using mathematical notations that probably only somebody who had knowledge of first-order logic and set theory might be familiar with. Programmers, typically, work with text.

So, as UML began to take off, Formal Methods folks invented text-based specification languages for adding precise rules to their system models, like the Object Constraint Language. Naively, some of the inventors believed that OCL would be "friendly" enough to be understood by business stakeholders. Realistically, it was friendly enough to be used by programmers.

But OCL never really gained widespread acceptance, either. And that was chiefly for two reasons: firstly, programmers already have text-based languages they can precisely specify rules in. i.e., programming languages. Martin Fowler once commented to me at a workshop on Agile and Formal Methods many moons ago that we may as well write the code if we're going to precisely specify it. It was a sentiment echoed by many in the Agile community.

Interestingly, we had no such qualms about writing executable tests for our code. That was the other reason formal specification never really took off: Test-driven Development kind of sort of stole its thunder.

There is little doubt - well, none really - that TDD can produce software that's far more reliable than average. And arguably, in the hands of the right programmer, it may be possible to match the quality standards being achieved in safety-critical software development with TDD. Someone closely associated with OCL, John Daniels, made that claim to after I gave a talk on TDD - again, a looong time ago. So it's not a new idea, that's for sure.

I also now know from practical experience that John was right: you can push TDD to the point - in terms of the discipline - where high-integrity software is achievable.

The way I've done it, and seen it done, is through refactoring the test code into a form that can serve as an easy jumping-off point for getting much higher test assurance relatively cheaply. Most commonly, refactoring a bunch of similar test methods into a single parameterised test opens up many possibilities for testing further, like combinatorial testing, random testing, and even model checking.

Tools that can generate very large numbers of inputs, through which we can more exhaustively test our code, can help us make the leap from "very reliable" software to "extremely reliable software" for a relatively small investment. For this reason, I encourage developers doing TDD to refactor to parameterised tests when the opportunity arises. They're a good practical foundation for writing high-integrity code.

When we do TDD, usually we work with specific examples of inputs and expected outputs. So we write assertions like:

assertTrue(sqrroot(16) == 4)

This doesn't work for test inputs that are being generated programmatically, though. For random and combinatorial testing, and certainly for model-checking, we need to generalise our test assertions, because we don't know the expected result in advance. So we need to write assertions that can be evaluated for any test input.

assertTrue(sqrroot(input) * sqrroot(input) == input)

While it's becoming fashionable to refer to these kinds of generalised assertions as "properties", and using tests that make generalised assertions as "property-based testing", this is - in all essence - formal specification.

Styles vary, of course. Some advocate moving the assertions into the source code itself, to be evaluated at the exact point where the rule must be satisfied, so whenever that code is executed - perhaps in debug mode - the rule is evaluated.

Writing assertions in the code about what should be true while it's executing is an idea cooked in the very early days of computing, and advocated for checking the correctness of programs by Alan Turing, among others. Most modern programming languages have something like an assert keyword to allow us to say "at this point, x should be true" (and if x isn't true, then our program is wrong).

All we need to do is plug in a whole bunch of randomly generated inputs, and we can check if average() is ever called with an empty array of numbers. If it is, then our software is broken.

Model checking tools like NASA's Java Pathfinder give us choices about how we specify the rules of our software. We can embed them in the source code using proprietary specification languages and conventions, or we can write them in JUnit tests as Theories.

What all this adds up to is a "return of the son of Formal Methods". It's not quite formal specification as I might remember it, but it most definitely is formal specification, even if it's going by a different name these days.

Personally, I strongly approve. Although tool support for property-based testing needs to evolve, and to join hands with the best of the formal methods tools out there like Pathfinder, so we can finally get a practical workflow that could take us from vanilla example-based TDD all the way up to the most rigorous forms of verification.

I have higher hopes for property-based testing, for two good reasons:

1. Programmers can write their specs in the same language they program in, so the learning curve is shallower

2. Programming languages themselves are evolving into a more compatible declarative style, making it easier to express more sophisticated rules

But I genuinely hope it catches on widely. I've been waiting a long time for it.