November 9, 2006
Code Tremors

Now maybe I've got this wrong, but I've been trawling through the web looking for original research on the interconnectedness of software, and how that plays a part in the potential impact of changes. As I see it, this is a ubiquitous problem - like my lovely forest fires or the famous sandpile experiment. When systems are made up of lots of interconnected parts, changes to the system can propagate along those connections.
In the case of forest fires, flames can spread from one tree to a neighbouring tree. In the case of the sandpile experiment, one slipping grain of sand can cause a connected grain to slip. In both cases, the change might propagate only locally - having a small impact - or, in rare cases, it can propagate throughout huge areas of the system. Most forest fires only spread to a few acres, but every once in a blue moon they can consume millions of acres of forest.
Total system collapses, like huge forest fires (or earthquakes, or stock market crashes) can be triggered by the tiniest changes to the system. In the sandpile experiment, grains of sand are dropped randomly onto a table top. Over time, the sand builds up. And all the time, there are small - sometimes imperceptible - slips in the sand pile. But as the sand builds up and up and up, it reaches a point where the interconnectedness of the pile is so enormous that dropping a single grain of sand anywhere could cause a huge collapse.
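The sandpile experiment is easy to play with in code. Here's a minimal sketch of the classic Bak-Tang-Wiesenfeld sandpile model (the grid size, drop count and toppling threshold of 4 are the standard textbook choices, not anything special): drop grains one at a time, and whenever a cell holds 4 or more grains it topples, passing one grain to each neighbour. The "avalanche size" is how many topplings a single dropped grain sets off.

```python
import random

def sandpile(size=20, drops=5000, seed=1):
    """Bak-Tang-Wiesenfeld-style sandpile on a size x size grid.
    Cells holding 4+ grains topple, sending one grain to each of
    their 4 neighbours; grains falling off the edge are lost.
    Returns the avalanche size (topplings) caused by each drop."""
    rng = random.Random(seed)
    grid = [[0] * size for _ in range(size)]
    avalanches = []
    for _ in range(drops):
        x, y = rng.randrange(size), rng.randrange(size)
        grid[x][y] += 1
        topples = 0
        unstable = [(x, y)] if grid[x][y] >= 4 else []
        while unstable:
            i, j = unstable.pop()
            if grid[i][j] < 4:
                continue          # already relaxed by an earlier topple
            grid[i][j] -= 4
            topples += 1
            for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if 0 <= ni < size and 0 <= nj < size:
                    grid[ni][nj] += 1
                    if grid[ni][nj] >= 4:
                        unstable.append((ni, nj))
        avalanches.append(topples)
    return avalanches

sizes = sandpile()
# Most drops cause no toppling at all; a small number trigger long cascades.
```

Once the pile has built up to its critical state, the same single-grain drop can do nothing at all or set off a cascade that sweeps much of the grid - which is exactly the unpredictability described above.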
This state of maximum interconnectedness is sometimes referred to as the tipping point. At the tipping point, huge system collapses - phase transitions - are very much more likely and can be set off by the most innocuous events. To explain why, we need to go back to the forest fire game.
Imagine our forest area is a grid of squares. Trees are placed randomly in the grid, one at a time. At regular intervals - say, every twenty trees - a match is dropped on a random square. If the square contains a tree, then the tree burns. Trees in adjoining squares might also burn. Let's say that the chance of the fire spreading to an adjoining square is 50:50.
The probability that the fire might spread right across the grid is very, very small. But it gets much greater as the grid becomes more and more populated. The more trees there are next to each other, the more chances there are for the fire to spread. At the tipping point, the probability of this happening is at its highest.
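The forest fire game described above can be sketched directly. This is my own toy implementation (the grid size, planting batch and stopping density are arbitrary choices for illustration): plant trees on random squares, drop a match every twenty trees, and let fire spread to each adjacent tree with a 50:50 chance.

```python
import random

def forest_fire_game(size=50, trees_per_match=20, spread_chance=0.5, seed=2):
    """Plant trees one at a time on random squares; after every
    `trees_per_match` plantings, drop a match on a random square.
    If it hits a tree, fire spreads to each adjacent tree with
    probability `spread_chance`. Burned trees are removed.
    Returns the size of every fire, in order."""
    rng = random.Random(seed)
    trees = set()
    fires = []
    while len(trees) < (size * size) // 2:       # stop at 50% tree density
        for _ in range(trees_per_match):          # plant a batch of trees
            trees.add((rng.randrange(size), rng.randrange(size)))
        target = (rng.randrange(size), rng.randrange(size))
        if target not in trees:
            continue                              # match landed on bare ground
        burning, burned = [target], {target}
        while burning:
            i, j = burning.pop()
            for n in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if n in trees and n not in burned and rng.random() < spread_chance:
                    burned.add(n)
                    burning.append(n)
        trees -= burned                           # burned trees are gone
        fires.append(len(burned))
    return fires

fires = forest_fire_game()
# Early fires are tiny; larger ones become possible as the grid fills up.
```

Running this shows the effect in the paragraph above: while the grid is sparse, fires fizzle out after a tree or two, but as density climbs towards the tipping point the same dropped match can take out a sizeable cluster.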
We see the same effect in countless phenomena: stock market fluctuations, for example, where the connections exist between people buying and selling stocks. There's a finite probability that Jim selling his Microsoft stock might cause Jane to sell hers, creating fluctuations in the price. Every so often, this effect spreads much, much further and we get a run on a stock, which can - on very rare occasions - cause a catastrophic collapse of the entire market, like a million-acre forest fire.
And there's surely no doubting the interconnectedness of software. Anyone who has maintained legacy code will be painfully aware of it. So does this ubiquitous tipping point exist in software, too? Does software reach a point of such interconnectedness that even the tiniest change to one part of the code could propagate throughout the entire code base? And if it's impossible to predict which grain of sand will cause the pile to collapse, is it equally impossible to predict which change might cause a "codebase collapse" until we actually try to do it?
One hint might be in the distribution of what I'm going to call code tremors, and their relative sizes. Earthquakes are another product of interconnectedness (in their case, the interconnectedness of rock slipping and sliding over other rock). When scientists examined a catalogue of measurements of the sizes of thousands of earth tremors and quakes, they discovered that the incidences of the smaller quakes outnumbered the larger quakes by an order of magnitude. So quakes that measured 1 on the Richter scale were an order of magnitude more common than quakes that measured 2. And quakes that measured 3 on the scale were an order of magnitude less common than those that measured 2. And so on. The biggest quakes were very rare - happening every few decades - but small tremors seemed to happen on a daily basis.
In the same way, there are small fluctuations in the stock market happening all the time, but stock market crashes happen only once every few decades (thankfully). And there are small forest fires breaking out all the time, but fires that wipe out millions of acres of forest happen maybe once in a lifetime.
Scientists refer to this as a power law distribution. In statistics, a power law distribution is an indicator that the same underlying mechanism might be at work. It suggests some kind of interconnectedness and the presence of a tipping point - a critical mass where catastrophic system collapses become most probable - lurking in the problem somewhere.
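To see what that order-of-magnitude pattern looks like in numbers, here's a quick sketch that draws event sizes from a power law distribution (a Pareto distribution with exponent 1 - my choice for illustration, not fitted to any real quake catalogue) and counts events per order of magnitude, the way quake magnitudes are binned:

```python
import random

rng = random.Random(4)
# Draw 100,000 event sizes from a power law (Pareto, exponent 1).
sizes = [rng.paretovariate(1.0) for _ in range(100_000)]

# Count events in each order-of-magnitude band, like quake magnitudes.
for decade in range(4):
    lo, hi = 10 ** decade, 10 ** (decade + 1)
    count = sum(lo <= s < hi for s in sizes)
    print(f"{lo:>5} - {hi:<6}: {count}")
# Each band comes out roughly ten times rarer than the one before it.
```

Each step up in size is roughly ten times rarer than the last - small tremors every day, the monster quake once in decades - which is the signature we'd be hunting for in code.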
If this applies to code, then we might expect to find the same kind of power law distribution when we examine the impact of changes. We would expect most changes to have minimal impact, and as the impact of a change increases, we would expect the incidence of such impacts to fall away by orders of magnitude. And every once in a lifetime we might expect a tiny change in one part of the system to cause a catastrophic collapse of the codebase.
Changes to modules have a finite probability of propagating to dependent modules (change propagates from supplier to client).
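That supplier-to-client rule is enough to build a toy model of code tremors. This is purely my own sketch - the module count, dependency count and propagation probability are made-up parameters, not measurements from any real codebase: wire modules into a random dependency graph, change one at random, and let the change ripple to each client with some fixed probability.

```python
import random

def change_impact(modules=200, deps_per_module=3, p_propagate=0.3,
                  changes=2000, seed=3):
    """Toy model of change propagation: each module depends on a few
    random suppliers; clients[m] lists the modules that depend on m.
    A change to a module propagates to each of its clients with
    probability p_propagate. Returns the impact size (number of
    modules touched) of each simulated change."""
    rng = random.Random(seed)
    clients = {m: [] for m in range(modules)}
    for m in range(modules):
        for supplier in rng.sample(range(modules), deps_per_module):
            if supplier != m:
                clients[supplier].append(m)   # m is a client of supplier
    impacts = []
    for _ in range(changes):
        start = rng.randrange(modules)
        touched = {start}
        frontier = [start]
        while frontier:
            supplier = frontier.pop()
            for client in clients[supplier]:
                if client not in touched and rng.random() < p_propagate:
                    touched.add(client)       # change ripples to the client
                    frontier.append(client)
        impacts.append(len(touched))
    return impacts

impacts = change_impact()
# Most changes stay local; the occasional one cascades across many modules.
```

With roughly three clients per module and a 30% chance per dependency, most changes touch only the module itself - but the tail of the impact distribution is exactly where the interesting "code tremor" statistics would show up.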
One of the areas I'm keen to explore is how we might reason about the probability of change being propagated by looking at the nature of the dependency along which it might travel (e.g., changing the name of a method). And while syntactic dependencies are well understood, how might we reason about semantic dependencies, where a change in behaviour might break a dependent module if that module is left unchanged? Can we measure this interconnectedness and build a picture of what the tipping point looks like? Lots of lovely new questions!
Sometimes I quite like my job :-)