July 2, 2017


Conceptual Correlation - A Working Definition

During an enjoyable four days in Warsaw, Poland, I put some more thought into the idea of Conceptual Correlation as a code metric. (Hours of sitting in airports, planes, buses, taxis, trains and hotel bars give plenty of time for the mind to wander.)

I've come up with a working definition to base a prototype tool on, and it goes something like this:

Conceptual Correlation - the % of non-noise words that appear in names of things in our code (class names, method names, field names, variable names, constants, enums, etc.) that also appear in the customer's description of the problem domain in which the software is intended for use.

That is, if we were to pull out all the names from our code, parse them into their individual words (e.g., submit_mortgage_application would become "submit" "mortgage" "application"), and build a set of them, then Conceptual Correlation would be the % of that set that appeared in a similar set created by parsing, say, a FitNesse Wiki test page about submitting mortgage applications.

So, for example, a class name like MortgageApplicationFactory might have a conceptual correlation of 67% (unless, of course, the customer actually processes mortgage applications in a factory).
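
To make the definition concrete, here's a rough Python sketch of the calculation. The noise-word list, the function names and the sample requirements sentence are all assumptions made up for illustration; this is not the prototype tool, just one way the arithmetic could work.

```python
import re

# Hypothetical noise-word list; a real tool would need a much fuller one.
NOISE_WORDS = {"a", "an", "the", "of", "to", "in", "on", "for", "and", "or", "is"}


def split_name(name):
    """Split an identifier into lowercase words (handles snake_case and CamelCase)."""
    words = []
    for part in re.split(r"[^A-Za-z]+", name):
        # Either a run of capitals (an acronym) or an optionally capitalised word.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [w.lower() for w in words]


def conceptual_correlation(names, domain_text):
    """% of distinct non-noise words in the code names that also appear in the domain text."""
    code_words = {w for name in names for w in split_name(name)} - NOISE_WORDS
    domain_words = set(re.findall(r"[a-z]+", domain_text.lower())) - NOISE_WORDS
    if not code_words:
        return 0.0
    return 100.0 * len(code_words & domain_words) / len(code_words)


# Invented requirements sentence, for illustration only.
requirements = "The customer submits a mortgage application online."
score = conceptual_correlation(["MortgageApplicationFactory"], requirements)
print(round(score))
```

Here "mortgage" and "application" appear in the requirements sentence but "factory" doesn't, so two of the three words correlate.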

I would predict that a team following the practices of Domain-Driven Design would write code with a higher conceptual correlation, perhaps with just the hidden integration code (database access, etc.) bringing the % down, whereas a team that is much more solution-driven or technology-driven would write code with a relatively lower conceptual correlation.

For a tool to be useful, it would not only report the conceptual correlation (e.g., between a .NET assembly and a text file containing its original use cases), but also provide a way to visualise and access the "word cloud" to make it easier to improve the correlation.

So, if we wrote code like this for booking seats on flights, the tool would bring up a selection of candidate words from the requirements text to replace the non-correlated names in our code with.

I currently envisage this popping up as an aid when we use a Rename refactoring, perhaps accentuating words that haven't been used yet.
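
As a rough sketch of how that candidate list might be built, the Python below pulls the non-noise words out of some requirements text and sorts them so that words not yet used in the code come first. The function name, the noise-word list and the sample text are all hypothetical; how a real tool would hook into a Rename refactoring is a separate question.

```python
import re
from collections import Counter

# Hypothetical noise words; a real tool would need a fuller list.
NOISE_WORDS = frozenset({"a", "an", "the", "on", "for", "to", "is", "of"})


def rename_candidates(domain_text, code_words, noise_words=NOISE_WORDS):
    """Return (word, is_unused) pairs from the requirements text.

    Words not yet used in the code come first, then more frequent words,
    then alphabetical order to keep the result stable.
    """
    counts = Counter(
        w for w in re.findall(r"[a-z]+", domain_text.lower()) if w not in noise_words
    )
    pairs = [(word, word not in code_words) for word in counts]
    return sorted(pairs, key=lambda p: (not p[1], -counts[p[0]], p[0]))


# Invented requirements text and code vocabulary, for illustration only.
text = "The passenger reserves a seat on the flight. The seat is held for the passenger."
for word, unused in rename_candidates(text, {"flight", "booking"}):
    print(word, "(not used yet)" if unused else "")
```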

A refactored version of the code would show a much higher conceptual correlation. E.g.,

The devil's in the detail, as always. Would the tool need to make non-exact correlations, for example? Would "seat" and "seating" be a match? Or a partial match? Also, would the strength of the correlation matter? Maybe "seat" appears in the requirements text many times, but only once in the code. Should that be treated as a weaker correlation? And what about words that appear together? Or would that be making it too complicated? Methinks a simple spike might answer some of these questions.
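
One way a spike might explore the "seat" versus "seating" question is to compare crude word stems rather than exact words. The suffix stripper below is deliberately naive and entirely my own assumption; a proper stemming algorithm would do better, and whether stem matches should count as full or partial matches is still an open question.

```python
def crude_stem(word):
    """Strip a few common English suffixes. Deliberately naive."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three letters so short words aren't mangled.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


def is_match(code_word, domain_word):
    """Treat two words as correlated if their crude stems are identical."""
    return crude_stem(code_word) == crude_stem(domain_word)


print(is_match("seat", "seating"))
print(is_match("seat", "flight"))
```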
