February 21, 2008


Would File Compression Reveal Extent Of Code Duplication?

Here's my thought for the day.

If I were writing a text file compression algorithm, I might start by factoring out duplicate blocks of text (e.g., words or phrases that are used repeatedly) and replacing them with short symbols that tell me what text to reinsert when I decompress the file.
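To make that concrete, here's a toy sketch in Python. It isn't how any real compressor works; the names `compress` and `decompress` and the `~0`, `~1` symbol scheme are made up for illustration, and it assumes the input contains no words that already start with `~`.

```python
from collections import Counter


def compress(text):
    """Toy dictionary coder: swap repeated words for short symbols."""
    words = text.split(" ")
    table = {}
    for word, count in Counter(words).most_common():
        symbol = f"~{len(table)}"  # hypothetical symbol scheme: ~0, ~1, ...
        # Only replace words that repeat and are longer than their symbol.
        if count > 1 and len(word) > len(symbol):
            table[word] = symbol
    encoded = " ".join(table.get(word, word) for word in words)
    return encoded, table


def decompress(encoded, table):
    """Reverse the substitution using the same lookup table."""
    reverse = {symbol: word for word, symbol in table.items()}
    return " ".join(reverse.get(word, word) for word in encoded.split(" "))
```

Note that the sketch ignores the cost of storing the lookup table alongside the encoded text, which a real format would have to count.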

In that sense, compression works by removing duplication.

Which sounds a bit like refactoring, doesn't it? Well, it does a little to me, anyway.

And if I wanted to estimate how much duplication there is in a text file, I might think about comparing the size of the compressed file to the uncompressed file. The greater the amount of compression, surely the greater the duplication must be?
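As a rough sketch of that comparison, assuming Python's standard `zlib` module stands in for "my compression algorithm" (the function name `compression_ratio` is my own):

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; lower suggests more repetition."""
    if not data:
        return 1.0  # nothing to compress, so call it "no saving"
    return len(zlib.compress(data, 9)) / len(data)
```

One caveat: zlib also squeezes out savings from skewed character frequencies, not just repeated blocks, so the ratio is at best a proxy for duplication rather than a direct measure of it.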

And that got me to wondering: would that work on source code files, too? If I ran my compression algorithm over a Java project and compared the size of the compressed file to the uncompressed source folders, would that give me a rough - or even a relative - indication of how much repetition there is in my code?
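Here's how I might try it, as a minimal sketch rather than a proper tool. It assumes zlib as the compressor and simply glues every `.java` file under a folder into one blob; `project_compression_ratio` is a hypothetical name.

```python
import zlib
from pathlib import Path


def project_compression_ratio(root, suffix=".java"):
    """Concatenate all source files under root; return compressed/original size."""
    paths = sorted(p for p in Path(root).rglob(f"*{suffix}") if p.is_file())
    blob = b"".join(p.read_bytes() for p in paths)
    if not blob:
        raise ValueError(f"no {suffix} files found under {root}")
    return len(zlib.compress(blob, 9)) / len(blob)
```

Since zlib only looks back over a 32 KB window, duplicates that sit far apart in a large project would go unnoticed, so I'd trust this more for comparing two snapshots of the same codebase than as an absolute figure.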

I could be talking complete nonsense, of course. But it's just a thought.

Posted on February 21, 2008