Nat Pryce wrote a fun little library the other day called code-words. It rips your source into words and turns the words into a word cloud: a visual representation of the most common words in your source, with font size showing how common each term is. The aim is to give an introduction to the concepts the code speaks about. And in playing around with it I noticed something interesting.
I thought I’d take a look at EventMachine. Initially I ran code-words against the whole of the EventMachine repository, and it seems that words like “test” and “assert” are really important? Oh, right. A sign of good test coverage. OK, so running code-words just against the lib directory gave something a bit more useful:
Something looks a bit odd. OK, sure, I’d expect “event” to feature prominently – EventMachine is a Reactor, after all. And “data” is a very common word; it’d probably feature in every word cloud. But I use EventMachine for network stuff, so I’d expected to see words like “connection” or “http” featuring prominently. Looking at code-words’ source, I see that comments are ignored. So just for kicks I hacked the code a bit, and produced another word cloud from the comments alone. And look what happened:
Suddenly “connection” features rather prominently! Clearly, connections are things worthy of comment. OK, so what happens if I run code-words on both source and comments?
Hm, back to the old stand-bys. What conclusions might we draw from this? Partly that the amount of comments is dwarfed by the amount of code: not surprising. Only deep magic requires seriously more comment than code. Mainly I find it interesting that the language used to describe the code and the language used in the code look rather different. How different is the language of the comments from the language of the code in your codebase?
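If you want to poke at that question without a word cloud, here is a rough sketch of the same comparison in Python. It is not how code-words works internally; the function names are made up for illustration, and the comment detection is deliberately naive: everything after the first `#` on a line counts as comment, so Ruby string interpolation like `"#{x}"` gets misfiled and `=begin`/`=end` blocks are treated as code.

```python
import re
from collections import Counter

# Words are runs of three or more letters; this also splits snake_case names.
WORD = re.compile(r"[A-Za-z]{3,}")


def split_ruby_source(source):
    """Naively separate code from '#' comments, line by line (assumption:
    no '#' inside strings, no =begin/=end blocks)."""
    code_parts, comment_parts = [], []
    for line in source.splitlines():
        code, _, comment = line.partition("#")
        code_parts.append(code)
        comment_parts.append(comment)
    return "\n".join(code_parts), "\n".join(comment_parts)


def word_counts(text):
    """Count lower-cased words, breaking camelCase names apart first."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return Counter(w.lower() for w in WORD.findall(spaced))


# A tiny hypothetical sample, not taken from EventMachine itself.
sample = '''\
# Called when the connection receives data from the remote connection.
def receive_data(data)
  send_data(data)  # echo it back
end
'''

code_text, comment_text = split_ruby_source(sample)
code_words = word_counts(code_text)
comment_words = word_counts(comment_text)

print("code:    ", code_words.most_common(3))
print("comments:", comment_words.most_common(3))
```

Even in this toy sample, “data” dominates the code while “connection” only shows up in the comments. To run it over a real project you would read each file under `lib` and sum the two counters.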