You start off meaning to contribute a tiny tweak that you'll then forget about, and you end up spending hours writing reams of multithreaded C code.
A friend posted to Twitter asking for help optimizing his code to find double word squares (grids of letters in which every row and every column is a dictionary word). I spotted a small optimization: instead of copying his working state at each step of his recursive search, he could undo his changes on the way back out and pass around a reference to a single shared data structure. A 10% boost!
However, not long after that it started to look like we'd nearly exhausted the room for optimization in Python, and we were nowhere near able to solve the more difficult cases. The inner loop was simple enough, and conveniently did no direct string manipulation, only juggling references, so it was easy to convert to C. It still seemed like I might stop there. But no: then came the multithreading (very hacky at first, then cleaned up), and the replacement of the implausibly huge hash table with direct calculation at the point of use, a big enough saving in memory that we could hope to fit in L3 cache.
Finally, it was optimized enough that, running for less than a day, it could handle any word length. I can report that with our dictionary there are no double word squares for words of 9 letters or more, and for 8 letters all 42 double word squares are word squares (i.e. symmetric about the leading diagonal). Rather sad!