Auto-generating LShift blog posts

Posted on April 26, 2013

I’ve often found myself at a loss for blog post topics, so rather than write one myself I decided to let a computer do the heavy lifting!

Markov chains offer a neat trick for generating surrealist blog oeuvres. They work by estimating, from a suitable corpus of input material, the probability of one word following another.

The meat of the algorithm is surprisingly simple. Given a sequence of tokens (words and punctuation characters), you build a mapping between tokens and the frequencies of the tokens that appear one step to their right.

To give it a whirl, clone the GitHub repo and make sure you have Leiningen installed. Here’s a fairly noddy example:

$ lein repl
user=> (use 'blog-o-matic.core)
user=> (build-frequency-map ["the" "cat" "sat" "on" "the" "mat"])
{"on" {"the" 1}, "sat" {"on" 1}, "cat" {"sat" 1}, "the" {"mat" 1, "cat" 1}}
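
For the curious, here is a minimal sketch of how a map like the one above could be built. It is not the code from the repo: the name sketch-frequency-map is made up for illustration, and the two-argument arity is an assumption that lets the function double as a reducing function over several token sequences.

```clojure
;; Hypothetical sketch, not the blog-o-matic implementation.
(defn sketch-frequency-map
  "For each token, counts how often each other token appears directly after it.
  The two-argument arity folds `tokens` into an existing map `freqs`."
  ([tokens] (sketch-frequency-map {} tokens))
  ([freqs tokens]
   (reduce (fn [m [current following]]
             ;; Bump the count for `following` under `current`, starting from 0.
             (update-in m [current following] (fnil inc 0)))
           freqs
           ;; Sliding window of adjacent pairs: (a b) (b c) (c d) ...
           (partition 2 1 tokens))))

(sketch-frequency-map ["the" "cat" "sat" "on" "the" "mat"])
;; => {"the" {"cat" 1, "mat" 1}, "cat" {"sat" 1}, "sat" {"on" 1}, "on" {"the" 1}}
;; (entry order may differ when printed)
```

Dividing each count by the sum of the counts in its inner map turns the frequencies into the transition probabilities of the chain.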

Using the frequency map, you derive a new stream of tokens by starting from a particular token and following the trail, making a weighted random choice among the available next tokens at each step. The random-next-token function takes care of this. There are also some helpers for stitching the tokens back together and spitting out sentences. Here’s a sample run based on the last twenty blog posts:

user=> (require '[blog-o-matic.scrape :as scrape])
user=> (def posts (take 20 (scrape/fetch-posts)))
user=> (def posts-tokenised (map scrape/tokenise posts))
user=> (def freqs (reduce build-frequency-map {} posts-tokenised))
user=> (first (random-sentences freqs))
"With github-differ, I won't regret it to serve as mentioned on BitBucket."
user=> (apply str (interpose " " (take 3 (random-sentences freqs))))
"Partly that stuff, and relatedly, by MongoDB on a sense therefore that spits
out the above. Android users won't just inserted into the rub: The app wants
your submitter claims to click around to define your table and do. We're using
a provisional response codes, and take a good work."
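
The weighted random choice that drives the stream can be sketched roughly as follows. This is an illustration under assumptions, not the repo's random-next-token: the names sketch-weighted-choice and sketch-token-stream are hypothetical, and the explicit roll argument exists only so the choice can be exercised deterministically.

```clojure
;; Hypothetical sketch, not the blog-o-matic implementation.
(defn sketch-weighted-choice
  "Picks a key from `weights` (a non-empty map of token -> positive count)
  with probability proportional to its count.
  `roll` is a number in [0, total); by default it is drawn at random."
  ([weights] (sketch-weighted-choice weights (rand (reduce + (vals weights)))))
  ([weights roll]
   (loop [[[token weight] & more] (seq weights)
          remaining roll]
     ;; Walk the entries, subtracting each weight until the roll falls inside one.
     (if (or (< remaining weight) (empty? more))
       token
       (recur more (- remaining weight))))))

(defn sketch-token-stream
  "Lazy sequence of tokens beginning at `start`. It ends when it reaches a
  token that has no recorded successors in `freqs`."
  [freqs start]
  (take-while some?
              (iterate (fn [token]
                         (when-let [nexts (not-empty (get freqs token))]
                           (sketch-weighted-choice nexts)))
                       start)))
```

Because a chain with cycles can wander for a long time, it is safest to cap the stream, for example with (take 30 (sketch-token-stream freqs "the")).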

The devil is always in the details, and Markov chains are sensitive to the peculiarities of the input text. As is so often the case in the software world, 99% of the code is munging data into shape and 1% is a nifty algorithm. LShift blog posts present a challenge because you’ll find identifiers, magic numbers and great heaving code stanzas slap-bang in the middle of a sentence. Those bits have to be stripped out or the algorithm will get lost down a blind alley of symbology.
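
To give a flavour of the munging, here is one crude way to throw away code-like tokens. It is only a sketch and not what the repo's tokenise does: the function names and both regular expressions are assumptions chosen for illustration.

```clojure
;; Hypothetical sketch, not the blog-o-matic implementation.
(defn sketch-prose-token?
  "True for tokens that look like ordinary words (optionally with internal
  apostrophes or hyphens) or like a single punctuation mark."
  [token]
  (boolean (re-matches #"[A-Za-z]+(?:['’-][A-Za-z]+)*|[.,;:!?]" token)))

(defn sketch-tokenise
  "Splits `text` into words and punctuation marks, then drops anything that
  does not look like prose, such as identifiers and numbers."
  [text]
  (->> (re-seq #"[.,;:!?]|[^\s.,;:!?]+" text)
       (filter sketch-prose-token?)))

(sketch-tokenise "Call foo_bar(42) now, won't you?")
;; => ("Call" "now" "," "won't" "you" "?")
```

A filter this blunt has obvious holes: a decimal such as 3.14 loses its digits but leaves a stray full stop behind, which is exactly the sort of peculiarity that sends the chain off in odd directions.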

You may need to run it a few times before it produces anything giggle-worthy. Automated silliness detection is left as an exercise for the reader…

