There is an improv game called one-word-at-a-time. It goes like this: two (or more) people collaboratively compose a story, each adding just one word at a time.
It sounds easy, but humans can actually be pretty bad at this game! Two of the rules of improv are “be obvious” and “accept offers”. In the context of generating a story, you ideally want to choose the next obvious word given the words so far. That’s a better strategy than trying to come up with a particularly funny or smart word, or trying to fit in that really great idea you had five words ago, even if it no longer fits the story. With humans, there is a danger of each participant building their own vision for the story: you end up with two people holding two different stories in their heads, while the actual story doesn’t progress.
Given that being simple and obvious is an advantage in this game, I thought I might have a chance at writing a program that is a good enough player. All it has to do is look at the words in the story so far and produce a simple, syntactically correct suggestion.
My first idea was to use a Markov chain. I needed some text to build it with, and since the goal is to create a story, I put together a dataset of public domain children’s stories from Project Gutenberg.
Creating Markov chains is fast. Reading a 10MB input file, building the chain and using it to generate 1,000 words takes 3 seconds on my laptop.
I’ll briefly explain how the Markov chain is built. First we need to decide the chain’s state size. A state size of 3 words means we predict the next word based solely on the previous 3 words. To build such a chain we iterate over the whole input text and build a dictionary whose keys are all the 3-word sequences and whose values are the words that immediately follow them. Actually it’s slightly more complex than that: since a 3-word sequence can appear multiple times in the text, instead of storing just the one word that follows, we store all the possible following words together with their frequencies. Then, to predict the next word given a 3-word sequence, we randomly choose one of the options, weighted by frequency.
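The two steps above (building the frequency dictionary, then sampling from it) can be sketched in Go roughly like this. This is a minimal word-level version for illustration, not my actual implementation; the state is simply the 3 previous words joined with spaces:

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

const stateSize = 3

// chain maps a 3-word state (joined with spaces) to the words that
// follow it in the training text, with their frequencies.
type chain map[string]map[string]int

// build scans the input text and records, for every 3-word sequence,
// which word follows it and how often.
func build(text string) chain {
	words := strings.Fields(text)
	c := chain{}
	for i := 0; i+stateSize < len(words); i++ {
		state := strings.Join(words[i:i+stateSize], " ")
		if c[state] == nil {
			c[state] = map[string]int{}
		}
		c[state][words[i+stateSize]]++
	}
	return c
}

// next picks a follower of state at random, weighted by frequency.
// The boolean is false when the state was never seen in training.
func (c chain) next(state string) (string, bool) {
	followers, ok := c[state]
	if !ok {
		return "", false
	}
	total := 0
	for _, n := range followers {
		total += n
	}
	r := rand.Intn(total)
	for w, n := range followers {
		r -= n
		if r < 0 {
			return w, true
		}
	}
	return "", false
}

func main() {
	c := build("the cat sat on the mat and the cat sat on the floor")
	w, ok := c.next("the cat sat")
	fmt.Println(w, ok) // "on" is the only follower of this state
}
```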
I wrote one in Go (based on this example, with some additions). I also tried some other libraries with extra features, such as preventing the generated text from being too close to the original, or letting you tag each word with its syntactic part of speech using NLTK.
Here’s an example generated text from the stories dataset using a state size of 2 words:
He had always supposed that real friendship is formed by wallowing as they turned again to a huge door composed entirely of diamonds. And day by day. After you fill your bags. Ali Baba followed them barking, and began to cry. So at last to the bean-stalk, he found carrying food to eat, he told his guest in the grass, and she fell down dead. But Jack did have a wonderful sense of moving in the place was topsy-turvy.
That’s not too bad. Next let’s try the interactive version of the game, generating one word at a time. Here’s an example of the “gameplay” (italic words are my input):
The big ducks were
### Hit a dead end! ###
This game is not fun at all. It can hardly progress for a few words before hitting a dead end. This happens because the chain’s dictionary has no key for this exact three-word sequence, and it happens no matter how boring and expected the user’s words are.
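Conceptually the dead end is nothing more than a failed dictionary lookup. A tiny sketch in Go (the chain contents here are made up for illustration):

```go
package main

import "fmt"

func main() {
	// A toy chain: each state seen in training maps to its followers.
	chain := map[string]map[string]int{
		"once upon a": {"time": 5},
		"upon a time": {"there": 4, "the": 1},
	}
	// The player steers the story into a state the training text
	// never contained, so the lookup misses and the game stops.
	state := "big ducks were"
	if _, ok := chain[state]; !ok {
		fmt.Println("### Hit a dead end! ###")
	}
}
```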
My next attempt was using a recurrent neural network. I was inspired by Karpathy’s excellent blog post. RNNs can learn the patterns in the text, including things like its shape and where to close quotations and parentheses, and can even produce plausible-looking source code when trained long enough. Essential to this success is the use of LSTMs (Long Short-Term Memory networks), a special kind of neural network which you can read more about in this post. I don’t yet have a deep understanding of how RNNs work, and for this exercise I’m treating them as a black box that I can use for text generation.
I found a TensorFlow implementation of char-rnn, based on Karpathy’s implementation. This is a character-level model, which means it predicts one character at a time.
Next we need to train the neural network. This was a fair bit trickier than creating a Markov chain. If the input text file is smaller than about 1MB (that’s ~1 million characters), the RNN takes “only” an hour or two to train, but that amount of data is not enough to train it effectively, so it will output something that is almost English but not quite. Like this:
“Not again. Gut Jan. Some came from the imoniness could signess is!”
Kramer stoodorment voicery.
“Every mittantly, if any_,” he said.
And as the size of the input text grows, so does the training time. There are also a few parameters that might need tuning to get better results depending on the size of the dataset, like rnn_size and num_layers (some discussion here).
I ended up training the network with a 3.6MB text file (by trimming my original dataset, which was closer to 10MB). I increased the default num_epochs and rnn_size and decreased the default batch_size, hoping this would give me better results. Then I sat back and waited for almost 2 days! Wow, that was long; the result had better be good. Here is a sample:
Then suddenly her women said, “I see,” replied Aladdin, “I am a hopeful Hermione’s pocket particularly to seek him; and so give me what it is. This has denied three signal to my mother you are!”
But he compared and white as snow, and then stood so tightly.
“Never in fair Self Care of all the World!” exclaimed Hu-lin’s command. “Thou art yet are turned into us.” So she used to fear the bones to contradict the woodman. She was going to be able to take him towards the hole, so that he could see how they were to die afraid about the sitting gloom, an admiration estated with the youth.
“There’s a tiger,” ejected Joachim; “and now the old witch can be off in the thickest ogre!”
This still feels like it needs more training, but it’s much more English-like compared to the first sample, at least at the word level.
Because the training time is so long I couldn’t experiment much with the various parameters.
I think the best way to go, if you don’t have a computer with a CUDA-compatible GPU, is to use an Amazon EC2 instance. There are g2 and p2 instances with anywhere from 1 to 16 GPUs, and you can pay for a few hours of usage with a spot instance.
I actually tried to do that, but even though I had all the CUDA dependencies installed as far as I could tell, I was getting a TensorFlow error about not being able to initialise the CUDA driver. After spending a good amount of time trying to fix it, I decided I could use that time to slowly train a neural network on my laptop instead. But if I were doing this more often, I would invest the time to set up a GPU instance properly.
Finally, let’s try the one-word-at-a-time version. I modified the code so that the neural network gives me just one word at a time and then prompts the user for a word, ad infinitum. The model uses all of the text in the story so far as the “prime” text for its next prediction.
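My modification lives inside the char-rnn sampling code, but the shape of the game loop itself is simple. Here’s a sketch of that loop in Go, with a placeholder nextWord function standing in for the trained model (hypothetical; in the real version this call goes to the network, priming it with the full story so far):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// nextWord stands in for the model: given the whole story so far as
// the "prime" text, it returns a predicted next word. Hypothetical
// placeholder — the real version samples from the trained char-rnn.
func nextWord(story []string) string {
	return "something"
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	var story []string
	for {
		// Model's turn: predict one word from the full story so far.
		story = append(story, nextWord(story))
		fmt.Println(strings.Join(story, " "))

		// Human's turn: read one word from the prompt.
		fmt.Print("your word> ")
		if !scanner.Scan() {
			break
		}
		story = append(story, strings.TrimSpace(scanner.Text()))
	}
}
```

Because the prediction is primed on everything typed so far, the loop never gets stuck the way the Markov chain did.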
Here is an example (italic words are my input):
Once upon a time there was a handsome young man named Ku-land. Ku-land had five hundred thousand mules, six horses, seven little sheep cry, and was very wise:
That’s much better! The nice thing about it is that it never dead-ends: even if you make words up, it can always come up with a suggestion for the next word. The game is still difficult to play, though, as the suggestions are not always syntactically correct or easy to continue from.
So, can you replace your human improv players with a program? No, not yet. But if you wait a bit longer and train it with more GPUs, maybe!