SurpriseHaiku - Discovering Unusual Haikus On Twitter

SurpriseHaiku was an experiment that parsed random tweets to see whether they followed the haiku cadence of 5 / 7 / 5 syllables. I ran it during the 2014 Olympics and tried to focus on Olympics-related tweets. I built the application to learn more about the Twitter API and to dip my toes into the world of natural language processing (NLP).

While the application is no longer running, you can check out the historical tweets on the Twitter profile.

Methodology

The simplest explanation is:

Normalize and tokenize the tweet text
Split each word into syllables
Check whether the syllables fall into a 5/7/5 combination
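
Here is a minimal sketch of those three steps, not the original code. The names `tokenize`, `naive_syllables`, and `split_haiku` are made up for illustration, and the vowel-group counter is only a crude stand-in for a real syllable splitter. The check requires line breaks to fall between words, which is an assumption about how the 5/7/5 test should behave.

```python
import re

def tokenize(text):
    """Lowercase the tweet, drop URLs, and keep word-like tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.findall(r"[a-z']+", text)

def naive_syllables(word):
    """Rough stand-in counter: one syllable per group of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def split_haiku(words, count=naive_syllables, pattern=(5, 7, 5)):
    """Return three lists of words if the syllables fall exactly 5/7/5, else None."""
    lines, current, total, target = [], [], 0, 0
    for word in words:
        if target == len(pattern):
            return None  # words left over after the third line
        current.append(word)
        total += count(word)
        if total > pattern[target]:
            return None  # a word straddles a line break
        if total == pattern[target]:
            lines.append(current)
            current, total = [], 0
            target += 1
    return lines if target == len(pattern) and not current else None

poem = "The old pond is still, a frog jumps into the pond. Splash the sound again"
print(split_haiku(tokenize(poem)))
```

Passing the syllable counter in as `count` keeps the 5/7/5 logic separate from the hard part, which is counting syllables well.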

Identifying syllables in the English language is tough. We've got a lot of different ways of pronouncing things, not to mention crazy spellings and the presence of other languages. I used two methods to split up syllables:

Use a dictionary of hyphenated words
Use a hyphenator

I then compared the return values. If there was a dictionary entry I used it; if not, I used whatever the hyphenator spat out.
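
The dictionary-first fallback can be sketched like this. Everything here is hypothetical: `SAMPLE_DICTIONARY` is a two-entry toy, and `toy_hyphenate` is a stand-in that splits after each vowel group rather than a real hyphenation algorithm. With PyHyphen, something like `Hyphenator('en_US').syllables` should fit the `hyphenate` slot, but treat that as an assumption and check the library's documentation.

```python
import re

# Hypothetical entries for illustration only.
SAMPLE_DICTIONARY = {
    "baseball": ["base", "ball"],
    "olympics": ["o", "lym", "pics"],
}

def toy_hyphenate(word):
    """Stand-in hyphenator: break after each vowel group. Not a real algorithm."""
    return re.findall(r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]+$)?", word)

def syllables(word, dictionary, hyphenate):
    """Prefer the hyphenation dictionary; fall back to the hyphenator."""
    key = word.lower()
    if key in dictionary:
        return dictionary[key]
    parts = hyphenate(key)
    # A hyphenator may return nothing for short or odd tokens; count those as one syllable.
    return parts if parts else [key]

def count_syllables(word, dictionary, hyphenate):
    return len(syllables(word, dictionary, hyphenate))

print(count_syllables("Baseball", SAMPLE_DICTIONARY, toy_hyphenate))
print(syllables("putin", SAMPLE_DICTIONARY, toy_hyphenate))
```

The toy hyphenator on its own would split "baseball" into three pieces, which is the kind of mistake the dictionary lookup is there to catch.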

Tools

The tools I used in this project are:

Python
PyHyphen, a Python wrapper around the hyphenation library used by LibreOffice and Firefox
Tweepy for my Twitter API integration
Project Gutenberg Etext of Moby Hyphenator by Grady Ward (basically a dictionary of hyphenated words)

Notable Examples

Although not every tweet was successful, there were definitely some notable examples.

This one really seems to flow. The sentence composition doesn't line up perfectly, but the syllables do.

A #haiku: https://t.co/Nn55PyBqab
Lets Show Why Baseball Is The Best Sport RT if you want Baseball back
— SurpriseHaiku (@surprisehaiku) February 10, 2014

I also ran this experiment during the Olympics, and this one gave me a chuckle.

A #haiku: https://t.co/zei5yF4E2D
every time they show putin i get really uncomfortable
— SurpriseHaiku (@surprisehaiku) February 10, 2014

As did this one.

A #haiku: https://t.co/2PpXgTSshU
Russia Putin Poses as Defender of Christian Civilization
— SurpriseHaiku (@surprisehaiku) February 10, 2014

Challenges

As many people who have worked with natural language processing know, it's really difficult to get right. Marti Hearst, my NLP professor at UC Berkeley, stressed that you can never get it all right. The more I've experimented, the more I've come to realize she is right.

The challenges with this project are in NLP. Think of it this way: how do we know that RT means retweet without explicitly telling the computer that? Fundamentally, how do we cross the semantic gap? How do we communicate semantics to a machine?

Improvements and Conclusions

I think there are a number of improvements that could be made to this project - but this was a short experiment, not a complete study.

First, I think I could use some semantic analysis to prevent tweets like this:

A #haiku: https://t.co/dnq4ji8nSw
I just dont see how school is gonna be a part of my life with the
— SurpriseHaiku (@surprisehaiku) February 10, 2014

The problem with this haiku is that it ends in "the". There's no continuity and it doesn't end logically. We could improve this with part-of-speech tagging and by rejecting candidates that end on certain words.
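
As a rough sketch of the trailing-word check, here is a version that swaps part-of-speech tagging for something simpler: a small, hand-picked set of function words. `DANGLING_ENDINGS` and `ends_cleanly` are hypothetical names, and the word list is an assumption, not something taken from the project. A real tagger would generalize this to whole word classes such as determiners and prepositions.

```python
# Hypothetical, hand-picked list; a part-of-speech tagger would cover far more cases.
DANGLING_ENDINGS = {
    "the", "a", "an", "and", "or", "but", "of", "to",
    "with", "in", "on", "for", "my", "your",
}

def ends_cleanly(words):
    """Reject candidates whose last word leaves the thought hanging."""
    return bool(words) and words[-1].lower() not in DANGLING_ENDINGS

bad = "I just dont see how school is gonna be a part of my life with the".split()
good = "every time they show putin i get really uncomfortable".split()
print(ends_cleanly(bad), ends_cleanly(good))
```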

Other improvements that could be helpful include automating deployment, ignoring tweets with a lot of gibberish or words that are likely to be wrong, automating the addition of new words to the dictionary, and adding plurals to the dictionary.
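
One way the gibberish filter might look, as a sketch under assumptions: treat a tweet as suspect when too few of its words appear in the hyphenation dictionary. The function names and the 0.8 threshold are made up for illustration and would need tuning against real tweets.

```python
def known_ratio(words, dictionary):
    """Fraction of words that have a dictionary entry."""
    if not words:
        return 0.0
    known = sum(1 for word in words if word.lower() in dictionary)
    return known / len(words)

def looks_like_gibberish(words, dictionary, threshold=0.8):
    """Flag tweets where fewer than `threshold` of the words are known."""
    return known_ratio(words, dictionary) < threshold

# Hypothetical dictionary keys for illustration only.
sample_words = {"i", "get", "really", "uncomfortable"}
print(looks_like_gibberish(["I", "get", "lolol", "really"], sample_words))
```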