SurpriseHaiku - Discovering Unusual Haikus On Twitter
SurpriseHaiku was an experiment that parsed random tweets to see if they followed the haiku cadence of 5 / 7 / 5 syllables. I ran this experiment during the 2014 Olympics so I could focus on Olympics-related tweets. I built this application to learn more about the Twitter API and to dip my toes into the world of Natural Language Processing, or NLP.
While the application is no longer running, you can check out the historical tweets on the Twitter profile.
The simplest explanation of how it works is:
- Normalize and tokenize the tweet text
- Split each word into syllables
- See if the syllable counts give us a 5/7/5 combination
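The three steps above can be sketched roughly like this. This is not the original code: `naive_syllable_count`, `tokenize`, and `split_haiku` are hypothetical names, and the vowel-group counter is only a crude stand-in for a real syllable splitter.

```python
import re


def naive_syllable_count(word):
    """Rough stand-in: count runs of consecutive vowels (at least 1)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def tokenize(text):
    """Lowercase the text and keep alphabetic words, allowing one apostrophe."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def split_haiku(text, count=naive_syllable_count, pattern=(5, 7, 5)):
    """Return the haiku lines if the text fits the pattern exactly, else None."""
    lines, current, total = [], [], 0
    for word in tokenize(text):
        if len(lines) == len(pattern):
            return None  # words left over after the final line
        current.append(word)
        total += count(word)
        target = pattern[len(lines)]
        if total == target:
            lines.append(" ".join(current))
            current, total = [], 0
        elif total > target:
            return None  # a word straddles a line break
    if len(lines) == len(pattern) and not current:
        return lines
    return None


haiku = split_haiku(
    "The cat sat on mats, a dog ran in the big park, then it went to sleep"
)
```

Passing the counter in as `count` makes it easy to swap the naive heuristic for something more accurate without touching the 5/7/5 check.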
Identifying syllables in the English language is tough. We've got a lot of different ways of pronouncing things, not to mention crazy spellings and the presence of words from other languages. I used two methods to split up syllables:
- Use a dictionary of hyphenated words
- Use a hyphenator
I then compared the results: if a word had a dictionary entry I used it, and if not I used whatever the hyphenator spat out.
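The dictionary-first, hyphenator-second lookup might be sketched as below. The names `syllables_for`, `SAMPLE_DICTIONARY`, and `crude_hyphenator` are made up for illustration, and the dictionary entries are hand-written samples rather than real data from the hyphenated word list. In practice the fallback would presumably be something like PyHyphen's `Hyphenator('en_US').syllables`, but a placeholder is used here so the snippet runs without extra installs.

```python
# Hypothetical in-memory sample; a real run would load the hyphenated
# word list from disk into a dict like this.
SAMPLE_DICTIONARY = {
    "olympics": ["o", "lym", "pics"],
    "hockey": ["hock", "ey"],
}


def crude_hyphenator(word):
    """Placeholder fallback that treats the whole word as one syllable."""
    return [word]


def syllables_for(word, dictionary, fallback):
    """Prefer the dictionary's split; otherwise use the hyphenator's."""
    key = word.lower()
    if key in dictionary:
        return dictionary[key]
    return fallback(key)


known = syllables_for("Olympics", SAMPLE_DICTIONARY, crude_hyphenator)
unknown = syllables_for("Sochi", SAMPLE_DICTIONARY, crude_hyphenator)
```

The syllable count for a word is then just `len(syllables_for(...))`, which is all the 5/7/5 check needs.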
The tools I used in this project are:
- PyHyphen, a Python wrapper around the hyphenation library used by LibreOffice and Firefox
- Tweepy for my Twitter API Integration
- Project Gutenberg Etext of Moby Hyphenator by Grady Ward (basically a dictionary of hyphenated words)
Although not every tweet was successful, there were definitely some notable examples.
This one really seems to flow. Although the sentence composition doesn't line up perfectly, the syllables do.
I also ran this experiment during the Olympics, and this one gave me a chuckle.
As did this one.
As many people know, natural language processing is really difficult to get right. Marti Hearst, my NLP professor at UC Berkeley, has stressed that you can never get it all right. The more I've experimented, the more I've come to realize she is right.
The challenges with this project are in NLP. Think of it this way: how do we know that RT means retweet without explicitly telling the computer that? Fundamentally, how are we crossing the semantic gap? How do we communicate semantics to a machine?
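To make the RT example concrete, here is what "explicitly telling the computer" can look like: a handful of hand-written rules. This is only a sketch. `normalize_tweet` is a hypothetical helper, and the choice to drop hashtags and mentions entirely is an assumption, since a hashtag can just as easily be a real word in the sentence.

```python
import re


def normalize_tweet(text):
    """Strip Twitter-specific tokens that are not part of the sentence."""
    # Leading retweet marker such as "RT @user:" (a bare "RT" is left alone)
    text = re.sub(r"^\s*RT\s+@\w+:?\s*", "", text)
    # Links
    text = re.sub(r"https?://\S+", " ", text)
    # Remaining @mentions and #hashtags
    text = re.sub(r"[@#]\w+", " ", text)
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())


cleaned = normalize_tweet("RT @nbc: Gold for Canada! http://t.co/abc #Sochi2014")
```

Every rule here is knowledge a person supplied by hand, which is exactly the semantic gap: the machine only knows what RT means because someone wrote a regex for it.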
Improvements and Conclusions
I think there are a number of improvements that could be made to this project - but this was a short experiment, not a complete study.
Firstly, I think I could use some semantic analysis to prevent tweets like this:
The problem with this haiku is that it ends in "the". There's no continuity and it doesn't end logically. We could improve this with part-of-speech tagging and by rejecting haikus that end on certain words.
Other improvements that could be helpful include automating deployment, ignoring tweets with a lot of gibberish or words that are likely to be wrong, automating the addition of new words to the dictionary, and adding plurals to the dictionary.
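A cheap version of that last-word check could look like the sketch below. It uses a small hand-picked word list as a stand-in for real part-of-speech tagging, so it is not a tagger. `DANGLING_ENDINGS` and `ends_cleanly` are hypothetical names, and the list is illustrative rather than complete.

```python
# Function words that leave a sentence hanging when they come last.
# Illustrative only; a part-of-speech tagger would generalize this.
DANGLING_ENDINGS = {
    "the", "a", "an", "and", "or", "but",
    "of", "to", "in", "on", "at", "for", "with",
}


def ends_cleanly(haiku_lines):
    """Return False if the haiku is empty or its last word is a dangling function word."""
    if not haiku_lines:
        return False
    words = haiku_lines[-1].lower().split()
    if not words:
        return False
    last = words[-1].strip(".,!?;:\"'")
    return last not in DANGLING_ENDINGS


bad = ends_cleanly(["line one", "line two", "we walked into the"])
good = ends_cleanly(["line one", "line two", "then it went to sleep"])
```

A filter like this would run after the 5/7/5 check and before posting, so a tweet that fits the cadence but trails off gets dropped.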