This post is meant as a summary of many of the concepts that I learned in Marti Hearst's Natural Language Processing class at the UC Berkeley School of Information. I wanted to record the concepts and approaches I learned, along with quick overviews of the code you need to get them working. I figured it might help other people get a handle on the goals and the code needed to get things done.
I would encourage anyone else to take a look at the Natural Language Processing with Python book and to read more about scikit-learn.
Tokenization
The goal of tokenization is to break up a sentence or paragraph into specific tokens or words. We basically want to convert human language into a more abstract representation that computers can work with.
Sometimes you want to split text sentence by sentence, and other times you just want to split it into words.
Sentence Tokenizers
import nltk

sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
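For example, here's roughly how you'd use it once it's loaded (the sample text is just something I made up for illustration):
sample = "Hello world. This is a second sentence. And a third!"
print sent_tokenizer.tokenize(sample)
# ['Hello world.', 'This is a second sentence.', 'And a third!']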
Word Tokenizers
Here's a popular word regular expression tokenizer from the NLTK book that works quite well.
tokenization_pattern = r'''(?x)      # set flag to allow verbose regexps
      ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*          # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
    | \w+[\x90-\xff]      # these are escaped emojis
    | [][.,;"'?():-_`]    # these are separate tokens
'''
word_tokenizer = nltk.tokenize.regexp.RegexpTokenizer(tokenization_pattern)
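And a quick sketch of it in action (again, the sentence is just made up for illustration):
words = word_tokenizer.tokenize("That U.S.A. poster-print costs $12.40 today.")
print words # a list of word-level tokens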
Part of Speech Tagging
Once you've tokenized the sentences, you need to tag them. Tagging is not necessary for all purposes, but it does help the computer better understand the objects and references in your sentences. Remember, our goal is to encode semantics, not words, and tagging can help us do that.
Unfortunately, this is an imperfect science: it's never going to work out perfectly, because so many sentences have more than one valid reading. Let me show you what I mean, using a couple of comical garden path headlines.
The first headline is comical because it has two readings: death can happen more slowly than we thought (as in, we had an expectation of death happening at a certain rate, and it turns out to be slower than that).
But semantically, the speed of death can also be compared to the speed of thought itself, which is obviously strange. Once you learn about these kinds of comical sentence structures, you start to see them more often.
The second one is also comical, and it also has two meanings: McDonald's fries are the holy grail for potato farmers, or, more comically, McDonald's fries (deep-fries) the actual holy grail for potato farmers. A comical mental image.
(Headline images from the Sentence first blog.)
Thus part of speech tagging is never perfect, because there are so many interpretations.
Built-in Tagger
This is the built-in tagger, the one that NLTK recommends. It's pretty slow when working on a fairly large corpus.
nltk.pos_tag(sentence)        # tag one tokenized sentence
nltk.batch_pos_tag(sentences) # for lots of tokenized sentences (renamed pos_tag_sents in newer NLTK versions)
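For example, using the classic sentence from the NLTK book, you tokenize first and then tag:
tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
print nltk.pos_tag(tokens)
# [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
#  ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]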
Unigram, Bigram, and Backoff Tagging
These are backoff taggers; basically, it's just a dictionary lookup used to tag parts of speech. You train it on a tagged corpus (or corpora) and then use it to tag sentences in the future.
default_tagger = nltk.DefaultTagger('NN')
raw = r'''what will this silly tagger do?'''
tokens = nltk.word_tokenize(raw)
print default_tagger.tag(tokens)
# [('what', 'NN'), ('will', 'NN'), ('this', 'NN'), ('silly', 'NN'), ('tagger', 'NN'), ('do', 'NN'), ('?', 'NN')]
Here's how you train a unigram tagger on the Brown corpus. A unigram tagger just assigns each word its most likely tag from the training data, so it won't perform all that well: it ignores context entirely, and any word it never saw in training gets no tag at all (which is why, below, we back it off to a default tagger like the tag-everything-NN one above).
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents()
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]  # 90% for training
test_sents = brown_tagged_sents[size:]   # 10% held out for evaluation
unigram_tagger = nltk.UnigramTagger(train_sents)
print "%0.3f" % unigram_tagger.evaluate(test_sents) # eval the tagger
This builds a true backoff tagger chain that ultimately defaults to a certain part of speech. The trigram tagger looks for a matching trigram context for the current word; if it doesn't find one, it backs off to the bigram tagger, then the unigram tagger, and finally the default tagger.
def build_backoff_tagger(train_sents):
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    t3 = nltk.TrigramTagger(train_sents, backoff=t2)
    return t3

ngram_tagger = build_backoff_tagger(train_sents)
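You can evaluate it the same way as the unigram tagger, using the held-out test sentences from above:
print "%0.3f" % ngram_tagger.evaluate(test_sents)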
What's nice is that, to speed things up, you can just pickle the backoff tagger so that it's easier to deploy if need be.
import pickle # or cPickle

with open('pickled_file.pickle', 'wb') as f:
    pickle.dump(ngram_tagger, f)

with open('pickled_file.pickle', 'rb') as f:
    tagger = pickle.load(f)
Removing Punctuation
At times you'll need to remove certain punctuation marks - this is an easy way to do so.
import string

# assumes `text` is already a list of word tokens
nopunct = [w for w in text if w not in string.punctuation]
' '.join(nopunct[0:100])
Stopwords
Here's an easy way to remove stop words.
from nltk.corpus import stopwords
normalized = [w for w in text6 if w.lower() not in stopwords.words('english')]
Extend it with:
from nltk.corpus import stopwords
my_stops = stopwords.words('english')  # this returns a plain list you can extend
my_stops.append("shoebox")
Stemming
Stemming is the process by which endings are removed from words in order to strip away things like tense or plurality. It's not appropriate for all cases, but it can make it easier to connect different forms of a word and see whether you're covering the same subject matter.
pstemmer = nltk.PorterStemmer()
lstemmer = nltk.LancasterStemmer()
wnlemmatizer = nltk.WordNetLemmatizer()
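A quick sketch of the difference between them (the example words are just for illustration); note that the lemmatizer returns dictionary forms rather than chopped-off stems:
print pstemmer.stem("running")        # 'run'
print lstemmer.stem("maximum")        # 'maxim'
print wnlemmatizer.lemmatize("women") # 'woman'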
Frequency Distributions
A common go-to for seeing what's going on in a text data set, frequency distributions let you see how often certain words occur, and plot it if need be.
fd = nltk.FreqDist(data)
fd.plot()
fd.plot(50, cumulative=True)
fd.most_common(12)
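Here, data is just assumed to be a flat list of tokens; for example, you could build it with the word tokenizer from earlier (raw_text is a stand-in for whatever string you're working with):
data = [w.lower() for w in word_tokenizer.tokenize(raw_text)]
fd = nltk.FreqDist(data)
print fd.most_common(12) # the twelve most frequent tokens with their counts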
Collocations, Bigrams, Trigrams
Bigrams and trigrams are pairs and triples of words that are commonly found together; a collocation finder ranks them by some association measure, such as pointwise mutual information (PMI).
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(text)
finder.nbest(bigram_measures.pmi, 10)
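The trigram measures work the same way with a trigram finder, and a frequency filter is handy for dropping collocations that only show up once or twice (text is assumed to be a list of tokens, as above):
tri_finder = nltk.collocations.TrigramCollocationFinder.from_words(text)
tri_finder.apply_freq_filter(3) # ignore trigrams that occur fewer than 3 times
print tri_finder.nbest(trigram_measures.pmi, 10)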
Chunking
Chunking basically grabs chunks of text, like noun phrases, that might be more meaningful to your research or program. You define a pattern over part-of-speech tags and run it over your corpus, and it extracts the phrasing that you need.
Remember, you've got to customize the pattern to the tagset produced by the part-of-speech tagger you're using, such as the Brown corpus tags or the Stanford Tagger's tags.
technical_term = r"T: {<(JJ|NN|NNS|NNP|NNPS)>+<(NN|NNS|NNP|NNPS|CD)>|<(NN|NNS|NNP|NNPS)>}"
cp = nltk.RegexpParser(technical_term)

for count, sent in enumerate(brown.sents()[100:104]):
    print "Sentence #" + str(count) + ":"
    parsed = cp.parse(nltk.pos_tag(sent))
    print parsed
    print "\nTechnical Terms:\n"
    for tree in parsed.subtrees():
        if tree.label() == "T":
            print tree
Splitting Training Sets + Test Sets
This is a simple way that Marti showed us to split your data into training and test sets.
This splits it into thirds.
Train, Dev, Test Sets
def create_training_sets_trips(feature_function, items):
    featuresets = [(feature_function(key), value) for (key, value) in items]
    third = int(float(len(featuresets)) / 3.0)
    return (items[0:third], items[third:third*2], items[third*2:],
            featuresets[0:third], featuresets[third:third*2], featuresets[third*2:])

train_items, dev_items, test_items, train_features, dev_features, test_features = create_training_sets_trips(f_func, data)
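Both helpers expect items to be (thing, label) pairs and f_func to be a feature extraction function that turns each thing into a dict of features. Here's a purely hypothetical example for the classic name-gender task (both the function and the data are just placeholders):
def f_func(name):
    # hypothetical feature extractor: last letter and length of the name
    return {'last_letter': name[-1], 'length': len(name)}

data = [("Aaron", "male"), ("Alice", "female"), ("Bob", "male"), ("Carol", "female")]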
This splits it into halves.
Simpler Test Sets
def create_training_sets(feature_function, items):
    featuresets = [(feature_function(key), value) for (key, value) in items]
    halfsize = int(float(len(featuresets)) / 2.0)
    train_set, test_set = featuresets[halfsize:], featuresets[:halfsize]
    return train_set, test_set

train, test = create_training_sets(f_func, data)
Classifiers & Scikit-learn
Now, there are plenty of different ways of classifying text. This isn't an exhaustive list, but it's a pretty good starting point.
TF-IDF
See my other two posts on TF-IDF here:
Naive Bayes Classifiers
This is a simple Naive Bayes classifier.
cl = nltk.NaiveBayesClassifier.train(train_set)
print "%.3f" % nltk.classify.accuracy(cl, test_set)
cl.show_most_informative_features(40)
cl.prob_classify(featurize(name)) # featurize() is your feature extractor; this returns a probability distribution over the labels
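To pull an actual confidence value out of that probability distribution, something like this works (featurize is the same assumed feature extractor as above):
pdist = cl.prob_classify(featurize(name))
best_label = pdist.max()                 # the most likely label
print best_label, pdist.prob(best_label) # and its probability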
SVC Classifier
SVMs need numerical inputs; they can't take text-based features directly, so those features have to be converted into numbers before being passed to this classifier (the SklearnClassifier wrapper handles vectorizing the feature dicts for you).
from nltk.classify import SklearnClassifier
from sklearn.svm import SVC

svmc = SklearnClassifier(SVC(), sparse=False).train(train_features)
Decision Tree Classification
This is a simple decision tree classifier.
dtc = nltk.classify.DecisionTreeClassifier.train(train_features, entropy_cutoff=0, support_cutoff=0)
Maximum Entropy Classifier
A maximum entropy classifier and some helpful explainers here.
import numpy
import scipy
from nltk.classify import maxent
nltk.classify.MaxentClassifier.ALGORITHMS
# ['GIS','IIS','CG','BFGS','Powell','LBFGSB','Nelder-Mead','MEGAM','TADM']
# MEGAM or TADM are not rec'd for text classification
mec = nltk.classify.MaxentClassifier.train(train_features, 'GIS', trace=0, max_iter=1000)
Cross Validating Classifiers
To avoid over-fitting, you'll want to cross-validate with k folds. This can help you see where you might be over-fitting on your corpus.
from sklearn import cross_validation

cv = cross_validation.KFold(len(train_features), n_folds=10, indices=True, shuffle=False, random_state=None)
for traincv, evalcv in cv:
    classifier = nltk.NaiveBayesClassifier.train([train_features[i] for i in traincv])
    print 'accuracy: %.3f' % nltk.classify.util.accuracy(classifier, [train_features[i] for i in evalcv])
Creating Pipelines for Classifiers
Finally, creating pipelines can help speed things up immensely, especially when you're moving toward more production-level code.
import sklearn
from sklearn.svm import LinearSVC
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('tfidf', TfidfTransformer()),
                     ('chi2', SelectKBest(chi2, k=2000)),
                     ('nb', MultinomialNB())])
pipecl = SklearnClassifier(pipeline)
pipecl.train(train_features)
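The result behaves like any other NLTK classifier, so assuming you still have the test_features held out earlier, you can evaluate and use it the same way:
print "%.3f" % nltk.classify.accuracy(pipecl, test_features)
print pipecl.classify(test_features[0][0]) # predict the label for a single feature dict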