Salmon Run: Sentence Genre Classification using Scikit-Learn Linear SVC

To satisfy the (optional) real-world project requirement for my Introduction to Data Science class on Coursera, I built a classifier that could tell whether a sentence came from the medical or the legal domain. It was based on interpolated trigram language models built out of training sets for both genres, and an unseen sentence was classified based on its probability of being part of one language model or the other. You can find the full report and the associated code on my GitHub page here.

The data consisted of 950,887 medical sentences and 837,393 legal sentences. 2,000 sentences (1,000 each from medical and legal) were used to test the classifier. The overall accuracy was 92.7%, which was good enough for our (real-world business) purposes. However, it got me wondering whether I could get comparable results using a simpler, more mainstream approach. After all, we could just treat this as a simple text classification problem, with each sentence being an instance and each word in the sentence being a feature. So that's what I did, and this post describes that effort.

Our training data comes from selected volumes of the Gale Encyclopedia of Medicine for the medical content, and the UCI Machine Learning Repository Legal Case Reports Dataset for the legal content. Both are in XML format, so our first task is to parse these files and convert them to a flat file of sentences, one sentence per line. Here is some code to do that.

# -*- coding: utf-8 -*-
# Source: preprocess.py
# Code to convert from XML format to a file of sentences for
# each genre, one sentence per line.
from __future__ import division
import glob
import nltk
import re
import unicodedata
from xml.dom.minidom import Node
from xml.dom.minidom import parseString

def medical_plaintext(fn):
  print "processing", fn
  if not (fn.startswith("data/medical/eph_") or
      fn.startswith("data/medical/gemd_") or
      fn.startswith("data/medical/gesu_") or
      fn.startswith("data/medical/gea2_") or
      fn.startswith("data/medical/gem_") or
      fn.startswith("data/medical/gech_") or
      fn.startswith("data/medical/geca_") or
      fn.startswith("data/medical/gecd_") or
      fn.startswith("data/medical/gegd_") or
      fn.startswith("data/medical/gend_") or
      fn.startswith("data/medical/gec_") or
      fn.startswith("data/medical/genh_") or
      fn.startswith("data/medical/nwaz_")):
    return ""
  file = open(fn, 'rb')
  data = file.read()
  file.close()
  # remove gale: namespace from attributes
  data = re.sub("gale:", "", data)
  dom = parseString(data)
  text = ""
  paragraphs = dom.getElementsByTagName("p")
  for paragraph in paragraphs:
    xml = paragraph.toxml()
    xml = re.sub("\n", " ", xml)
    xml = re.sub("<.*?>", "", xml)
    text = text + " " + xml
  text = re.sub("\\s+", " ", text)
  text = text.strip()
  text = text.encode("ascii", "ignore")
  return text

def legal_plaintext(fn):
  print "processing", fn
  file = open(fn, 'rb')
  data = file.read()
  data = re.sub("&eacute;", "e", data)
  data = re.sub("&aacute;", "a", data)
  data = re.sub("&yacute;", "y", data)
  data = re.sub("&nbsp;", " ", data)
  data = re.sub("&tm;", "(TM)", data)
  data = re.sub("&reg;", "(R)", data)
  data = re.sub("&agrave;", "a", data)
  data = re.sub("&egrave;", "e", data)
  data = re.sub("&igrave;", "i", data)
  data = re.sub("&ecirc;", "e", data)
  data = re.sub("&ocirc;", "o", data)
  data = re.sub("&icirc;", "i", data)
  data = re.sub("&ccedil;", "c", data)
  data = re.sub("&amp;", "and", data)
  data = re.sub("&auml;", "a", data)
  data = re.sub("&szlig;", "ss", data)
  data = re.sub("&aelig;", "e", data)
  data = re.sub("&iuml;", "i", data)
  data = re.sub("&euml;", "e", data)
  data = re.sub("&ouml;", "o", data)
  data = re.sub("&uuml;", "u", data)
  data = re.sub("&acirc;", "a", data)
  data = re.sub("&oslash;", "o", data)
  data = re.sub("&ntilde;", "n", data)
  data = re.sub("&Eacute;", "E", data)
  data = re.sub("&Aring;", "A", data)
  data = re.sub("&Ouml;", "O", data)
  data = unicodedata.normalize("NFKD",
    unicode(data, 'iso-8859-1')).encode("ascii", "ignore")
  # fix "id=xxx" pattern, causes XML parsing to fail
  data = re.sub("\"id=", "id=\"", data)
  file.close()
  text = ""
  dom = parseString(data)
  sentencesEl = dom.getElementsByTagName("sentences")[0]
  for sentenceEl in sentencesEl.childNodes:
    if sentenceEl.nodeType == Node.ELEMENT_NODE:
      stext = sentenceEl.firstChild.data
      if len(stext.strip()) == 0:
        continue
      text = text + " " + re.sub("\n", " ", stext)
  text = re.sub("\\s+", " ", text)
  text = text.strip()
  text = text.encode("ascii", "ignore")
  return text

def parse_to_plaintext(dirs, labels, funcs, sent_file, label_file):
  fsent = open(sent_file, 'wb')
  flabs = open(label_file, 'wb')
  idx = 0
  for dir in dirs:
    files = glob.glob("/".join([dir, "*.xml"]))
    for file in files:
      text = funcs[idx](file)
      if len(text.strip()) > 0:
        for sentence in nltk.sent_tokenize(text):
          fsent.write("%s\n" % sentence)
          flabs.write("%d\n" % labels[idx])
    idx += 1
  fsent.close()
  flabs.close()

def main():
  parse_to_plaintext(["data/medical", "data/legal"],
    [1, 0], [medical_plaintext, legal_plaintext],
    "data/sentences.txt", "data/labels.txt")

if __name__ == "__main__":
  main()

The code just reads the two directories full of medical and legal XML files, and writes out the sentences one per line into a file called sentences.txt. In parallel, it writes a 1 or 0 to a second file, labels.txt, depending on whether the input file being read is from the medical or the legal corpus, so line N of labels.txt is the label for line N of sentences.txt. The code is largely similar to that for my previous classifier, except that I write out a single file of sentences. This is so I can more easily use Scikit-learn's text API to vectorize the sentences, as described below.

I construct a pipeline of a CountVectorizer to count words, eliminating English stopwords and lowercasing the input. This count vector is then passed to the TfidfTransformer which converts the count vector to a TF-IDF vector, which is the X (feature) vector for our classification algorithm. I use L2 normalization to scale the vector. The outcome vector is read off the labels.txt file with np.loadtxt().
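
To make the two pipeline stages concrete, here is a minimal pure-Python sketch (Python 3) of what counting, TF-IDF weighting and L2 normalization do to three toy sentences. The sentences, the tiny stopword list and the helper names are made up for illustration, and the IDF formula is the smoothed variant described in the scikit-learn documentation, so treat this as an approximation of the idea rather than the code the classifier actually runs.

```python
import math
from collections import Counter

# Toy stand-ins for lines of sentences.txt (hypothetical examples).
sentences = [
    "the patient was given aspirin for the fever",
    "the court dismissed the appeal with costs",
    "aspirin can reduce fever in the patient",
]
# Tiny illustrative stopword list; scikit-learn's English list is much longer.
STOPWORDS = {"the", "was", "for", "with", "can", "in"}

def tokenize(sentence):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in sentence.lower().split() if w not in STOPWORDS]

# Stage 1: raw term counts per sentence (the counting step).
counts = [Counter(tokenize(s)) for s in sentences]

# Stage 2: document frequencies and smoothed IDF (the TF-IDF step).
n_docs = len(counts)
df = Counter(term for c in counts for term in c)
idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

def tfidf_l2(count):
    """Weight counts by IDF, then scale the vector to unit Euclidean length."""
    weights = {t: tf * idf[t] for t, tf in count.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf_l2(c) for c in counts]
for sentence, vec in zip(sentences, vectors):
    print(sentence)
    print("  ", sorted(vec.items(), key=lambda kv: -kv[1]))
```

Words that occur in only one sentence (such as "given") end up with a larger weight than words shared across sentences (such as "aspirin"), and every row has length 1 after normalization, so long and short sentences are comparable.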

The X and y vectors are then fed into Scikit-Learn's Linear Support Vector Classifier (SVC) algorithm. LinearSVC is a popular classifier for text, since the number of features tends to be quite large in text classification problems. Although it is generally advisable to use an L1 loss function, I got very good results (97% accuracy) with L2 during my 10-fold cross validation phase. This was with individual words alone as features. I did try adding bigrams and trigrams to the single word features, capping the number of features at the 10,000 most frequent, but the program took a long time and I eventually killed it.
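
To give a feel for why adding bigrams and trigrams is so much more expensive, here is a small pure-Python sketch (Python 3) that enumerates word n-grams for three made-up sentences and then keeps only the most frequent ones. The sentences, the `ngrams` helper and the cap of 5 are assumptions for illustration; in scikit-learn the equivalent knobs on CountVectorizer would be `ngram_range` and `max_features`.

```python
from collections import Counter

def ngrams(tokens, max_n):
    """Return all contiguous word n-grams for n = 1..max_n."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

# Hypothetical toy sentences standing in for the real corpus.
sentences = [
    "the appeal was dismissed with costs",
    "the appeal was allowed with costs",
    "the patient was treated with aspirin",
]
tokenized = [s.split() for s in sentences]

# Vocabulary size when using unigrams only, up to bigrams, up to trigrams.
sizes = {}
for max_n in (1, 2, 3):
    vocab = Counter(g for toks in tokenized for g in ngrams(toks, max_n))
    sizes[max_n] = len(vocab)
    print("n-grams up to", max_n, "-> vocabulary size", sizes[max_n])

# Keep only the most frequent features, analogous to a max_features cap.
MAX_FEATURES = 5  # illustrative; the experiment above capped at 10,000
vocab = Counter(g for toks in tokenized for g in ngrams(toks, 3))
kept = [g for g, _ in vocab.most_common(MAX_FEATURES)]
print(kept)
```

Even on three short sentences the vocabulary roughly triples once trigrams are included. Note that the cap only limits the final feature count: every n-gram still has to be generated and counted first, which is one plausible reason the full-corpus run was so slow.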

Here is the code that wraps the classifier. The code for cross validation is triggered by passing in an argument "xval". Passing in an argument "run" splits the input data (our list of sentences and labels) 90%/10% into training and test sets. A model is then trained on the training set, persisted to disk, and evaluated against the training set. We then run the model against the test set and evaluate the results.

# Source: classify.py
from __future__ import division

import sys

import cPickle as pickle
import datetime
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# total number of sentences (combined)
NTOTAL = 1788280

def generate_xy(texts, labels):
  ftext = open(texts, 'rb')
  pipeline = Pipeline([
    ("count", CountVectorizer(stop_words='english', min_df=0.0,
              binary=False)),
    ("tfidf", TfidfTransformer(norm="l2"))
  ])
  X = pipeline.fit_transform(ftext)
  ftext.close()
  flabel = open(labels, 'rb')
  y = np.loadtxt(flabel)
  flabel.close()
  return X, y

def crossvalidate_model(X, y, nfolds):
  kfold = KFold(X.shape[0], n_folds=nfolds)
  avg_accuracy = 0
  for train, test in kfold:
    Xtrain, Xtest, ytrain, ytest = X[train], X[test], y[train], y[test]
    clf = LinearSVC()
    clf.fit(Xtrain, ytrain)
    ypred = clf.predict(Xtest)
    accuracy = accuracy_score(ytest, ypred)
    print "...accuracy = ", accuracy
    avg_accuracy += accuracy
  print "Average Accuracy: ", (avg_accuracy / nfolds)

def train_model(X, y, binmodel):
  model = LinearSVC()
  model.fit(X, y)
  # reports
  ypred = model.predict(X)
  print "Confusion Matrix (Train):"
  print confusion_matrix(y, ypred)
  print "Classification Report (Train)"
  print classification_report(y, ypred)
  # close the file explicitly so the model is fully written before reuse
  with open(binmodel, 'wb') as fmodel:
    pickle.dump(model, fmodel)

def test_model(X, y, binmodel):
  with open(binmodel, 'rb') as fmodel:
    model = pickle.load(fmodel)
  if y is not None:
    # reports
    ypred = model.predict(X)
    print "Confusion Matrix (Test)"
    print confusion_matrix(y, ypred)
    print "Classification Report (Test)"
    print classification_report(y, ypred)

def print_timestamp(message):
  print message, datetime.datetime.now()

def usage():
  print "Usage: python classify.py [xval|run]"
  sys.exit(-1)
  
def main():
  if len(sys.argv) != 2:
    usage()
  print_timestamp("started:")
  X, y = generate_xy("data/sentences.txt", "data/labels.txt")
  if sys.argv[1] == "xval":
    crossvalidate_model(X, y, 10)
  elif sys.argv[1] == "run":
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y,
      test_size=0.1, random_state=42)
    train_model(Xtrain, ytrain, "data/model.bin")
    test_model(Xtest, ytest, "data/model.bin")
  else:
    usage()
  print_timestamp("finished:")
  
if __name__ == "__main__":
  main()

The output of our cross validation looks like this:

sujit@cyclone:medorleg2$ python classify.py xval
started: 2013-08-28 20:37:35.097280
...accuracy =  0.938426868276
...accuracy =  0.974534189277
...accuracy =  0.989134811103
...accuracy =  0.98005345919
...accuracy =  0.970250743731
...accuracy =  0.972509897779
...accuracy =  0.971810902096
...accuracy =  0.972672064777
...accuracy =  0.96800836558
...accuracy =  0.976105531572
Average Accuracy:  0.971350683338
finished: 2013-08-28 20:41:35.281316
sujit@cyclone:medorleg2$
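
As a quick sanity check on the numbers above, the reported average is just the arithmetic mean of the ten fold accuracies. The snippet below (Python 3, with the values copied from the console output) recomputes it and also shows how far apart the best and worst folds are.

```python
# Per-fold accuracies copied from the xval output above.
fold_accuracies = [
    0.938426868276, 0.974534189277, 0.989134811103, 0.98005345919,
    0.970250743731, 0.972509897779, 0.971810902096, 0.972672064777,
    0.96800836558, 0.976105531572,
]

mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
spread = max(fold_accuracies) - min(fold_accuracies)

print("mean accuracy: %.6f" % mean_accuracy)
print("best - worst fold: %.6f" % spread)
```

The first fold is the clear outlier at about 0.938; the other nine folds all land between roughly 0.968 and 0.989.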

And the output of the run (train then test) looks like this (the data from the confusion matrix has been prettified a bit).

sujit@cyclone:medorleg2$ python classify.py run
started: 2013-08-28 21:25:20.398061

Confusion Matrix (Train):
                     0      1
          0     745509   7931
          1       7989 848023

Classification Report (Train)
             precision    recall  f1-score   support

          0       0.99      0.99      0.99    753440
          1       0.99      0.99      0.99    856012

avg / total       0.99      0.99      0.99   1609452

Confusion Matrix (Test)
                     0      1
          0      82686   1267
          1       1482  93393

Classification Report (Test)
             precision    recall  f1-score   support

          0       0.98      0.98      0.98     83953
          1       0.99      0.98      0.99     94875

avg / total       0.98      0.98      0.98    178828

finished: 2013-08-28 21:28:02.311399

As you can see, the accuracy of the classifier on the unseen test set is 0.98, which is better than the 92.7% achieved by the language model based classifier. The solution is also simpler and needs less explanation, since it depends on well-known algorithms that have been developed and implemented by machine learning experts.
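
For readers who want to see where the 0.98 comes from, the figures in the classification report can be recomputed directly from the test confusion matrix. This is a small sketch in Python 3; it assumes the usual scikit-learn layout (rows are true labels, columns are predicted labels), which is consistent with the support counts in the report, and the `precision` and `recall` helpers are my own names.

```python
# Test-set confusion matrix from the run above.
# Rows = true label, columns = predicted label (0 = legal, 1 = medical).
cm = [[82686, 1267],
      [1482, 93393]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision(cm, k):
    """Of everything predicted as class k, the fraction that really is k."""
    return cm[k][k] / sum(row[k] for row in cm)

def recall(cm, k):
    """Of everything that really is class k, the fraction predicted as k."""
    return cm[k][k] / sum(cm[k])

print("accuracy: %.4f" % accuracy)
for k, name in enumerate(["legal (0)", "medical (1)"]):
    print("%s precision=%.2f recall=%.2f"
          % (name, precision(cm, k), recall(cm, k)))
```

The diagonal holds the correctly classified sentences, so accuracy is simply the diagonal sum divided by the 178,828 test sentences.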

As before, I cannot provide the medical data since it is a non-free dataset, but the code for the two Python programs described in this post can be found on GitHub here.