Salmon Run: Sentence Genre Classification with WEKA and LibLINEAR

One drawback with using Scikit-Learn to classify sentences into either a medical or legal genre, as described in my previous post, is that we are a Java shop. A pickled Python model cannot be used directly as part of a Java-based data pipeline. Perhaps the simplest way to get around this would be to wrap the model in an HTTP service. However, a pure Java solution is likely to be better received in our environment, so I decided to use Weka, a Java-based data mining toolkit/library that I have used before. This post describes the effort.

Rather than building the entire pipeline from scratch, I decided to keep the Scikit-Learn text processing pipeline intact, and use Weka only to build the classifier and predict with it. Just as Scikit-Learn expects its input as X and y matrices, Weka has a well-defined input format called Attribute-Relation File Format (ARFF). You can define the input to any of Weka's algorithms using this format, so the first step is to convert the X (SciPy sparse) and y matrices generated by Scikit-Learn's text processing pipeline into ARFF files. SciPy has ARFF readers (to read Weka input files) but no writers, so I wrote a simple one for my needs. Here it is:

# Source: src/medorleg2/arffwriter.py
import os.path
import numpy as np
import operator

def qq(s):
  return "'" + s + "'"

def save_arff(X, y, vocab, fname):
  aout = open(fname, 'wb')
  # header
  aout.write("@relation %s\n\n" %
    (os.path.basename(fname).split(".")[0]))
  # input variables
  for term in vocab:
    aout.write("@attribute \"%s\" numeric\n" % (term))
  # target variable
  aout.write("@attribute target_var {%s}\n" %
    (",".join([qq(str(int(e))) for e in list(np.unique(y))])))
  # data
  aout.write("\n@data\n")
  for row in range(0, X.shape[0]):
    rdata = X.getrow(row)
    idps = sorted(zip(rdata.indices, rdata.data), key=operator.itemgetter(0))
    if len(idps) > 0:
      aout.write("{%s,%d '%d'}\n" % (
        ",".join([" ".join([str(idx), str(dat)]) for (idx,dat) in idps]),
        X.shape[1], int(y[row])))
  aout.close()
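
To make the sparse ARFF layout concrete, here is a minimal, dependency-free sketch (Python 3, not the code used for this post) that renders a toy dataset the same way save_arff() does. The function name sparse_arff_lines, the two-term vocabulary and the toy rows are all made up for illustration; rows are given as {column index: value} dicts instead of a SciPy matrix.

```python
def sparse_arff_lines(rows, labels, vocab, relation):
    """Render rows (dicts of column index -> value) as sparse ARFF lines."""
    lines = ["@relation %s" % relation, ""]
    for term in vocab:
        lines.append('@attribute "%s" numeric' % term)
    classes = sorted(set(labels))
    lines.append("@attribute target_var {%s}" %
                 ",".join("'%d'" % c for c in classes))
    lines.extend(["", "@data"])
    class_index = len(vocab)  # the target is the last attribute
    for row, label in zip(rows, labels):
        if not row:
            continue  # like save_arff(), skip rows with no non-zero features
        cells = ["%d %s" % (idx, row[idx]) for idx in sorted(row)]
        cells.append("%d '%d'" % (class_index, label))
        lines.append("{%s}" % ",".join(cells))
    return lines

# Toy input: three rows, the last one empty (so it is dropped)
toy = sparse_arff_lines(
    rows=[{0: 0.6, 1: 0.8}, {1: 1.0}, {}],
    labels=[0, 1, 1],
    vocab=["court", "patient"],
    relation="toy")
print("\n".join(toy))
```

Each data line lists only the non-zero columns as "index value" pairs, with the class label stored at the last attribute index.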

The harness that calls the save_arff() function repeats some of the code in classify.py (from last week's post). Essentially, it builds a Scikit-Learn text processing pipeline to vectorize sentences.txt and labels.txt, which contain our sentences and genre labels respectively, into an X matrix of data and a y vector of target variables, then calls save_arff() to write the training and test ARFF files. It is shown below:

# Source: src/medorleg2/arffwriter_test.py
import sys
import operator

from arffwriter import save_arff
import datetime
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

def load_xy(xfile, yfile):
  pipeline = Pipeline([
    ("count", CountVectorizer(stop_words='english', min_df=0.0,
              binary=False)),
    ("tfidf", TfidfTransformer(norm="l2"))
  ])
  xin = open(xfile, 'rb')
  X = pipeline.fit_transform(xin)
  xin.close()
  yin = open(yfile, 'rb')
  y = np.loadtxt(yin)
  yin.close()
  vocab_map = pipeline.steps[0][1].vocabulary_
  vocab = [x[0] for x in sorted([(x, vocab_map[x]) 
                for x in vocab_map], 
                key=operator.itemgetter(1))]
  return X, y, vocab

def print_timestamp(message):
  print message, datetime.datetime.now()

def main():
  if len(sys.argv) != 5:
    print "Usage: arffwriter_test Xfile yfile trainARFF testARFF"
    sys.exit(-1)
  print_timestamp("started:")
  X, y, vocab = load_xy(sys.argv[1], sys.argv[2])
  Xtrain, Xtest, ytrain, ytest = train_test_split(X, y,
    test_size=0.1, random_state=42)
  save_arff(Xtrain, ytrain, vocab, sys.argv[3])
  save_arff(Xtest, ytest, vocab, sys.argv[4])
  print_timestamp("finished:")
  
if __name__ == "__main__":
  main()
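
The vocab list built in load_xy() matters because ARFF attributes are positional: the j-th @attribute line has to name the term stored in column j of X. Here is a small sketch of that ordering step (Python 3), using a made-up dict in the shape of CountVectorizer's vocabulary_ (term to column index) rather than a fitted vectorizer:

```python
# Hypothetical mapping in the shape of CountVectorizer.vocabulary_:
# term -> column index in the X matrix.
vocab_map = {"patient": 2, "court": 0, "judgment": 1}

# Sort terms by column index so list position equals column index.
vocab = sorted(vocab_map, key=vocab_map.get)
print(vocab)

# Round trip: each term's position in the list is its column in X.
assert all(vocab_map[term] == j for j, term in enumerate(vocab))
```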

Running the arffwriter_test.py as shown below will produce the training and test ARFF files named in the command.

sujit@cyclone:medorleg2$ python arffwriter_test.py \
    data/sentences.txt data/labels.txt \
    data/medorleg2_train.arff data/medorleg2_test.arff

On the Weka side, the analog of Scikit-Learn's LinearSVC algorithm is the LibLINEAR algorithm. LibLINEAR is not included in the Weka base package, and it is not obvious how to integrate it into the (current stable) 3.6 version, as this Stack Overflow post will attest. The (dev) 3.7 version comes with a package manager which makes this process seamless. Unfortunately, it requires an upgrade to Java 1.7, which required (for me) an upgrade to OSX 10.8 (Mountain Lion) :-). I ended up doing all this, because I would have to do it at some point anyway.

In any case, after upgrading to Weka 3.7 and installing LibLINEAR, I was able to run a small sample of 20 sentences using the Weka GUI. Here is the output from the run:

=== Run information ===

Scheme:       weka.classifiers.functions.LibLINEAR -S 1 -C 1.0 -E 0.01 -B 1.0
Relation:     medorleg2_10_train
Instances:    18
Attributes:   388
              [list of attributes omitted]
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

LibLINEAR wrapper

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          16               88.8889 %
Incorrectly Classified Instances         2               11.1111 %
Kappa statistic                          0.7778
Mean absolute error                      0.1111
Root mean squared error                  0.3333
Relative absolute error                 22.093  %
Root relative squared error             66.2701 %
Coverage of cases (0.95 level)          88.8889 %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances               18     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.778    0.000    1.000      0.778    0.875      0.798    0.889     0.889     0
                 1.000    0.222    0.818      1.000    0.900      0.798    0.889     0.818     1
Weighted Avg.    0.889    0.111    0.909      0.889    0.888      0.798    0.889     0.854     

=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = 0
 0 9 | b = 1
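
The accuracy and Kappa figures in the summary can be recomputed from the confusion matrix alone. The sketch below (Python 3; the function name accuracy_and_kappa is mine, not part of Weka) uses the standard Cohen's kappa formula, which I am assuming is what Weka reports as its Kappa statistic:

```python
def accuracy_and_kappa(cm):
    """cm[i][j] = number of class-i instances classified as class j."""
    n = float(sum(sum(row) for row in cm))
    observed = sum(cm[i][i] for i in range(len(cm))) / n
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(col) for col in zip(*cm)]
    # agreement expected by chance, from the row and column marginals
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa

# Confusion matrix from the 18-instance run above
acc, kappa = accuracy_and_kappa([[7, 2], [0, 9]])
print("accuracy = %.4f %%, kappa = %.4f" % (100 * acc, kappa))
```

For the matrix above, 16 of 18 instances are on the diagonal and chance agreement works out to 0.5, which gives the 88.8889 % accuracy and 0.7778 Kappa shown in the report.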

I tried running the full dataset using the GUI, but it ran out of memory even with 6GB of heap. So I ended up running the training (with 10-fold cross-validation) from the command line (based on info from the Weka Primer) like so:

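# Options used below (kept out of the command itself, since comments
# inside a backslash-continued command would break it):
#   -classpath    contains weka and LibLINEAR jars (continuation lines are
#                 unindented so no spaces end up inside the classpath)
#   -Xmx4096M     gave it 4GB, may run with less
#   weka.classifiers.functions.LibLINEAR
#                 full class name of the LibLINEAR classifier
#   -S -C -E -B   parameters copied from GUI defaults
#   -t            training file path
#   -k            report statistics
#   -d            dump model to file
sujit@cyclone:weka-3-7-10$ nohup java \
  -classpath \
$HOME/wekafiles/packages/LibLINEAR/LibLINEAR.jar:\
$HOME/wekafiles/packages/LibLINEAR/lib/liblinear-1.92.jar:\
weka.jar \
  -Xmx4096M \
  weka.classifiers.functions.LibLINEAR \
  -S 1 -C 1.0 -E 0.01 -B 1.0 \
  -t /path/to/medorleg2_train.arff \
  -k \
  -d /path/to/medorleg2_model.bin &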

The report in nohup.out looked like this:

Zero Weights processed. Default weights will be used

Options: -S 1 -C 1.0 -E 0.01 -B 1.0 

LibLINEAR wrapper

Time taken to build model: 21.25 seconds
Time taken to test model on training data: 15.42 seconds

=== Error on training data ===

Correctly Classified Instances     1583458               99.0813 %
Incorrectly Classified Instances     14682                0.9187 %
Kappa statistic                          0.9815
K&B Relative Info Score            156857147.6118 %
K&B Information Score              1563132.5079 bits      0.9781 bits/instance
Class complexity | order 0         1592598.49   bits      0.9965 bits/instance
Class complexity | scheme          15768468 bits      9.8668 bits/instance
Complexity improvement     (Sf)    -14175869.51 bits     -8.8702 bits/instance
Mean absolute error                      0.0092
Root mean squared error                  0.0958
Relative absolute error                  1.8463 %
Root relative squared error             19.2159 %
Coverage of cases (0.95 level)          99.0813 %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances          1598140     


=== Confusion Matrix ===

      a      b   <-- classified as
 734986   8705 |      a = 0
   5977 848472 |      b = 1



=== Stratified cross-validation ===

Correctly Classified Instances     1574897               98.5456 %
Incorrectly Classified Instances     23243                1.4544 %
Kappa statistic                          0.9708
K&B Relative Info Score            155133020.4174 %
K&B Information Score              1545951.0425 bits      0.9673 bits/instance
Class complexity | order 0         1592598.49   bits      0.9965 bits/instance
Class complexity | scheme          24962982 bits     15.62   bits/instance
Complexity improvement     (Sf)    -23370383.51 bits    -14.6235 bits/instance
Mean absolute error                      0.0145
Root mean squared error                  0.1206
Relative absolute error                  2.9228 %
Root relative squared error             24.1777 %
Coverage of cases (0.95 level)          98.5456 %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances          1598140     


=== Confusion Matrix ===

      a      b   <-- classified as
 729780  13911 |      a = 0
   9332 845117 |      b = 1

Having generated the model, we now need to use it to predict the genre of new sentences. Since this part of the process will be called from external Java code, we need to use the Weka Java API. Here is some Scala code that reads the attributes from an ARFF file, loads the classifier model, uses it to measure accuracy on our test ARFF file (10% of the total data), and predicts the genre of a couple of unseen sentences.

// Source: src/main/scala/com/mycompany/weka/MedOrLeg2Classifier.scala
package com.mycompany.weka

import java.io.{FileInputStream, ObjectInputStream}

import scala.Array.canBuildFrom

import weka.classifiers.functions.LibLINEAR
import weka.core.{Attribute, Instances, SparseInstance}
import weka.core.converters.ConverterUtils.DataSource

object MedOrLeg2Classifier extends App {

  val TrainARFFPath = "/path/to/training/ARFF/file" 
  val ModelPath = "/path/to/trained/WEKA/model/file"
  // copied from sklearn/feature_extraction/stop_words.py
  val EnglishStopWords = Set[String](
    "a", "about", "above", "across", "after", "afterwards", "again", "against",
    "all", "almost", "alone", "along", "already", "also", "although", "always",
    "am", "among", "amongst", "amoungst", "amount", "an", "and", "another",
    "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are",
    "around", "as", "at", "back", "be", "became", "because", "become",
    "becomes", "becoming", "been", "before", "beforehand", "behind", "being",
    "below", "beside", "besides", "between", "beyond", "bill", "both",
    "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con",
    "could", "couldnt", "cry", "de", "describe", "detail", "do", "done",
    "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else",
    "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone",
    "everything", "everywhere", "except", "few", "fifteen", "fify", "fill",
    "find", "fire", "first", "five", "for", "former", "formerly", "forty",
    "found", "four", "from", "front", "full", "further", "get", "give", "go",
    "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter",
    "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his",
    "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed",
    "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter",
    "latterly", "least", "less", "ltd", "made", "many", "may", "me",
    "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly",
    "move", "much", "must", "my", "myself", "name", "namely", "neither",
    "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone",
    "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
    "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our",
    "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps",
    "please", "put", "rather", "re", "same", "see", "seem", "seemed",
    "seeming", "seems", "serious", "several", "she", "should", "show", "side",
    "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone",
    "something", "sometime", "sometimes", "somewhere", "still", "such",
    "system", "take", "ten", "than", "that", "the", "their", "them",
    "themselves", "then", "thence", "there", "thereafter", "thereby",
    "therefore", "therein", "thereupon", "these", "they", "thick", "thin",
    "third", "this", "those", "though", "three", "through", "throughout",
    "thru", "thus", "to", "together", "too", "top", "toward", "towards",
    "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
    "very", "via", "was", "we", "well", "were", "what", "whatever", "when",
    "whence", "whenever", "where", "whereafter", "whereas", "whereby",
    "wherein", "whereupon", "wherever", "whether", "which", "while", "whither",
    "who", "whoever", "whole", "whom", "whose", "why", "will", "with",
    "within", "without", "would", "yet", "you", "your", "yours", "yourself",
    "yourselves")
    
  val source = new DataSource(TrainARFFPath)
  val data = source.getDataSet()
  val numAttributes = data.numAttributes()
  data.setClassIndex(numAttributes - 1)
  
  // features: this is only necessary for trying to classify
  // sentences outside the training set (see last block). In
  // such a case we would probably store the attributes in 
  // some external datasource such as a database table or file.
  var atts = new java.util.ArrayList[Attribute]()
  (0 until numAttributes).foreach(j =>
    atts.add(data.attribute(j)))
  val vocab = Map[String,Int]() ++ 
    (0 until numAttributes - 1).
    map(j => (data.attribute(j).name(), j))
  
  // load model
  val modelIn = new ObjectInputStream(new FileInputStream(ModelPath))
  val model = modelIn.readObject().asInstanceOf[LibLINEAR]
  
  // predict using data from test set and compute accuracy
  var numCorrectlyPredicted = 0
  (0 until data.numInstances()).foreach(i => {
    val instance = data.instance(i)
    val expectedLabel = instance.value(numAttributes - 1).intValue()
    val predictedLabel = model.classifyInstance(instance).intValue()
    if (expectedLabel == predictedLabel) numCorrectlyPredicted += 1
  })
  Console.println("# instances tested: " + data.numInstances())
  Console.println("# correctly predicted: " + numCorrectlyPredicted)
  Console.println("Accuracy (%) = " + 
    (100.0F * numCorrectlyPredicted / data.numInstances()))
    
  // predict class of random sentences
  val sentences = Array[String](
    "Throughout recorded history, humans have taken a variety of steps to control family size: before conception by delaying marriage or through abstinence or contraception; or after the birth by infanticide.",
    "I certify that the preceding sixty-nine (69) numbered paragraphs are a true copy of the Reasons for Judgment herein of the Honourable Justice Barker.")
  sentences.foreach(sentence => {
    val indices = sentence.split(" ").
      map(word => word.toLowerCase()).
      map(word => word.filter(c => Character.isLetter(c))).
      filter(word => word.length() > 1).
      filter(word => !EnglishStopWords.contains(word)).
      map(word => if (vocab.contains(word)) vocab(word) else -1).
      filter(index => index > -1).
      toList
    val scores = indices.groupBy(index => index).
      map(kv => (kv._1, kv._2.size))
    // L2 norm: square root of the sum of squared term counts
    val norm = math.sqrt(scores.map(score => score._2).
      foldLeft(0D)((sum, count) => sum + math.pow(count, 2)))
    val normScores = scores.map(kv => (kv._1, kv._2 / norm))
    val instance = new SparseInstance(numAttributes)
    normScores.foreach(score => 
      instance.setValue(score._1, score._2))
    val instances = new Instances("medorleg2_test", atts, 0)
    instances.add(instance)
    instances.setClassIndex(numAttributes - 1)
    val label = model.classifyInstance(instances.firstInstance()).toInt
    Console.println(label)
  })
}
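
The sentence-level preprocessing in the last block of the Scala code is easier to follow in isolation. Below is a rough Python 3 sketch of the same steps: split on spaces, lowercase, keep letters only, drop one-character tokens and stop words, look up vocabulary indices, count, then L2-normalize. The function name featurize and the tiny vocab and stop_words are invented for illustration; the real ones come from the ARFF header and Scikit-Learn's stop word list. Like the Scala code, it normalizes raw term counts and does not apply IDF weights, so it only approximates what the TfidfTransformer produced for the training data.

```python
import math
from collections import Counter

def featurize(sentence, vocab, stop_words):
    """Return {column index: L2-normalized term count} for one sentence."""
    words = (w.lower() for w in sentence.split(" "))
    words = ("".join(c for c in w if c.isalpha()) for w in words)
    kept = [w for w in words
            if len(w) > 1 and w not in stop_words and w in vocab]
    counts = Counter(vocab[w] for w in kept)
    # L2 norm: square root of the sum of squared counts
    norm = math.sqrt(sum(n * n for n in counts.values()))
    return {idx: n / norm for idx, n in counts.items()}

# Hypothetical vocabulary and stop word list, for illustration only
vocab = {"copy": 0, "judgment": 1, "justice": 2}
stop_words = {"the", "of", "a", "is"}
features = featurize(
    "The Justice signed a copy of the judgment, a true copy.",
    vocab, stop_words)
print(features)
```

The resulting dict corresponds to the (index, value) pairs set on the SparseInstance; a sentence with no in-vocabulary terms yields an empty dict, so no division by a zero norm takes place.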

In order to mimic the classpath on the command line I added the following library dependencies into my build.sbt file.

libraryDependencies ++= Seq(
  ...
  "nz.ac.waikato.cms.weka" % "weka-dev" % "3.7.6",
  "nz.ac.waikato.cms.weka" % "LibLINEAR" % "1.0.2",
  "de.bwaldvogel" % "liblinear" % "1.92",
  ...
)

However, I ran into a runtime error complaining of classes not being found in the package "liblinear". It turns out that the Weka LibLINEAR.java wrapper depends on the liblinear-java package, and version 1.0.2 in the repository attempts to dynamically instantiate the liblinear-java classes in the package "liblinear", whereas the classes are actually in the package "de.bwaldvogel.liblinear". I ended up removing the LibLINEAR dependency from build.sbt and copying the LibLINEAR.jar from $HOME/wekafiles into the lib directory as an unmanaged dependency to get around the problem. Here is the output of the run:

# instances tested: 177541
# correctly predicted: 175054
Accuracy (%) = 98.5992
1
0

This shows performance similar to what I got with Scikit-Learn's LinearSVC. In retrospect, I should probably have used Weka's own text processing pipeline instead of trying to mimic Scikit-Learn's filtering and normalization in my Scala code, but this approach gives me the best of both worlds: processing text using Scikit-Learn's excellent API, and the ability to deploy classifier models within a pure Java pipeline.