Wednesday, February 11, 2015

Classification with Positive Examples Only


I recently came across a problem where I had to identify drug names in text. The drug names could be generic (e.g., acetaminophen, aspirin) or brand names (e.g., Tylenol, Prilosec). The thing about drug names is that they "look" slightly different from normal English words, so I thought that perhaps I could exploit this by building n-gram features for each word in the text and comparing them against a model built from the n-gram features of known drug names.

For a list of drug names, I used the drugbank.xml file, which contains 7,779 generic and 9,251 brand names. These are my positive examples, i.e., I can derive from this set what a drug name should look like. However, I don't have a corresponding list of negative examples. What I do have are the words in the text, which can be thought of as unlabeled examples. The book Web Data Mining by Bing Liu describes several methods for building classifiers out of these inputs; this post describes how I used one of them.

In the past (before I learned of this approach), I would have treated this as an unsupervised problem, comparing the n-grams of unknown words against the n-gram set generated from the drugbank.xml file. Words with a similarity above some (manually derived) threshold would be considered drug names, and the rest would not. A minimal sketch of that baseline is shown below.
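
The sketch assumes a simple coverage measure (the fraction of a word's character 3-grams that also occur in known drug names); the helper names and the threshold are illustrative, not taken from the original code.

# Sketch of the earlier unsupervised baseline: score a word by the fraction
# of its character 3-grams that appear among the 3-grams of known drug names.
# Helper names and the 0.8 threshold are illustrative.
import nltk

def char_ngrams(word, n=3):
    # "S" and "E" mark the start and end of the word, as in the code below
    chars = list("S" + word.lower() + "E")
    return set("".join(g) for g in nltk.ngrams(chars, n))

def looks_like_drug(word, drug_ngrams, threshold=0.8):
    grams = char_ngrams(word)
    if len(grams) == 0:
        return False
    coverage = float(len(grams & drug_ngrams)) / len(grams)
    return coverage >= threshold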

With my newly acquired knowledge, the approach is as follows. First, treat the unlabeled data as negative examples and combine it with the positive examples. Build a classifier, then use the classifier's confidence score for the known positive examples as a lower bound, and label as positive any unlabeled example whose score is above this lower bound. Repeat until the number of positive examples converges. Finally, use this classifier against another set of data.
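
In outline, the loop looks something like the sketch below (a condensed preview of the co-training step implemented in full later; the function and variable names are illustrative, and LinearSVC's decision function is used as the confidence score).

# Condensed sketch of the iterative PU-learning loop described above.
import numpy as np
from sklearn.svm import LinearSVC

def expand_positives(X, y, max_iters=10):
    # y is 1 for known positives, 0 for unlabeled (treated as negatives)
    pos_idx = np.where(y == 1)[0]
    clf = LinearSVC()
    clf.fit(X, y)
    for it in range(max_iters):
        scores = clf.decision_function(X)
        # lowest score among the current positives becomes the threshold
        threshold = scores[pos_idx].min()
        new_pos_idx = np.where(scores >= threshold)[0]
        if new_pos_idx.shape[0] <= pos_idx.shape[0]:
            break           # the positive set has stopped growing
        pos_idx = new_pos_idx
        y = np.zeros(y.shape)
        y[pos_idx] = 1
        clf = LinearSVC()
        clf.fit(X, y)
    return clf, y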

The first step is to extract the generic and brand names from the XML file. Because of the size of the file, I decided to use a SAX parser. Here is the code. It parses the generic and brand names out of the file and writes them into a pair of text files.

# Source: parse_drugbank.py
# -*- coding: utf-8 -*-
from sklearn.externals import joblib
import drug_ner_utils as dnu
import os
import xml.sax

class DrugXmlContentHandler(xml.sax.ContentHandler):
    
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.tags = []
        self.generic_names = []
        self.brand_names = []
        
    def startElement(self, name, attrs):
        self.tags.append(name)
    
    def endElement(self, name):
        self.tags.pop()
        
    def characters(self, content):
        breadcrumb = "/".join(self.tags)
        if breadcrumb == "drugbank/drug/brands/brand":
            self.brand_names.append(content)
        if breadcrumb == "drugbank/drug/name":
            self.generic_names.append(content)
    
def write_list_to_file(lst, filename):
    fout = open(os.path.join(dnu.DATA_DIR, filename), 'wb')
    for e in lst:
        fout.write("%s\n" % (e.encode("utf-8")))
    fout.close()

    
source = open(os.path.join(dnu.DATA_DIR, "drugbank.xml"), 'rb')
handler = DrugXmlContentHandler()
xml.sax.parse(source, handler)
source.close()

write_list_to_file(handler.generic_names, "generic_names.txt")
write_list_to_file(handler.brand_names, "brand_names.txt")

generic_fd = dnu.ngram_distrib(handler.generic_names, dnu.GRAM_SIZE)
brand_fd = dnu.ngram_distrib(handler.brand_names, dnu.GRAM_SIZE)

joblib.dump(generic_fd, os.path.join(dnu.DATA_DIR, "generic_fd.pkl"))
joblib.dump(brand_fd, os.path.join(dnu.DATA_DIR, "brand_fd.pkl"))

# Plot visualizations
dnu.plot_ngram_distrib(generic_fd, 30, "Generic", dnu.GRAM_SIZE)
dnu.plot_ngram_distrib(brand_fd, 30, "Brand", dnu.GRAM_SIZE)

This code produces a visualization of the distribution of the top 30 (by frequency) 3-grams, for generic names and for brand names. Notice that the distributions are quite different, which is why we use separate classifiers to detect each class.

[Figure: 3-gram frequency for Generic and Brand names (Top 30)]

The drug_ner_utils file contains some common functions and constants that are used by multiple steps. They all have fairly descriptive names, so it should be easy to infer what they do. The str_to_ngrams function takes a string and an n-gram size and returns the list of n-grams the string splits into. The ngram_distrib function creates a frequency distribution from a list of tokens, and plot_ngram_distrib plots a visualization of the most frequent n-grams in the distribution. The truncate_fd function selects the top n-gram frequencies, and vectorize takes a pair of files (unlabeled and positive), where each line corresponds to a single word represented as a sequence of n-gram "words", and builds the X and y matrices in the format required by the classifier. Here is the code:

# Source: drug_ner_utils.py
# -*- coding: utf-8 -*-
from operator import itemgetter
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
import nltk
import numpy as np
import os
import string

DATA_DIR = "../../data/drug_ner"
GRAM_SIZE = 3
PUNCTS = set([c for c in string.punctuation])
NUMBERS = set([c for c in "0123456789"])

def is_punct(c):
    return c in PUNCTS
    
def is_number(c):
    return c in NUMBERS
    
def str_to_ngrams(instring, gram_size):
    ngrams = []
    for word in nltk.word_tokenize(instring.lower()):
        try:
            word = "".join(["S", word, "E"]).encode("utf-8")
            cword = [c for c in word if not(is_punct(c) or is_number(c))]
            ngrams.extend(["".join(x) for x in nltk.ngrams(cword, gram_size)])
        except UnicodeDecodeError:
            pass
    return ngrams
            
def ngram_distrib(names, gram_size):
    tokens = []
    for name in names:
        tokens.extend(str_to_ngrams(name, gram_size))
    return nltk.FreqDist(tokens)

def plot_ngram_distrib(fd, nbest, title, gram_size):
    kvs = sorted([(k, fd[k]) for k in fd], key=itemgetter(1), reverse=True)[0:nbest]
    ks = [k for k, v in kvs]
    vs = [v for k, v in kvs]
    plt.plot(np.arange(nbest), vs)
    plt.xticks(np.arange(nbest), ks, rotation="90")
    plt.title("%d-gram frequency for %s names (Top %d)" % 
              (gram_size, title, nbest))
    plt.xlabel("%d-grams" % (gram_size))
    plt.ylabel("Frequency")
    plt.show()

def truncate_fd(fd, nbest):
    kvs = sorted([(k, fd[k]) for k in fd], key=itemgetter(1), reverse=True)[0:nbest]
    return {k:v for k, v in kvs}

def vectorize(ufile, pfile, max_feats):
    text = []
    labels = []
    fno = 0
    for fname in [ufile, pfile]:
        f = open(os.path.join(DATA_DIR, fname), 'rb')
        for line in f:
            text.append(line.strip())
            labels.append(fno)
        fno = fno + 1
        f.close()
    vec = CountVectorizer(min_df=0.0, max_features=max_feats, binary=True)
    X = vec.fit_transform(text)
    y = np.array(labels)
    return X, y, vec

The next step is to convert the extracted names into corresponding n-grams. We use 3-grams for this experiment. The following code reads the files written out in the drugbank.xml parsing step and writes out corresponding files of n-grams. Each word in the input file is converted to a space-separated sequence of 3-grams, similar to words in a sentence, which is the format the downstream vectorizer expects. This results in 10,000 brand name words, 10,365 generic name words, and 199,018 unlabeled words.

# Source: ngram_convert.py
# -*- coding: utf-8 -*-
import drug_ner_utils as dnu
import os

def build_ngram_text(infile, outfile):
    fin = open(os.path.join(dnu.DATA_DIR, infile), 'rb')
    fout = open(os.path.join(dnu.DATA_DIR, outfile), 'wb')
    for line in fin:
        for word in line.strip().split():
            ngrams = dnu.str_to_ngrams(word, dnu.GRAM_SIZE)
            if len(ngrams) > 0:
                fout.write("%s\n" % " ".join(ngrams))
    fin.close()
    fout.close()


build_ngram_text("generic_names.txt", "generic_positive.txt")
build_ngram_text("brand_names.txt", "brand_positive.txt")
build_ngram_text("raw_data.txt", "unlabeled.txt")

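To make the format concrete, here is what a single word looks like after conversion (output shown as a comment; the "S" and "E" characters mark the start and end of the word):

# Example of the n-gram "sentence" produced for a single word.
import drug_ner_utils as dnu

print(" ".join(dnu.str_to_ngrams("Tylenol", dnu.GRAM_SIZE)))
# Sty tyl yle len eno nol olE
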
The next step is called co-training. It treats the unlabeled set as the negative examples, combines it with the positive examples, and trains a classifier. The hope is that the classifier will separate the unlabeled set into a part that is "close" to the positive set and a part that is not. The classifier reports, for each row, a confidence score (here the SVM decision function value) that it belongs to the positive class. If we find the minimum score among the known positive examples (excluding outliers), we can consider any row in the unlabeled set as positive if its score is higher than this minimum. We repeat this step until the number of positive examples plateaus. At each iteration, we feed in the labels produced by the previous iteration. Here is the code to do this.

# Source: co_train.py
# -*- coding: utf-8 -*-
from sklearn.externals import joblib
from sklearn.svm import LinearSVC
import drug_ner_utils as dnu
import matplotlib.pyplot as plt
import numpy as np
import os

MAX_ITERS = 10
EST_POSITIVE = 0.7
MAX_FEATURES = 3000

def conservative_min(xs):
    # remove outliers
    q25, q75 = np.percentile(xs, [25, 75])
    iqr = q75 - q25
    lb = q25 - (1.5 * iqr)
    ub = q75 + (1.5 * iqr)
    xs_con = xs[(xs >= lb) & (xs <= ub)]
    return np.min(xs_con)
    
    
for borg in ["generic", "brand"]:
    X, y, vec = dnu.vectorize("unlabeled.txt", "%s_positive.txt" % (borg), 
                              MAX_FEATURES)

    y_pos = np.where(y == 1)[0]   # indices of the known positive examples
    num_positives = [y_pos.shape[0]]

    clf = LinearSVC()
    clf.fit(X, y)

    num_iters = 0
    while (num_iters < MAX_ITERS):
        print("Iteration #%d, #-positive examples: %d" % 
              (num_iters, num_positives[-1]))
        confidence = clf.decision_function(X)
        min_pos_confidence = conservative_min(confidence[y_pos])
        y_pos = np.where(confidence >= min_pos_confidence)[0]
#        if y_pos.shape[0] <= num_positives[-1]:
#            break
        num_positives.append(y_pos.shape[0])
        y = np.zeros(y.shape)
        y[y_pos] = 1
        clf = LinearSVC()
        clf.fit(X, y)
        joblib.dump(y, os.path.join(dnu.DATA_DIR, "y_%s_%d.pkl" % 
                    (borg, num_iters)))
        num_iters += 1
    
    # visualize output
    plt.plot(np.arange(len(num_positives)), num_positives, "b-")
    plt.plot(np.arange(len(num_positives)), 
             X.shape[0] * EST_POSITIVE * np.ones(len(num_positives)), 'r--')
    plt.title("Cotraining for %s classifier (%d features)" % 
              (borg.title(), MAX_FEATURES))
    plt.xlabel("Iterations")
    plt.ylabel("#-Positives")
    plt.show()

The number of positive examples derived from this process is plotted in the charts below, the first for generic names and the second for brand names. Both classifiers plateau at around iteration 2. Our code pickles the label vector at each iteration, so we can retrieve the one we need later.

[Figure: "Cotraining for Generic/Brand classifier (3000 features)": #-Positives vs. Iterations]
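
For example, to check how many positives a given iteration produced, we can load the corresponding pickled label vector (the iteration number below is just for illustration):

# Inspect the label vector pickled at a given co-training iteration.
from sklearn.externals import joblib
import drug_ner_utils as dnu
import numpy as np
import os

y_gen = joblib.load(os.path.join(dnu.DATA_DIR, "y_generic_2.pkl"))
print("#-positives at iteration 2: %d" % int(np.sum(y_gen)))
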
Finally, we apply this classifier to actually detect drug names in our text. Based on the charts above, we choose the generic labels from iteration #4 and the brand labels from iteration #3, retrain a classifier on each, and pass in vectorized strings for classification. The accuracy scores for the generic and brand classifiers (on training data) are 88.4% and 89.4% respectively. However, neither classifier works well on the test data. This is because my unlabeled data is very strongly positive - by my rough estimate, about 70% of the words are either generic or brand names - so the classifier tends to mark everything as positive.

# Source: apply_model.py
from sklearn.externals import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
import drug_ner_utils as dnu
import numpy as np
import os

def vectorize_ngrams(ngrams, vocab):
    vec = np.zeros((1, len(vocab)))
    for ngram in ngrams:
        if ngram in vocab:
            vec[0, vocab[ngram]] = 1
    return vec
    

X, y, generic_vec = dnu.vectorize("unlabeled.txt", "generic_positive.txt", 100)
y = joblib.load(os.path.join(dnu.DATA_DIR, "y_generic_4.pkl"))
generic_clf = LinearSVC()
generic_clf.fit(X, y)
print("Score for generic classifier: %.3f" % (generic_clf.score(X, y)))

X, y, brand_vec = dnu.vectorize("unlabeled.txt", "brand_positive.txt", 100)

y = joblib.load(os.path.join(dnu.DATA_DIR, "y_brand_3.pkl"))
brand_clf = LinearSVC()
brand_clf.fit(X, y)
print("Score for brand classifier: %.3f" % (brand_clf.score(X, y)))

fraw = open(os.path.join(dnu.DATA_DIR, "raw_data.txt"), 'rb')
i = 0
for line in fraw:
    line = line.strip().lower()
    annotated = []
    for word in line.split():
        ngrams = dnu.str_to_ngrams(word, dnu.GRAM_SIZE)
        Xgen = generic_vec.transform([" ".join(ngrams)])
        Xbrand = brand_vec.transform([" ".join(ngrams)])
        is_generic = generic_clf.predict(Xgen)
        is_brand = brand_clf.predict(Xbrand)
        if is_generic == 1:
            annotated.append("<GENERIC>" + word + "</GENERIC>")
        elif is_brand == 1:
            annotated.append("<BRAND>" + word + "</BRAND>")
        else:
            annotated.append(word)
    print("Input: %s" % (line))
    print("Output: %s" % (" ".join(annotated)))
    i += 1
    if i > 10:
        break
fraw.close()

I attempted to correct for this by co-training with a very high number of features (3,000), so the classifier would learn to be more conservative in its decisions, and then retraining the test classifiers with a much smaller number of features (100 and 50 for the generic and brand classifiers respectively). I also tried scaling the data before training the classifiers; that did increase the classifier accuracy (on training data) to 98% and 95%, but the resulting classifiers still marked almost all words as drugs, so it was not very effective.
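
For reference, the scaling step can be done along these lines (a sketch using StandardScaler with with_mean=False, which keeps the matrix sparse; not necessarily the exact scaler I used):

# Sketch: scale the sparse n-gram counts before training the classifier.
# StandardScaler(with_mean=False) scales by variance only, preserving sparsity.
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
import drug_ner_utils as dnu

X, y, vec = dnu.vectorize("unlabeled.txt", "generic_positive.txt", 100)
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)
clf = LinearSVC()
clf.fit(X_scaled, y)
print("Score after scaling: %.3f" % clf.score(X_scaled, y))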

So I guess it's back to the drawing board for this one. However, it was interesting to learn of a method that allows us to build classifiers with no negative examples.

The code for this work is spread out across several files, each corresponding to a specific step in the process. Communication between steps is via flat files and pickled objects written by the previous step. The code can be found on my GitHub project, the drugbank.xml file can be downloaded, and you will have to supply your own unlabeled data.
