Thursday, September 14, 2017

EMNLP 2017: Trip Report

Last week, I was at EMNLP 2017 in Copenhagen. EMNLP is short for Empirical Methods in Natural Language Processing, and is one of the conferences of the Association for Computational Linguistics (ACL) that brings together NLP professionals from academia and industry to talk about their research and present their findings to each other. The conference itself ran for 3 days - Saturday September 9 to Monday September 11 - but it was preceded by two days of tutorials and workshops, which I also attended. This is my trip report.


My main takeaway from the conference is that the NLP community is still heavily invested in deep learning. My frame of reference is NAACL 2015, the last ACL conference I attended, where the majority of the papers were about word embeddings and their applications. Papers this year continue to use word embeddings. But in addition, there are many other kinds of embeddings, such as character and subword embeddings to represent word morphology, and phrase embeddings that marry the capabilities of NLP parsers to represent sentence structure. Both offer improvements over the Bag of Words approach or even combining word vectors through Bidirectional LSTMs to produce sentence (or higher abstraction) vectors.

In addition to the Bidirectional LSTM approach, many novel architectures were presented, including CRF-LSTMs, Graph LSTMs, and CNN-LSTMs. These modifications exploit the structure of natural language by providing extra information about phrase structure, emphasizing nearby words, or taking advantage of hierarchy imposed by the application (such as comment threads). Google's efforts with Machine Translation gave us the seq2seq model, but since then its encoder-decoder architecture with optional attention has been adapted for many other NLP tasks that involve sequence inputs and outputs. In addition, Google briefly talked about their Transformer architecture, which is likely to become more important in coming years. Other interesting ideas are the use of adversarial techniques and joint learning to improve the accuracy of difficult tasks.

One other important trend I saw was the broader adoption of Reinforcement Learning (RL) techniques. I mostly think of RL in the context of game playing AIs, which implies that there is a physics engine somewhere to provide automated reinforcement during training. In the context of NLP, this physics engine seems to be a search engine, optionally coupled with domain dependent retrieval rules. Applications taking advantage of RL seem to be mostly related to Learning to Rank (L2R), as far as I could see.

Finally, there were a few papers using more traditional techniques, such as the use of probabilistic graphical models or other Bayesian techniques, or based on clustering and topic modeling. In keeping with the focus on deep learning, almost all of them use word (and optionally character) embeddings to augment their feature set.

Structurally, the conference was organized into three parallel tracks, built around themes such as Syntax, Semantics, Information Extraction, Machine Translation, Machine Learning, Language Generation, Discourse and Summarization, Multilingual NLP, Language Grounding, Multimodal NLP, Linguistic Theory, Computational Social Science, Sentiment Analysis, Dialog, and NLP Applications. In addition, there were 7 tutorials and 14 workshops held on the first 2 days, perhaps based on the premise that an attendee would either be a newbie or an expert, so you would find something to occupy your day. You could do a maximum of 3 tutorials (0.5 day per tutorial) or 2 workshops (1 day per workshop) if you attended the first 2 days.

There were also tons of poster sessions throughout the three days of the conference, and some of the ideas in these posters were really cool. One thing I found a bit annoying was that the posters would be up for a limited time and they would get changed after each session. This meant that you either missed a few talks if you wanted to do justice to the posters, or tried to take in as many posters as you could during the coffee and lunch breaks. I chose to do the latter, except one time when a speaker failed to show up.

What follows is a brief description of the talks I attended, which probably falls into the TL;DR category unless you want my personal take on the talks. Links to all papers presented at EMNLP can be found here (you might need ACL membership in the future, but they seem to be readily available now). In addition, the entire event was live streamed and the recordings are here. I am guessing that the recordings of the individual talks will eventually make it to a YouTube channel once the editing process is completed. I will update the post with the links once that happens. If you find the YouTube videos first, please let me know in the comments and I will update.

Tutorials and Workshops

Tutorial: Acquisition, Representation and Usage of Concept Hierarchies - by Marius Pasca (abstract)
A brief but very representative overview of techniques used to extract and represent entities in IS-A relationships, and various techniques for using these concepts in search applications. I could identify a few techniques I knew about, but there were quite a few I did not, so it was very useful for me.

Tutorial: Graph based text representations: Boosting text mining, NLP and information retrieval with graphs - by Fragkiskos D Malliaros and Michalis Vazirgiannis (abstract)
Very comprehensive coverage of graph techniques for NLP, using graph of words for information retrieval, text summarization using k-core decomposition, using graph based document representations for clustering, subgraph extraction and frequent subgraph mining techniques. Again, the benefit to me was the breadth of coverage.

Tutorial: Memory Augmented Neural Networks (MANN) for Natural Language Processing - by Caglar Gulcehre and Sarath Chandar (abstract)
Despite the success of LSTMs for solving NLP problems, there are still some complex tasks that need the ability to store and retrieve information on demand from an external store because they need to look at information that is too far in the past (or future) for an LSTM's hidden vector to provide. The resulting architecture is the Neural Turing Machine (NTM), and this tutorial discusses NTMs in quite a bit of depth.

Workshop: evaluating vector space representations in NLP
I attended part of this on the second day. Highlights of the workshop for me were the talks by Yejin Choi from University of Washington, Jacob Uszkoreit from Google and Kyunghyun Cho (of GRU fame) of New York University. Yejin spoke about the need for extracting tactile information from the physical world and using it in reasoning. Jacob talked at length about the Transformer Architecture in connection with Machine Translation (and Language Understanding), and Cho spoke about using character models.

Conference Day #1

Keynote: Physical Simulation, Learning and Language - by Nando de Freitas
Nando de Freitas spoke about the need to build systems that can learn to learn from the environment like a general AI, and described a framework that allows researchers to simulate a physical world at faster than real time, that has led to many improvements in robotics. He argued for the need for something similar in the area of language research.

Monolingual Phrase Alignment on Parse Forests - by Yuki Arase and Jun'ichi Tsujii
Presenter described a tree-based method to detect and align phrases using paraphrase statements. In the process they have developed a gold dataset of parse trees and phrase alignments that they offer to fellow researchers.

Heterogeneous Supervision for Relation Extraction: A representation learning approach - by Liyuan Liu, Xiang Ren, Qi Zhu, Shi Zhi, Huan Gui, Heng Ji and Jiawei Han.
Presenter described a method to learn to learn relation extraction using domain heuristics and knowledge bases. The resulting supervision is quite noisy, and the noise is resolved using reliability ranking of sources and context embeddings, much like word disambiguation using word vectors.

Mimicking word embeddings using subword RNNs - by Yuval Pinter, Robert Guthrie, Jacob Eisenstein
Presenter described results from their system MIMICK against other subword embedding systems such as Char2Tag. MIMICK can generate word embeddings for Out of Vocabulary (OOV) words from subword embeddings, much like word vectors are used to generate sentence vectors using the BiLSTM approach. Code for MIMICK can be found at the link.

Entity Linking for Queries by Searching Wikipedia Sentences - by Chuanqi Tan, Furu Wei, Pengjie Ren, Weifeng Lv and Ming Zhou
Extracting entities from queries can make disambiguation easier. System uses a search index to retrieve sentences containing the query terms and does entity extraction on the resulting sentences to find entities in the query. For word disambiguation, the presenters used the system supWSD (supWSD code), which is a supervised WSD system, and provides a toolkit and trained models.

End to end neural coreference resolution - by Kenton Lee, Luheng He, and Luke Zettlemoyer
Presenter describes an end-to-end system (e2e-coref) similar to Question Answering (QA) systems, where a document is broken up into spans using standard parsing techniques. All spans are treated as mention spans and a network is used to detect similar mentions and cluster them. Code for the e2e-coref system can be found at this link.

Neural Machine Translation with word prediction - by Rongxiang Weng, Shujian Huang, Zaixiang Zheng, Xin-Yu Dai, and Jiajun Chen
The presenter suggests a change to the standard seq2seq model used for machine translation, to also include all previous predictions at each stage in the decoder sequence, and use the top K words as the vocabulary. They have found that it improves translation performance.

Affinity preserving random walk for multi-document summarization - by Kexiang Wang, Tianyu Liu, Zhifang Sui, and Baobao Chang
Output of multi-document summarization (MDS) is a short text that summarizes all the documents in the collection. Presenter describes a graph based method that collects the entities from all documents in the collection, and then executes a random walk similar to PageRank. Once the process converges, the important entities of the graph can be converted to the summary.
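To make the "random walk similar to PageRank" idea concrete, here is a minimal sketch of plain PageRank-style power iteration over a small weighted graph. This is the textbook algorithm, not the affinity-preserving variant from the paper, and the function name, damping value and toy graph are my own assumptions for illustration.

```python
import numpy as np

def random_walk_scores(adjacency, damping=0.85, tol=1e-8, max_iter=1000):
    """Score graph nodes with PageRank-style power iteration."""
    adjacency = np.asarray(adjacency, dtype=float)
    n = adjacency.shape[0]
    # Row-normalize edge weights into transition probabilities.
    # Nodes without outgoing edges jump uniformly to any node.
    row_sums = adjacency.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, adjacency / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # With probability (1 - damping) the walker teleports to a random node.
        updated = (1.0 - damping) / n + damping * (scores @ transition)
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores

# Toy graph (hypothetical): every other node links to node 0.
toy_graph = np.array([
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
])
scores = random_walk_scores(toy_graph)
ranking = np.argsort(-scores)  # node indices, most important first
```

In a summarization setting the nodes would be entities or sentences and the top-ranked ones would be the candidates for the summary.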

Google's multilingual Neural Machine Translation System: Enabling Zero Shot Translation - by Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viegas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean
Google's NMT model is well known. Presenter describes results of various experiments, including creating a combined multi-language model, and creating models for CJK and European languages and comparing how they perform among different groups of languages. It turns out that multi-language NMT results in higher performance, and that NMT trained on one language family is more effective (for certain language families) on its own family than on others.

DeepPath: A reinforcement learning method for knowledge graph reasoning - by Wenhan Xiong, Thien Hoang, and William Yang Wang
Presenter describes their DeepPath system, which uses Reinforcement Learning (RL) and Knowledge Graph embeddings to learn to find the most promising relation in a KG to extend the path. Code for DeepPath is available here.

Task Oriented Query Reformulation with Reinforcement Learning - by Rodrigo Nogueira and Kyunghyun Cho
Presenter describes an RL based Neural Network (NN) that reformulates complex user queries to maximize the number of relevant documents returned. The reward function used is the document recall. Code for the Query Reformulator is available here.
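As a rough illustration of what a recall-based reward looks like, the sketch below scores a reformulated query by the fraction of known-relevant documents it retrieves. The function name and document ids are hypothetical, and this is not taken from the authors' code.

```python
def recall_reward(retrieved_ids, relevant_ids):
    """Fraction of the relevant documents that appear in the retrieved set."""
    relevant = set(relevant_ids)
    if not relevant:
        # No relevant documents are known, so there is nothing to reward.
        return 0.0
    hits = relevant & set(retrieved_ids)
    return len(hits) / len(relevant)

relevant_docs = ["d1", "d2", "d3", "d4"]
# Results for the original query and for a reformulated query (made-up ids).
reward_before = recall_reward(["d1", "d7"], relevant_docs)
reward_after = recall_reward(["d1", "d2", "d3", "d9"], relevant_docs)
```

A reformulation that raises recall gets a higher reward, which is the signal the RL agent would be trained to maximize.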

Sentence Simplification with Deep Reinforcement Learning - by Xingxing Zhang and Mirella Lapata
Presenter describes an RL based DL system for sentence simplification called DRESS (Deep REinforcement Sentence Simplification). Reward function used is the SARI metric which rewards similarity, simplicity and correct grammar. Code for DRESS is available here.

Learning how to active learn: A Deep Reinforcement Learning Approach - by Meng Fang, Yuan Li and Trevor Cohn.
Presenter describes their system which uses RL to learn a policy to do Named Entity Recognition (NER) in one language, and apply the same policy to do NER in another language. The policy learned is based on labeling functions developed against a small dataset in the original language. Code for the RL system is here.

Conference Day #2

Keynote: Towards more universal language technology: unsupervised learning from speech - by Sharon Goldwater
Sharon makes a case for unsupervised and semi-supervised learning and describes her work on unsupervised learning in the area of speech. Results are not very good but the task is very hard. Some of her ideas may be directly transferable to language, but she makes the case that the NLP community should also invest effort in unsupervised techniques going forward.

A structured learning approach to temporal relation extraction - by Qiang Ning, Zhili Feng and Dan Roth.
Presenter describes the difficulty with manually annotating temporal relations in text, and proposes a graph based approach with verbs connected by candidate temporal relation edges, computing pairwise KL divergence between the nodes and comparing to KL divergence between two entities with uniform distribution.

Importance sampling for unbiased on-demand evaluation of knowledge base population - by Arun Chaganty, Ashwin Paranjape, Percy Liang and Christopher D Manning
Presenter discusses how NER system evaluation is inherently biased in that it penalizes new findings from new NER systems. He proposes a way to sample from the predictions of the new NER system and verify that these findings are valid using crowdsourcing. Resulting approach is cheaper than naive crowdsourcing and removes bias in evaluation. Project code is here, and here is the Online Demo.

PACRR: A position aware neural IR model for relevance matching - by Kai Hui, Andrew Yates, Klaus Berberich, and Gerard de Melo.
Presenter describes an NN based model that mimics the relevance formula in a search engine. Since the input embeddings are word vector based, the intuition is that the resulting model will be better at capturing semantic similarity. Good results are already available for unigram matching, so this work explores the effect of position of context words on relevance matching. Code for PACRR is here.

Globally Normalized Reader - by Jonathan Raiman and John Miller.
Presenter describes their GNR system that does QA using iterative search instead of the typical bidirectional attention mechanism. Results are back propagated through beam search, and the approach is found to produce better results against the SQuAD dataset. The team has also used a novel data augmentation method for their training, and they offer the dataset as well to interested researchers. Code for the Globally Normalized Reader can be found here.

Encoding sentences for graph convolutional networks for semantic role labeling - by Diego Marcheggiani and Ivan Titov.
Presenter describes a Graph Convolutional Network (GCN) for modeling syntax dependency graphs, and their use as sentence encoders for Semantic Role Labeling (SRL) applications. They note that GCNs are complementary to LSTMs, and stacking them together results in improved results in identifying predicate-argument structures in a sentence, compared to the previous state of the art LSTM-based SRL model. Code for the NN based SRL system is here.

Neural Semantic Parsing with Type Constraints for Semi-structured tables - by Jayant Krishnamurthy, Pradeep Dasigi and Matt Gardner.
Presenter describes their model which learns how to answer compositional questions on semi-structured Wikipedia tables. Input is the natural language question and output is a well-typed logical form for navigating and looking up the answer. Dataset used is WikiTableQuestions.

Joint Concept Learning and Semantic Parsing from Natural Language Explanations - by Shashank Srivastava, Igor Labutov, and Tom Mitchell.
Presenter describes their system that uses certain features of text explanations to identify concepts. For example, the presence of "bank account number" in an explanation about phishing. Label functions are generated from these texts and used to identify a concept.

Opinion Recommendation using a Neural Model - by Zhongqing Wang and Yue Zhang.
Presenter describes their system that jointly generates a custom review score and a review for a given user, given his other reviews and scores. Task is novel, hence a new name Opinion Recommendation. Inputs are 3 NNs which model the reviews about the product, the user, and the user neighborhood (other users). These are concatenated using multi-hop attention (which seems to be iterative dot products) and form the input to another NN that outputs the score and the generated review using a standard encoder-decoder architecture.

Accurate Supervised and Semi-supervised machine reading for long documents - by Daniel Hewlett, Llion Jones and Alexandre Lacoste.
Presenter describes a standard QA network. The novel bit is that documents are split into equal sized parts (best results found with chunk size of 30 words) and encoded using RNNs in parallel. The network then attends over these separate encodings and reduces them to a single encoding, which is then decoded into an answer using a sequence decoder.

Adversarial Examples for Evaluating Reading Comprehension Systems - by Robin Jia and Percy Liang.
Presenter describes how adding extra information to documents in a QA scenario can lead to a QA system giving the wrong answer. This is similar to the adversarial examples used in vision. They then propose an evaluation scheme for QA systems using this idea to measure if the QA system is demonstrating true language understanding versus just learning how to do pattern matching.

Joint modeling of Topics, Citations and Topical Authority in Academic Corpora - by Jooyeon Kim, Dongwoo Kim and Alice Oh.
Presenter introduces Latent Topical Authority Indexing (LTAI) which they show is a better way to expose topic signals from papers and authors than current techniques. LTAI can be used to find an expert on a topic, or to compare topical authority among multiple authors. The model used is a Probabilistic Graphical Model (PGM) which uses Expectation Maximization (EM) to compute the LTAI metric.

Identifying semantic intentions from revisions in Wikipedia - by Diyi Yang, Aaron Halfaker, Robert Kraut, and Eduard Hovy.
Presenter talks about a 13 category taxonomy of semantic intention behind Wikipedia edits, and describes a classifier that can predict the intention given the user's edit history. This also opens up avenues for research into behavior of Wikipedia editors.

Conference Day #3

Keynote: Processing the language of policing for Improving Police-Community Relations - by Dan Jurafsky
Dan Jurafsky speaks about his recent research into how the language policemen use exhibits a racial bias. Data for the research comes from 1 month of video footage from body cameras worn by Oakland PD officers. He also gave a brief update on his ongoing research into food and sociology. The theme of the talk was the need for NLP to do cross-disciplinary research so it can have a greater impact.

Part of Speech tagging for Twitter with Adversarial Neural Networks - by Tao Gui, Qi Zhang, Haoran Huang, Minlong Peng and Xuanjing Huang
Presenter describes how combining POS tags from a high resource corpus such as WSJ, as well as character embeddings from both WSJ and Twitter, enables learning of POS tags on Twitter using an adversarial discriminator setup.

Learning Generic Sentence Representation using Convolutional Neural Networks - by Zhe Gan, Yunchen Pu, Ricardo Henao, Chunyuan Li, Xiaodong He, and Lawrence Carin
Presenter proposes a new encoder-decoder approach to learn distributed sentence/paragraph representations using a CNN-LSTM or hierarchical CNN-LSTM network instead of using LSTMs for both encoder and decoder as is done currently. He presents empirical evidence showing that performance is as good or better than using LSTM for both encoder and decoder.

Conversation Modeling on Reddit using a Graph Structured LSTM - by Victoria Zayats and Mari Ostendorf
Presenter describes her project to capture keywords/topics for popular vs unpopular Reddit comments (objective is to find what makes some comments popular vs not popular for a given subreddit). Since Reddit comments are hierarchical, a Graph LSTM is used, which builds the hidden component of the input from both the parent comment as well as the previous comment. Learned a nice method of quantization by selecting the median of each quantile of the score distribution as the threshold.

Learning what to read: Focused Machine Reading - by Enrique Noriega-Atala, Marco A Valenzuela-Escárcega, Clayton Morrison, and Mihai Surdeanu
Presenter describes project to capture statements of the form "A related-to B given context C" on the Pubmed OpenAccess (OA) dataset. A pair of entities is chosen and queries fired against the corpus to find all possible entity pairs. RL is used to score the best path between the two specified entities, which results in approximately 40% of the path lookups compared to exhaustive search.

DOC: Deep Open Classification of text documents - by Lei Shu, Hu Xu, and Bing Liu
This talk is unique in that it makes the open world assumption: instead of a document being classified into 1 of N classes, the document can also belong to none of the N classes, or to more than one of them. Approach is to create one-vs-rest classifiers for each class, and then softmax across their scores to find the classes. Thresholding to detect the class(es) to assign to each document involves fitting a Gaussian to a histogram of scores for positive labels for each class, and then considering the mean + a multiple of the standard deviation as the threshold.

Exploiting Cross Sentence Context for Neural Machine Translation - by Longyue Wang, Zhaopeng Tu, Andy Way and Qun Liu
Presenter describes a novel idea of computing the context (3 previous sentences to current sentence being translated) and using it as additional input to the encoder, decoder, or even for attention during decoding. Experiments show that the additional context results in better scores on their test data. Code for the project is available here.

Cross Lingual Transfer learning for POS Tagging without cross lingual resources - by Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya, and Eric Fosler-Lussier
Yet another example of adversarial learning in the NLP space, where the POS tags for the resource rich language are used to train a discriminator which is then used to train a generator to generate POS tags for the resource poor language. Code for the tagger is available here.

A Simple Regularization based Algorithm for learning Cross-Domain Word Embeddings - by Wei Yang, Wei Lu and Vincent Zheng.
Presenter describes building a graph using entities from a given domain as the nodes, and the edges weighted using the cosine distance between their vector representations. Then an iterative algorithm such as PageRank is run until convergence. Cross domain word embeddings are learned by running analogies between selected words in one domain and single words in the other.

Best Paper: Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps - by Tobias Falke and Iryna Gurevych.
Presenter describes a system that extracts entities from one or more documents, sets them up in a graph based on cosine distance between their word vectors, and finds the most important entities. These entities are then sent to human experts in a crowdsourcing arrangement who construct the summaries manually.

The other 3 papers in the best paper category were around the language learned when two automated agents engage in dialog, predicting depression and suicide risk in online forums, and how to correct for dataset bias for machine learning models.


Presentations were predominantly from academia, which is kind of expected, since most of the papers tend to push the envelope of the state of the art, something academia tends to do. Among US universities, Stanford and Carnegie Mellon are well known for their NLP, so as expected, there were quite a few presentations from there. University of Washington also had quite a few good submissions. I also saw a lot of presentations from Chinese universities, so it looks like NLP and Deep Learning are quite popular in China. From industry, I saw two presentations from Google and one from Baidu.

That's pretty much all I have for this week. I made a few awesome friends among attendees who were local to Copenhagen, thanks to an introduction from a colleague. Thanks to their help, I got to eat authentic Italian pizza and spicy Indian food in Copenhagen at an area right next to the conference, but which I am pretty sure I wouldn't have found on my own :-).

Monday, August 21, 2017

Improving Sentence Similarity Predictions using Attention and Regression

If you have been following my last two posts, you will know that I've been trying (unsuccessfully so far) to prove to myself that the addition of an attention layer does indeed make a network better at predicting similarity between a pair of inputs. I have had good results with various self attention mechanisms for a document classification system, but I just couldn't replicate a similar success with a similarity network.

Upon visualizing the intermediate outputs of the network, I observed that the attention layer seemed to be corrupting the data - so instead of highlighting specific portions of the input as might be expected, the attention layer output actually seemed to be more uniform than the input. This led me to suspect that either my model or the attention mechanism was somehow flawed or ill-suited for the problem I was trying to solve.

My similarity network was trying to treat the similarity problem as a classification problem, i.e., it would predict one of 6 discrete "similarity classes". However, the training data provided the similarities as continuous floating point numbers between 0 and 5. The examples I had seen before for similar Siamese network architectures (such as this example from the Keras distribution) typically minimize a continuous function such as Contrastive Loss. So I decided to change my network to a regression network, more in keeping with the data provided and examples I had seen. This new network would learn to predict a similarity score by minimizing the Mean Squared Error (MSE) between label and prediction. The optimizer used was RMSProp.

With the classification network, the validation loss (categorical cross-entropy) on even the baseline (no attention) network kept oscillating and led me to believe that the network was overfitting on the training data. The learning curve from the baseline regression network (w/o Attn) looked much better, so the change certainly appears to be a step in the right direction. The network was evaluated using Root Mean Square Error (RMSE) between label and prediction on a held-out test set. In addition, I also computed the Pearson correlation and Spearman (rank) correlation coefficients between label and predicted values in the test set.
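For reference, the three evaluation metrics can be computed with a few lines of NumPy. This is a minimal sketch, not the code from my notebooks; the helper names and the toy labels and predictions are made up for illustration, and Spearman is computed as Pearson over average ranks.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def pearson(x, y):
    """Pearson correlation coefficient."""
    return float(np.corrcoef(x, y)[0, 1])

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy similarity scores on the 0 to 5 scale (illustrative only).
labels = [0.0, 1.0, 2.5, 4.0, 5.0]
predictions = [0.5, 0.8, 3.0, 3.5, 4.5]
test_rmse = rmse(labels, predictions)
test_pearson = pearson(labels, predictions)
test_spearman = spearman(labels, predictions)
```

RMSE measures how far off the predicted scores are, while the two correlations measure whether the predictions move with the labels (Pearson) or at least order the pairs the same way (Spearman).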

In addition, I decided to experiment with some different Attention implementations I found on the TensorFlow Neural Machine Translation (NMT) page - the additive style proposed by Bahdanau, and the multiplicative style proposed by Luong. The equations here are in the context of NMT, so I modified the equations a bit for my use case. In addition, I found that the attention style I was using from the Parikh paper is called the dot product style, so I included that too below with similar notation, for comparison. Note that the difference in "style" pertains only to how the alignment matrix α is computed, as shown below.

The alignment matrix is combined with the input signal to form the context vector, and the context vector is concatenated with the input signal and weighted with a learned weight and passed through a tanh layer.
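The three styles can be sketched in NumPy as follows. This is an illustration of the general dot, multiplicative (Luong) and additive (Bahdanau) scoring functions applied to a pair of sentence matrices, not the code of my Keras layers; the weight names (W, W1, W2, v, Wc), the shapes and the random values are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def score_dot(a, b):
    """Dot style: raw dot products between positions of a and b."""
    return a @ b.T

def score_multiplicative(a, b, W):
    """Multiplicative (Luong) style: dot product through a learned matrix W."""
    return a @ W @ b.T

def score_additive(a, b, W1, W2, v):
    """Additive (Bahdanau) style: v . tanh(W1 a_i + W2 b_j) for every pair."""
    hidden = np.tanh((a @ W1)[:, None, :] + (b @ W2)[None, :, :])  # (m, n, k)
    return hidden @ v                                               # (m, n)

def attend(a, b, scores, Wc):
    """Alignment -> context -> concatenate with input -> weight -> tanh."""
    alpha = softmax(scores, axis=-1)   # alignment matrix, rows sum to 1
    context = alpha @ b                # (m, d) weighted view of b for each a_i
    combined = np.concatenate([context, a], axis=-1)
    return np.tanh(combined @ Wc), alpha

rng = np.random.default_rng(0)
m, n, d, k = 4, 6, 8, 5               # sentence lengths and layer sizes
a, b = rng.normal(size=(m, d)), rng.normal(size=(n, d))
W = rng.normal(size=(d, d))
W1, W2, v = rng.normal(size=(d, k)), rng.normal(size=(d, k)), rng.normal(size=k)
Wc = rng.normal(size=(2 * d, d))

outputs = {
    "dot": attend(a, b, score_dot(a, b), Wc),
    "mult": attend(a, b, score_multiplicative(a, b, W), Wc),
    "add": attend(a, b, score_additive(a, b, W1, W2, v), Wc),
}
```

Only the scoring function changes between the three entries; the softmax, context and tanh steps are shared, which is the sense in which the "style" only affects the alignment matrix.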

One other thing to note is that unlike my original attention implementation, the alignment matrix in these equations is formed out of the raw inputs rather than the ones scaled through a tanh layer. I did try using scaled inputs with the dot style attention (MM-dot(s)) - this was my original attention layer without any change - but the results weren't as good as dot style attention without scaling (MM-dot).

For the two new attention styles, I added two new custom Keras Layers: AttentionMMA for the additive (Bahdanau) style, and AttentionMMM for the multiplicative (Luong) style. These are called from the model with additive attention (MM-add) and model with multiplicative attention (MM-mult) notebooks respectively. The RMSE, Pearson and Spearman correlation coefficients for each of these models, each trained for 10 epochs, are summarized in the chart below.

As you can see, the dot style attention doesn't seem to do too well against the baseline, regardless of whether the input signal is scaled or not. However, both the additive and multiplicative attention styles result in a significantly lower RMSE and higher correlation coefficients than the baseline, with additive attention giving the best results.

That's all I have for today. I hope you found it interesting. There are many variations among Attention mechanisms, and I was happy to find two that worked well with my similarity network.

Saturday, August 12, 2017

Visualizing Intermediate Outputs of a Similarity Network

In my last post, I described an experiment where the addition of a self attention layer helped a network do better at the task of document classification. However, attention didn't seem to help for another experiment where I was trying to predict sentence similarity. I figured it might be useful to visualize the outputs of the network at each stage, in order to see where exactly it was failing. This post describes that work. The visualizations did give me pointers to what was happening, and I tried some of these ideas out, but so far I haven't been able to get a network with attention to perform better than a network without it at the similarity task.

The diagram below illustrates the structure of the network whose outputs I was trying to visualize. The network is built to predict the similarity between two sentences on a 6 point scale. The training data comes from the Semantic Similarity Task Dataset for 2012, and consists of sentence pairs and an associated similarity score (floating point number) between 0 and 5. For this experiment, I quantize the labels into 6 different similarity classes, and attempt to predict that value. Word vectors are looked up from pretrained GloVe embeddings for each word in the sentence pair, then the sequence of word vectors is sent through a Bidirectional LSTM to produce an encoded sentence matrix for each sentence in the pair. The sentence matrices are then sent through an attention layer that first creates an alignment matrix between the two sentence matrices, then uses the alignment matrix to determine how much to weight each part of the two sentences when producing the output vector. The output vector is then fed into a Fully Connected network to do the final prediction.

I wanted to visualize the outputs at each stage of the network to see how they differed at each stage. So I first selected three sentence pairs with label similarity values approximately equidistant along the label range. For each sentence pair, I computed the similarity matrices for (a) the input (one-hot) vector sequence for each sentence, (b) their word vector sequence after embedding, (c) the sentence vector after encoding, (d) the alignment between the two sentence matrices, and (e) the similarity matrix between the aligned sentences. Each of these matrices is represented as a heat map for visualization. In addition, (f) I also used the alignment between the two embeddings to compute the weighted sentence matrix to see if that made any difference.

Each heatmap also has a crude measure of "similarity" that divides the sum of the diagonal elements by the sum of all the elements.
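As a rough illustration of that measure, here is a minimal numpy sketch. The function names and the toy one-hot inputs are my own assumptions for illustration, not the code from the notebook, and I assume the similarity matrix is a plain matrix of pairwise dot products.

```python
import numpy as np

def similarity_matrix(left, right):
    """Pairwise dot products between the rows of two (timesteps, features) matrices."""
    return np.dot(left, right.T)

def diagonal_ratio(matrix):
    """Crude similarity: sum of the diagonal elements over the sum of all elements."""
    total = matrix.sum()
    return float(np.trace(matrix) / total) if total != 0 else 0.0

# Toy one-hot inputs (hypothetical): two 3-word sentences over a 4-word vocabulary
left = np.eye(4)[[0, 1, 2]]    # word ids 0, 1, 2
right = np.eye(4)[[0, 1, 3]]   # word ids 0, 1, 3
sim = similarity_matrix(left, right)
score = diagonal_ratio(sim)
```

Note that the measure only rewards mass on the diagonal, so it implicitly assumes that matching words sit at the same position in both sentences.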

The sequence of heatmaps below shows the outputs for a network trained for 10 epochs with a training accuracy of 0.8, validation accuracy of 0.7 and test accuracy of 0.4. The sentence pair that generated these outputs is as follows:

Left: A man is riding a bicycle.
Right: A man is riding a bike.
Score: 5.0

Next, we consider a slightly less similar (according to the score label) sentence pair as follows:

Left: A woman is playing the flute.
Right: A man is playing the flute.
Score: 2.4

Finally, we consider a pair of sentences which are even more dissimilar.

Left: A man is cutting a potato.
Right: A woman is cutting a tomato.
Score: 1.25

In all cases, the heatmap for the input is self-explanatory, since common words are down the diagonal. The output of the embedding step also kind of makes sense, since bicycle and bike in the first case, man and woman in the second and third cases, and potato and tomato in the third case show a non-zero resemblance. In all cases, the resulting sentence matrix (output of the encoding step) results in a blurry blob indicating the similarity between the two sentences in the pair. I did expect the alignments to be more meaningful - in all 3 cases above, there doesn't seem to be a meaningful pattern. Since the attention output is dependent on the alignment, there is no meaningful pattern there either.

Computing the alignment against the embedding output, and using it to weight the encoding output to produce the attention output, results in slightly more meaningful patterns. For example, in all 3 cases, the terminating period seems to be unimportant. Strangely, common words seem to hold less importance than I would have expected. Sadly, though, my crude measure of similarity does not match up with the labels, regardless of which pair of outputs I use for my alignment.

Here is the notebook that renders these visualizations, and here is the notebook to build the pre-trained model on which the visualization is based. I used a combination of model.predict() to generate outputs of sub-networks, as well as extracting the trained weights from the model, and applying numpy operations to get results.

That's all I have for today, hope you found it interesting.

Saturday, July 22, 2017

The Benefits of Attention for Document Classification

A couple of weeks ago, I presented Embed, Encode, Attend, Predict - applying the 4 step NLP recipe for text classification and similarity at PyData Seattle 2017. The talk itself was inspired by the Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models blog post by Matthew Honnibal, creator of the spaCy Natural Language Processing (NLP) Python toolkit. In it, he posits that any NLP pipeline can be constructed from these 4 basic operations and provides examples from two of his use cases. In my presentation, I use his recipe to construct deep learning pipelines for two other processes - document classification and text similarity.

Now I realize that it might seem a bit pathetic to write a blog post about a presentation about someone else's blog post. But the reason I even came up with the idea for the presentation was because Honnibal's idea of using these higher level building blocks struck me as being so insightful and generalizable that I figured that it would be interesting to use it on my own use cases. And I decided to do the blog post because I thought that the general idea of abstracting a pipeline using these 4 steps would be useful to people beyond those who attended my talk. I also hope to provide a more in-depth look at the Attend step here than I could during the talk due to time constraints.

Today, I cover only my first use case of document classification. As those of you who attended my talk would recall, I did not get very good results for the second and third use cases around document and text similarity. I have a few ideas that I am exploring at the moment. If they are successful, I will talk about them in a future post.

The 4 step recipe

For those of you who are not aware of the 4-step recipe, I refer you to Honnibal's original blog post for the details. But if you would rather just get a quick refresher, the 4 steps are as follows:

  • Embed - converts an integer into a vector. For example, a sequence of words can be transformed through vocabulary lookup to a sequence of integers, each of which could be transformed into a fixed size vector represented by the word embedding looked up from third party embeddings such as word2vec or GloVe.
  • Encode - converts a sequence of vectors into a matrix. For example, a sequence of vectors representing some sequence of words such as a sentence, could be sent through a bi-directional LSTM to produce a sentence matrix.
  • Attend - reduces the matrix to a vector. This can be done by passing the matrix into an Attention mechanism that captures the most salient features of the matrix, thus minimizing the information loss during reduction.
  • Predict - reduces a vector to an integer label. This would correspond to a fully connected prediction layer that takes a vector as input and returns a single classification label.
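To make the shape changes in the four steps concrete, here is a minimal numpy sketch. Everything in it is an assumption made for illustration: the weights are random, the encode function is a running-average stand-in for a bidirectional LSTM, and the attend function is a plain average rather than a learned Attention mechanism.

```python
import numpy as np

rng = np.random.RandomState(42)
VOCAB_SIZE, EMBED_SIZE, NUM_CLASSES = 10, 4, 3

# Hypothetical parameters, randomly initialized purely for illustration
embedding = rng.randn(VOCAB_SIZE, EMBED_SIZE)
W_pred = rng.randn(2 * EMBED_SIZE, NUM_CLASSES)

def embed(word_ids):
    # Embed: integers -> vectors, shape (timesteps, EMBED_SIZE)
    return embedding[word_ids]

def encode(vectors):
    # Encode: sequence of vectors -> matrix, shape (timesteps, 2 * EMBED_SIZE).
    # Stand-in for a bidirectional LSTM: each row combines the mean of everything
    # up to that position (forward) with the mean of everything after it (backward).
    steps = np.arange(1, len(vectors) + 1)[:, None]
    forward = np.cumsum(vectors, axis=0) / steps
    backward = (np.cumsum(vectors[::-1], axis=0) / steps)[::-1]
    return np.concatenate([forward, backward], axis=1)

def attend(matrix):
    # Attend: matrix -> vector, here the naive version (average over time)
    return matrix.mean(axis=0)

def predict(vector):
    # Predict: vector -> integer label
    return int(np.argmax(np.dot(vector, W_pred)))

word_ids = [1, 5, 7]
vectors = embed(word_ids)
matrix = encode(vectors)
summary = attend(matrix)
label = predict(summary)
```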

Of these steps, all but the Attend step are adequately implemented by most Deep Learning toolkits. My examples use Keras, a Python deep learning library. In Keras, the Embed step is represented by the Embedding layer where you initialize the weights from an external embedding; the Encode step can be implemented using an LSTM layer wrapped in a Bidirectional wrapper; and the Predict step is implemented with a Dense layer.

Experiment: Document Classification

These steps can be thought of as large logical building blocks for our NLP pipeline. A pipeline can be composed of zero or more of these steps. It is also important to realize that each of these steps has a naive, non deep learning equivalent. For example, the Embed step can be done using one-hot vectors instead of third party word embeddings; the Encode step can be done by just concatenating the vectors along their short axis; the Attend step can be done by averaging the component word vectors; and the Predict step can use an algorithm other than deep learning. Since I wanted to see the effect of each of these steps separately, I conducted the following set of experiments - the links lead out to Jupyter notebooks on Github.

The data for this experiment comes from the 20 newsgroups dataset. It comes as part of scikit-learn's datasets package. It is a collection of around 18,000 newsgroup postings pre-categorized into one of 20 newsgroups. Our objective is to build a classifier (or classifiers) that can predict the document's newsgroup category from its text.

  • Embed and Predict (EP) - Here I treat a sentence as a bag of words and a document as a bag of sentences. So a word vector is created by looking it up against a GloVe embedding, a sentence vector is created by averaging its word vectors, and a document vector is created by averaging its sentence vectors. The resulting document vector is fed into a 2 layer Dense network to produce a prediction of one of 20 classes.
  • Embed, Encode and Predict (EEP) - We use a document classification hierarchy as described in this paper by Yang, et al.[1]. Specifically, a sentence encoder is created that transforms integer sequences (from words in sentences) into a sequence of word vectors by looking up GloVe embeddings, then converts the sequence of word vectors to a sentence vector by passing it through a Bidirectional LSTM and capturing the context vector. This sentence encoder is embedded into the document network, which takes in a sequence of sequences of integers (representing a sequence of sentences or a document). The sentence vectors are passed into a Bidirectional LSTM encoder that outputs a document vector, again by returning only the context vector. This document vector is fed into a 2 layer Dense network to produce a category prediction.
  • Embed, Encode, Attend and Predict #1 (EEAP#1) - In this network, we add an Attention layer in the sentence encoder as well as in the Document classification network. Unlike the previous network, the Bidirectional LSTM in either network returns the full sequences, which are then reduced by the Attention layer. This layer is of the first type as described below. Output of the document encoding is a document vector as before, so as before it is fed into a 2 layer Dense network to produce a category prediction.
  • Embed, Encode, Attend and Predict #2 (EEAP#2) - The only difference between this network and the previous one is the use of the second type of Attention mechanism as described in more detail below.
  • Embed, Encode, Attend and Predict #3 (EEAP#3) - The only difference between this network and the previous one is the use of the third type of Attention mechanism. Here the Attention layer is fed with the output of the Bidirectional LSTM as well as the output of a max pool operation on the sequence to capture the most important parts of the encoding output.

The results of the experiment are as follows. The interesting values are the blue bars, which represent the accuracy reported by each trained model on the 30% held out test set. As you would expect, the Bag of Words (EP) approach yields the worst results, around 71.4%, which goes up to 77% once we replace the naive encoding with a Bidirectional LSTM (EEP). All the models with Attention outperform these two models, and the best result is around 82.4% accuracy with the first Attend layer (EEAP#1).

Attention Mechanisms

I think one reason Keras doesn't provide an implementation of Attention is that different researchers have proposed slightly different variations. For example, the only toolkit I know that offers Attention implementations is TensorFlow (LuongAttention and BahdanauAttention), but both are in the narrower context of seq2seq models. Perhaps a generalized Attention layer is just not worth the trouble given all the variations and maybe it is preferable to build custom one-offs yourself. In any case, I ended up spending quite a bit of time understanding how Attention worked and how to implement it myself, which I hope to also share with you in this post.

Honnibal's blog post also offers a taxonomy of different kinds of attention. Recall that the Attend step is a reduce operation, converting a matrix to a vector, so the following configurations are possible.

  • Matrix to Vector - proposed by Raffel, et al.[2]
  • Matrix to Vector (with implicit context) - proposed by Lin, et al.[3]
  • Matrix + Vector to Vector - proposed by Cho, et al.[4]
  • Matrix + Matrix to Vector - proposed by Parikh, et al.[5]

Of these, I will cover the first three here since they were used for the document classification example. References to the papers where these were proposed are provided at the end of the post. I have tried to normalize the notation across these papers so it is easier to talk about them in relation to each other.

I ended up implementing them as custom layers, although in hindsight, I could probably have used Keras layers to compose them as well. However, that approach can be complex if your attention mechanism is complicated. If you want an example of how to do that, take a look at spaCy's implementation of decomposable attention used for sentence entailment.

There are many blog posts and articles that talk about how Attention works. By far the best one I have seen is this one from Heuritech. Essentially, the Attention process involves combining the input signal (a matrix) with some other signal (a vector) to find an alignment that tells us which parts of the input signal we should pay attention to. The alignment is then combined with the input signal to produce the attended output. Personally, I have found that it helps to look at a flow diagram to see how the signals are combined, and the equations to figure out how to implement the layer.

Matrix to Vector (Raffel)

This mechanism is a pure reduction operation. The input signal is passed through a tanh and a softmax to produce an alignment matrix. The dot product of the alignment and the input signal is the attended output.

One thing to note here is the presence of the learnable weights W and b. The idea is that the component will learn these values so as to align the input based on the task it is being trained for.

The code for this layer can be found in class AttentionM in the custom layer code.
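For readers who prefer code to diagrams, the computation described above can be sketched in plain numpy for a single (unbatched) input. This is not the AttentionM layer itself; the function name, the weight shapes (W as a vector of size d, b as a vector of size T) and the random inputs are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

def attention_m(h, W, b):
    """Matrix to vector attention sketch.

    h: (T, d) encoder output, W: (d,) and b: (T,) learnable weights.
    Returns the attended vector (d,) and the alignment weights (T,).
    """
    scores = np.tanh(np.dot(h, W) + b)   # one score per timestep
    alpha = softmax(scores)              # alignment weights, sum to 1
    context = np.dot(alpha, h)           # weighted sum of the timesteps
    return context, alpha

rng = np.random.RandomState(0)
T, d = 5, 4
h = rng.randn(T, d)
context, alpha = attention_m(h, rng.randn(d), np.zeros(T))
```

With W and b set to zero every timestep gets the same weight, so the layer falls back to the naive Attend step of averaging over time. Training moves W and b away from that baseline.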

Matrix to Vector (Lin)

This mechanism is also a pure reduction operation, since the input to the layer is a matrix and the output is a vector. However, unlike the previous mechanism, it learns an implicit context vector u, in addition to W and b, as part of the training process. You can see this by the presence of a u vector entering the softmax and in the formula for αt.

Code for this Attention class can be found in the AttentionMC class in the custom layer code.
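A numpy sketch of one common way to write this mechanism follows, again for a single unbatched input. It is not the AttentionMC layer itself; the function name and the weight shapes (W as a d by d matrix, b and u as vectors of size d) are my assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

def attention_mc(h, W, b, u):
    """Matrix to vector attention with a learned (implicit) context vector.

    h: (T, d) encoder output; W: (d, d), b: (d,), u: (d,) learnable weights.
    Returns the attended vector (d,) and the alignment weights (T,).
    """
    hidden = np.tanh(np.dot(h, W) + b)   # (T, d) hidden representation
    scores = np.dot(hidden, u)           # similarity of each timestep to u
    alpha = softmax(scores)              # alignment weights, sum to 1
    context = np.dot(alpha, h)           # weighted sum of the timesteps
    return context, alpha

rng = np.random.RandomState(0)
T, d = 5, 4
h = rng.randn(T, d)
context, alpha = attention_mc(h, rng.randn(d, d), np.zeros(d), rng.randn(d))
```

The difference from the previous sketch is that the score for each timestep is a dot product against u, so u acts like a learned query asking which timesteps are informative.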

Matrix + Vector to Vector (Cho)

Unlike the previous two mechanisms, this takes an additional context vector that is explicitly provided along with the input signal matrix from the Encode step. This can be a vector that is generated by some external means that is somehow representative of the input. In our case, I just took the max pool of the input matrix along the time dimension. The process of creating the alignment vector is the same as the first mechanism. However, there is now an additional weight that learns how much weight to give to the provided context vector, in addition to the weights W and b.

Code for this Attention class can be found in the AttentionMV class in the code for the custom layers.
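Here is a comparable numpy sketch for this mechanism, with the context vector built by max pooling over the time dimension as described above. It is not the AttentionMV layer itself; the function name and the weight shapes, in particular U as a d by T matrix that turns the context vector into one offset per timestep, are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

def attention_mv(h, c, W, U, b):
    """Matrix + vector to vector attention sketch.

    h: (T, d) encoder output, c: (d,) externally supplied context vector.
    W: (d,), U: (d, T), b: (T,) learnable weights (shapes are assumptions).
    Returns the attended vector (d,) and the alignment weights (T,).
    """
    scores = np.tanh(np.dot(h, W) + np.dot(c, U) + b)   # one score per timestep
    alpha = softmax(scores)                             # alignment weights
    context = np.dot(alpha, h)                          # weighted sum of timesteps
    return context, alpha

rng = np.random.RandomState(0)
T, d = 5, 4
h = rng.randn(T, d)
c = h.max(axis=0)   # max pool of the input matrix along the time dimension
context, alpha = attention_mv(h, c, rng.randn(d), rng.randn(d, T), np.zeros(T))
```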

As you may have noticed, the code for the various custom layers is fairly repetitive. We declare the weights in the build() method and the computations with the weights and signals in the call() method. In addition, we support input masking via the presence of the compute_mask() method. The get_config() method is needed when trying to save and load the model. Keras provides some guidance on building custom layers, but a lot of the information is scattered around in Keras issues and various blog posts. The Keras website is notable, among other things, for the quality of its documentation, but somehow custom layers haven't received the same kind of love and attention. I am guessing that perhaps it is because this is closer to the internals and hence more changeable, so harder to maintain, and also once you are doing custom layers, you are expected to be able to read the code yourself.

So there you have it. This is Honnibal's 4-step recipe for deep learning NLP pipelines, and how I used it for one of the use cases I talked about at PyData. I hope you found the information about Attention and how to create your own Attention implementations useful.


  1. Yang, Z, et al (2016). Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT (pp. 1480-1489).
  2. Raffel, C, & Ellis, D. P (2015). Feed-forward networks with attention can solve some long term memory problems. arXiv preprint arXiv:1512.08756.
  3. Lin, Z., et al. (2017). A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
  4. Cho, K, et al. (2015). Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11), 1875-1886.
  5. Parikh, A. P., et al (2016). A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.

Wednesday, July 12, 2017

Trip Report: PyData Seattle 2017

Last week I attended (and presented at) PyData Seattle 2017. Over time, Python has morphed from a scripting language, to a platform for scientific computing, and lately to pretty much a standard language for most aspects of Machine Learning (ML) and Artificial Intelligence (AI), including Deep Learning (DL). PyData conferences cater mostly to the last demographic. Even though it's not really a place where you go to learn about the state of the art, it's still a great place to catch up with what others in industry are doing with Python and ML/AI/DL. PyData conferences are usually 2-3 day affairs, and they happen multiple times a year, at different places all over the world, organized by local groups of Python/ML enthusiasts.

The conference was 3 days long, one day of tutorials followed by 2 days of presentations. It was held at the Microsoft Campus - along with the conference center, Microsoft also sponsored the food. I stayed at the Hyatt Regency Bellevue, their "preferred" hotel - initially I thought that it meant they would have a shuttle service to and from the conference, but it was because of lower negotiated room rates for conference attendees. But thanks to ridesharing services such as Lyft, I had no problems getting around.

So anyway, here is my trip report... there were 4 simultaneous tracks so there are things I missed because I wanted to attend something even better. Fortunately, there are videos of the talks and the organizers are collecting the slides from the speakers, and all of this will be made available in 2-3 weeks. I will update the links to slides and video when that happens.

Day 1 (July 5, 2017)

pomegranate: fast and flexible probabilistic modeling in python - Maxwell W Libbrecht

I first came across the pomegranate library at PyData Amsterdam last year, where it was mentioned as a package containing several probabilistic graphical models (PGMs), and specifically to build Bayesian Networks. I read the docs in advance, and it turns out that it also contains Hidden Markov Models, General Mixture Models and Factor Graphs. The talk itself was mostly a walkthrough of its capabilities using this notebook. Like many other ML packages in the Python (and other) ecosystem, pomegranate's API is modeled after that of scikit-learn. The examples were very cool, and left me itching for a problem that I might be able to solve using pomegranate :-).

Vocabulary Analysis of Job Descriptions - Alex Thomas

Alex Thomas led us through analyzing the vocabulary of job descriptions. The objective is to extract attributes from the text of job descriptions, which might be used as structured features for these descriptions for downstream tasks. He started with basic ideas like TF-IDF, finding multi-word candidates, using stopwords and extending them, stemming and lemmatizing. Evaluation is done by manually segmenting the job description dataset into different levels of experience required, and building word clouds of the analyzed vocabulary to see if they line up with expectations. All in all, a very useful refresher for things we tend to take for granted with readily available text processing toolkits. What I liked most about this tutorial is that I came away with a subset of tools and ideas that I could use to analyze a vocabulary end-to-end. The github repository for the tutorial is available in case you want to follow along.

Day 2 (July 6, 2017)

Morning Keynote - Data Science in Production - Katrina Riehl

Katrina Riehl gave the keynote. As an experienced data scientist, she recounted her time in defense and other industries before she arrived at HomeAway, and the ML challenges she works on there. One thing she touched upon was the problem of deploying ML solutions - their main problem is that Python programs are generally not as performant as Java or C/C++ based solutions. Initially they would build and train a Python model, then hand convert to Java or C/C++. Later they looked at PMML - the idea was to train a Python model, then specify its weights and structure and use PMML to instantiate the identical model in Java for production. But this didn't work because of limited availability of PMML aware models and because models in different toolkits have minor differences which break the interop. So finally they settled on microservices - build models in Python, wrap them in microservices, load balance multiple microservices, and consume them from Java based production code. They use protocol buffers for high performance RPC.

Using Scattertext and the Python NLP Ecosystem for Text Visualization - Jason Kessler

Jason Kessler talks about his Scattertext package, which is designed to visualize how different words are used in different classes in a dataset. The visualization is for words or phrases mentioned by Democrats or Republicans during the 2012 elections. He uses a measure called scaled F-score, which achieves a very nice separation of words. He shows other ways you can drill down deeper into the word associations using the Scattertext API. Overall quite an interesting way to visualize word associations. Jason will also present Scattertext at ACL 2017; here is a link to his paper.

Automatic Citation generation with Natural Language Processing - Claire Kelley and Sarah Kelley

The presenters described two methods for finding similar patents using the US Patent database. The first approach vectorizes the documents using TF-IDF and uses cosine similarity as the similarity metric to find the 20 most similar patents for each patent. The results are compared with the patents already cited, and on average, 10 of the 20 suggested similar patents are already cited in the original patent. The second approach tries to build a recommender by factorizing a patent/citation co-occurrence matrix using the Alternating Least Squares (ALS) method for Collaborative Filtering to generate a ranked list of patent recommendations for each patent. The number of latent factors used was 60. Because recommendations are not necessarily similar documents, an objective evaluation using cited patents is not possible, but the recommendations were found to be quite good when spot checked for a small set of patents. Both approaches work on subsets of the patent dataset that are optimized for the category under investigation. Most of the data ingestion and pre-processing was done using Google BigQuery and the ML work was done using Spark ML.
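The first approach (TF-IDF vectors plus cosine similarity to find the most similar documents) can be sketched in a few lines of numpy. This is only an illustration of the technique, not the presenters' BigQuery and Spark ML pipeline. The whitespace tokenizer, the smoothed IDF formula, the function names and the four toy documents are all my own assumptions.

```python
import numpy as np
from collections import Counter

def tfidf_matrix(docs):
    """Build a (num_docs, vocab_size) TF-IDF matrix with a naive whitespace tokenizer."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for row, toks in enumerate(tokenized):
        for tok, count in Counter(toks).items():
            tf[row, index[tok]] = count
    df = (tf > 0).sum(axis=0)                             # document frequency per term
    idf = np.log((1.0 + len(docs)) / (1.0 + df)) + 1.0    # smoothed IDF (an assumption)
    return tf * idf

def most_similar(matrix, doc_id, k):
    """Return the ids of the k documents most cosine-similar to doc_id."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = matrix / norms
    sims = np.dot(unit, unit[doc_id])
    sims[doc_id] = -np.inf                                # exclude the query document
    return [int(i) for i in np.argsort(-sims)[:k]]

# Toy stand-ins for patent abstracts (hypothetical)
docs = [
    "rotary engine with improved fuel injection",
    "fuel injection system for rotary engine",
    "method for brewing coffee quickly",
    "coffee brewing apparatus with timer",
]
tfidf = tfidf_matrix(docs)
neighbors = most_similar(tfidf, 0, 2)
```

The evaluation described in the talk would then amount to counting how many of the returned ids also appear in the query patent's list of citations.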

Online Change Point Detection with Spark Streaming - Michal Monselise

The presenter describes a method that she used to find temperature anomalies in a stream of temperature data. The idea is to look at a fixed size window and fit a distribution to it. An anomaly is detected when a window is encountered whose distribution does not have the same parameters as the fitted distribution of the previous window.
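The windowing idea can be sketched in plain numpy as follows. This is not the presenter's Spark Streaming code. As assumptions for illustration, I fit a Gaussian (mean and standard deviation) to each window, use non-overlapping windows, and flag a change when the mean of the current window is more than a chosen number of standard errors away from the mean of the previous one.

```python
import numpy as np

def detect_change_points(values, window=20, threshold=3.0):
    """Return the start index of every window whose mean differs from the
    previous window's fitted mean by more than threshold standard errors."""
    values = np.asarray(values, dtype=float)
    change_points = []
    for start in range(window, len(values) - window + 1, window):
        prev = values[start - window:start]
        curr = values[start:start + window]
        # Fit a Gaussian to the previous window
        mu, sigma = prev.mean(), prev.std(ddof=1)
        stderr = max(sigma, 1e-8) / np.sqrt(window)
        if abs(curr.mean() - mu) > threshold * stderr:
            change_points.append(start)
    return change_points

# Synthetic temperature stream (hypothetical): 40 readings around 20 degrees,
# followed by 40 readings around 30 degrees
wiggle = np.array([0.5 if i % 2 == 0 else -0.5 for i in range(40)])
series = np.concatenate([20.0 + wiggle, 30.0 + wiggle])
change_points = detect_change_points(series)
```

A real streaming version would keep only the parameters of the previous window in memory rather than the whole series.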

Afternoon Keynote: PyData 101 - Jake Vanderplas

Jake Vanderplas is well known in the open source Python community, and his keynote covered the evolution of Python from a scripting language (replacement for bash), to a platform for scientific computing (replacement for MATLAB), and now to a platform for data science (replacement for R). He also covered the tools that someone who wants to do data science in Python should look at. Many of these were familiar to me - numpy, pandas, scikit-learn, numba, cython, etc. - and some were not, for example dask. He also briefly touched upon Python as the de-facto language for most deep learning toolkits. I thought this talk was both interesting and useful. Even though I was familiar with many of the packages he listed, I came away learning about a couple I didn't know and that I think might be good for me to check out.

In-database Machine Learning with Python in SQL Server - Sumit Kumar

Sumit Kumar of Microsoft showed off new functionality in Microsoft SQL Server that allows you to embed a trained Python machine learning model inside a stored procedure. Unlike the traditional model of pulling data out of the database and then running your trained model on it and writing back the predictions, this approach allows you to run the model on the same server as the database, minimizing network traffic. Microsoft also has tooling in its IDE that loads/reloads the model code automatically during development. The Python code is run in its own virtual machine separate from the database, so problems with the model will not crash the server.

Applying the four step "Embed, Encode, Attend, Predict" framework for text classification and similarity - Sujit Pal

This was my presentation. I spoke about the 4-step recipe for Natural Language Processing (NLP) proposed by Matthew Honnibal, creator of the spaCy NLP toolkit, and described three applications around document classification, document similarity and sentence similarity, where I used this recipe. I also covered Attention in some depth. You can find the code and the slides for the talk at these links. I used the Keras deep learning toolkit for my models, so of the four steps, only the Attend step does not correspond directly to a Keras provided layer. I plan to write in more detail about my Attention implementations in a subsequent blog post.

I was pleasantly surprised at the many insightful questions I got from the audience during and after the talk. I also made a few friends and had detailed conversations around transfer learning, among other things. I also got a very nice demo of a system which automatically learns taxonomies from text which I thought was very interesting.

Chatbots - Past, Present and Future - Dr Rutu Mulkar-Mehta

This was a fairly high level talk but very interesting to me, since I know almost nothing about chatbots and because it may be one of the most obvious places to use NLP. Typically chatbot designers tend to outsource the NLP analysis and concentrate on the domain expertise, so a number of chatbot platforms have come up that cater to this need, with varying degrees of sophistication. Some examples are Chatterbot, API.AI, etc. She talked about the need to extract features from the incoming text in order to feed machine learning classifiers at each stage in the chatbot pipeline for it to decide how to respond. In all, a nice introduction to chatbots, seen as a pipeline of NLP and ML components.

PyData "Pub" Quiz - Steve Dower and James Powell

To end the day, we had a 6-part quiz on Python, conducted by the inimitable James Powell. Many of the questions had to do with Python 3 features, Monty Python, and esoteric aspects of the Python language, so not surprisingly, I did not do too well. About the only thing I could answer were the two features of Python 3 that I always import from __future__ - the print_function and Python 3 style division, and some calls in matplotlib and scikit-learn. But I did learn a lot of things that I didn't know before, always a good thing.

There was a social event after this with food and drink. I hung around for a while and had some interesting conversations, then decided to call it a day.

Day 3 (July 7, 2017)

Morning Keynote - Accelerating AI Development - Joseph Sirosh

Joseph Sirosh of Microsoft talked about all the cool things that Microsoft is doing with Python and Machine Learning. He brought in various Microsoft employees to do brief demos of their work.

Medical Image Processing using Microsoft Deep Learning Framework (CNTK) - Naoto Usuyama and Jessica Lundin

Jessica started the presentation off by talking about a newly created Health division inside Microsoft that works as a startup within the parent company. This startup is doing many cool things in the Health and Medical spaces. After that Naoto talked about how he used CNTK to train models for the Diabetes Retinopathy and Lung Cancer challenges from Kaggle. His notebooks for both challenges are available on his pydata-medical-image repository on Github. I had been curious about CNTK but had never seen it in action, so this was interesting to me. Also I found the approach to preprocessing of the Lung Cancer dataset (DICOM) images interesting.

Learn to be a painter using Neural Style Painting - Pramit Choudhary

I find the whole idea of using neural networks to produce images that look like LSD induced hallucinations quite fascinating. However, while building models that generate these images, there have been certain aspects which I have kind of glossed over. One of the things the presenter briefly touched upon was the transformations on the content and style images before they are merged - this was one of those things I had glossed over earlier, so I figured I would ask him, but there was no time. So I decided to look it up myself, and this video of Leon Gatys's presentation at CVPR 2016 provides the clearest description I have seen so far. The presenter went on to explain how he used Spark to optimize the hyperparameters for the style transfer. He also gave me a jacket (his employer) for answering a very simple question correctly.

Scaling Scikit-Learn - Stephen Hoover

The presenter described the data science toolbox his company Civis Analytics markets and the reasoning behind various features. The toolbox is based on the AWS components Redshift, S3 and EC2. They have made some custom GridSearch components that leverage these different components. He also briefly described how one can build their own joblib based custom implementation of parallel_backend. The solution appears to be optimized for parallel batch operation against trained models during prediction. I thought it might be somewhat narrow in its scope, and based on a question from the audience at the end of the talk, think that it may be worth taking a look at Dask to see if some scaling issues may be solved more generally using it.

There were some more talks after lunch which I had to miss because I needed to catch my flight home, since I had already committed (before the talk proposal was accepted) to my family that I would be somewhere else starting Friday evening and through the weekend. So I will watch the videos for the ones I missed once the organizers release them. I will also update the links on this post once that happens.

Edit 2017-07-24 - Added links to recordings of the various talks I attended, from combination of PyData Seattle 2017 Youtube playlist and smaller set of links on MSDN Channel 9. Strangely enough, the two CNTK talks don't seem to have made it into the playlist, I found Naoto and Jessica's talk by searching on Youtube, and I couldn't find Dave DeBarr's talk on CNTK's Python interface. Maybe it's just an oversight.

Sunday, June 18, 2017

Trip Report: Graph Day SF 2017

Yesterday I attended the Graph Day SF 2017 conference. Lately, my interest in graphs has been around Knowledge Graphs. Last year, I worked on a project that used an existing knowledge graph and entity-pair versus relations co-occurrences across a large body of text to predict new relations from the text. Although we modeled the co-occurrence as a matrix instead of a graph, I was hoping to learn techniques that I could apply to graphs. One other area of recent interest is learning how to handle large graphs.

So anyway, that was why I went. In this post, I describe the talks I attended. The conference was 1 day only, and there were 4 parallel tracks of very deep, awesome talks, and there were at least 2 that I would have liked to attend but couldn't because I had to make a choice.

Keynote - Marko Rodriguez, Datastax.

I have always thought of graph people as being somewhat more intellectual than mere programmers, starting with the classes I took at college. The keynote kind of confirms this characterization. The object of the talk was to refute the common assertion by graph people that everything is a graph. The speaker does this by showing that a graph can be thought of structurally, as a collection of vertices and edges, and also as process, as a collection of functions and streams. Differentiating a graph repeatedly oscillates between the two representations, leading to the conclusion that a graph is infinitely differentiable. Here is the paper on which the talk is based, and here are the slides.

Time for a new Relation: going from RDBMS to graph - Patrick McFadin, Datastax

This talk was decidedly less highbrow compared to the keynote, focusing on why one might want to move from relational to the graph paradigm. The speaker has lots of experience in RDBMS and Tabular NoSQL databases (Cassandra), and is currently making the shift to graph databases. One key insight is that he classifies the different types of database technology in a continuum - Key Value stores, Tabular NoSQL databases, Document NoSQL databases, RDBMS, graph databases. Also, he differentiates between the strengths of an RDBMS and those of a graph database as follows - the RDBMS makes it easy to describe relations, but the graph database makes it easy to find relations. He also looks at Property Graphs as possible drop-in replacements for RDBMS tables. He also pointed out a free learning resource DS330: DataStax Enterprise Graph, which seems likely to be product specific, although the introductory video suggests that there is some product agnostic content around data modeling.

Comparing Giraph and GraphX - Jenny Zhao, Drawbridge

Drawbridge's business is to disambiguate your devices from other people's, using their activity logs. In this particular presentation, they describe how they switched from using map-reduce for their feature selection process to using Apache Giraph and saved about 8 hours of processing time. Instead of writing out the pair data, then doing a pairwise compare followed by a pairwise join, they ingest the paired data as a graph and compute distances on the edges to find the best pairs for their downstream process. They also tried Spark GraphX but they found it doesn't scale as well to large data volumes. Code using GraphX and Giraph is also shown to highlight an important difference between the two.

Graphs in Genomics - Jason Chin, Pacific Biosciences

Interesting presentation about the use of graphs in the field of genomics. The human genome is currently not readable in its entirety, so it is cut into many pieces of random length and resequenced. One possibility is to represent it as 23 bipartite graphs, one for each of our 23 chromosomes. The presentation then focuses on how researchers use graph theory to fill in gaps between the pieces of the genome. Here is a link to an older presentation by the same presenter which covers much of the same material as this talk. I will update with the current presentation when it becomes available.

Knowledge Graph in Watson Discovery - Anshu Jain and Nidhi Rajshree, IBM

The talk focuses on lessons learned while the presenters were building the knowledge graph for IBM Watson. I thought this was a good mix of practical ideas and theory. One thing I found particularly noteworthy was including surprise as a parameter - the user can specify a parameter that indicates his willingness to see serendipitous results. Another one is keeping the Knowledge Graph lighter and using it to finetune queries at runtime (local context) rather than baking it in at creation time (global context). Thus you are using the Knowledge Graph itself as a context vector. Yet another idea is using Mutual Information as a similarity metric for the element of surprise (useful in intelligence and legal work) since it treats noise equally for both documents. Here is the link to the presentation slides.

A Cognitive Knowledge Base as an Enterprise Database - Haikal Pribadi, GRAKN.AI

The presenter showcases his product GRAKN.AI (sounds like Kraken), which is a distributed knowledge base with a reasoning query language. It was awarded product of the year for 2017 by University of Cambridge Computer Lab. It has a unified syntax that allows you to define and populate a graph and then query it. The query language feels a bit like Prolog, but much more readable. It is open source and free to use. I was quite impressed with this product and hope to try it soon. One other thing I noted in his presentation was the use of the DeepDive project for knowledge acquisition, which is a nice confirmation since I am looking at its sister project snorkel for a similar use case.

Graph Based Taxonomy Generation - Rob McDaniel, LiveStories

The presenter describes building a taxonomy from queries. The resulting taxonomies are focused on a small area of knowledge, and can be useful for building custom taxonomies for applications focused on a specific domain. Examples mentioned in the presentation were “health care costs” and “poisoning deaths”, produced as a result of using his approach. The idea is to take a group of (manually created) seed queries about a given subject, hit some given search engine using an API, and collect the top N documents for each query. You then do topic modeling on these documents and generate a document-topic co-occurrence graph (using only topics that have p(topic|document) above a certain threshold). You then partition the graph into subgraphs using an iterative partitioning strategy of coarsening, bisecting and un-coarsening. The graph partitioning algorithm covered in the presentation was Heavy Edge Matching, but other partitioning algorithms could be used as well. Once the partitions are stable, the node with the highest degree of connectedness in each partition becomes the root level element in the taxonomy. This node is then removed from the subgraph and the subgraph is partitioned recursively again into its own subgraphs, until the number of topics in a partition is less than some threshold. The presentation slides and code are available.

Project Konigsburg: A Graph AI - Gunnar Kleemann and Denis Vrdoljak, Berkeley Data Science Group

The presenters describe a similarity metric based on counting triangles and wedges (subgraph motifs) that seems to work better with connected elements in a graph than more traditional metrics. They use this approach to rank features for feature selection. They have used this metric to build a and rank academics from a citation network extracted from Pubmed. They have also used this metric in several applications that focus on recruiting from the applicant side (resume building, finding the job that best suits your profile, etc).
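
As a rough illustration of what counting these motifs involves, here is a minimal sketch in plain Python. It is not the presenters' metric (the talk summary above does not specify it), and the function name and the toy edge list are made up for this example. It only counts closed triangles and open wedges (two edges sharing a node whose endpoints are not connected) in a small undirected graph.

```python
from itertools import combinations

def count_triangles_and_wedges(edges):
    """Count closed triangles and open wedges in an undirected graph."""
    # build an adjacency map: node -> set of neighbours
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    closed = 0  # neighbour pairs that are themselves connected
    open_wedges = 0  # neighbour pairs that are not connected
    for node, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj[a]:
                closed += 1
            else:
                open_wedges += 1
    # each triangle is seen once from each of its three corners
    return closed // 3, open_wedges

# hypothetical toy graph: one triangle (a, b, c) with a tail c - d
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
triangles, wedges = count_triangles_and_wedges(edges)
print(triangles, wedges)  # 1 2
```

Any similarity score built on top of such counts would be a further design choice that this sketch does not attempt to reproduce.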

Knowledge Graph Platform: Going beyond the database - Michael Grove, Stardog

This was a slightly high level talk by the CTO of Stardog. He outlined what people generally think about when they say Enterprise Knowledge Graph Platforms and the common fallacies in these definitions.

There were two presentations I missed because 4 tracks ran in parallel, and I had to choose between awesome presentations scheduled at the same time.

  • DGraph: A native, distributed graph database - Manish Jain, Dgraph Labs.
  • Start Flying with Apache and Tinkerpop - Jason Plurad, IBM

Overall, I thought the conference had really good talks, the venue was excellent, and the event was very well organized. There was no breakfast or snacks, but there was coffee and tea, and the lunch was delicious. One thing I noticed was the absence of video recording, so unfortunately there are not going to be any videos of these talks. There were quite a few booths, mostly graph database vendors. I learned quite a few things here, although I might have learned more if the conference were spread over 2 days and had 2 parallel tracks instead of 4.

Saturday, May 20, 2017

Evaluating a Simple but Tough to Beat Embedding via Text Classification

Recently, a colleague and a reader of this blog independently sent me a link to the Simple but Tough-to-Beat Baseline for Sentence Embeddings (PDF) paper by Sanjeev Arora, Yingyu Liang, and Tengyu Ma. My reader also mentioned that the paper was selected for a mini-review in Lecture 2 of the Natural Language Processing and Deep Learning (CS 224N) course taught at Stanford University by Prof Chris Manning and Richard Socher. For those of you who have taken Stanford's earlier Deep Learning and NLP (CS 224d) course taught by Socher, or the very first Coursera course on Natural Language Processing by Profs Dan Jurafsky and Chris Manning, you will find elements from both in here. There are also some things I think are new or that I might have missed earlier.

The paper introduces an unsupervised scheme for generating sentence embeddings that has been shown to consistently outperform a simple Bag of Words (BoW) approach in a number of evaluation scenarios. The evaluation scenarios considered are both intrinsic (correlating computed similarities of sentence embeddings with human estimates of similarity) as well as extrinsic (using the embeddings for a downstream classification task). I thought the idea was very exciting, since all the techniques I have used to convert word embeddings to sentence embeddings have given results consistent with the complexity used to produce them. At the very low end is the BoW approach, which adds up the embedding vectors for the individual words and averages them over the sentence length. At the other end of the scale is to generate sentence vectors from a sequence of word vectors by training a LSTM and then using it, or by looking up sentence vectors using a trained skip-thoughts encoder.

The Smooth Inverse Frequency (SIF) embedding approach suggested by the paper is only slightly more complicated than the BoW approach, and promises consistently better results than BoW. So for those of us who used BoW as a baseline, this suggests that we should now use the SIF embedding instead. So instead of just averaging the component word vectors, as in this equation for BoW:

$$v_s = \frac{1}{|s|} \sum_{w \in s} v_w$$

we generate the sentence vector $v_s$ by multiplying each component word vector $v_w$ by a smoothed inverse of its probability of occurrence $p(w)$. Here α is a smoothing constant; its default value as suggested in the paper is 0.001. We then sum these weighted word vectors and divide by the number of words $|s|$ in the sentence:

$$v_s = \frac{1}{|s|} \sum_{w \in s} \frac{\alpha}{\alpha + p(w)} v_w$$

Since we do this for all the sentences in our corpus, we now have a matrix where the number of rows is the number of sentences and the number of columns is the embedding size (typically 300). Removing the first principal component from this matrix gives us our sentence embedding. There is also an implementation of this embedding scheme in the YingyuLiang/SIF GitHub repository.
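
To make the whole recipe concrete before getting into the real data, here is a minimal NumPy-only sketch on made-up toy inputs. The function name sif_embeddings, the toy count matrix, the random embedding matrix and the word probabilities are all assumptions for illustration, and np.linalg.svd stands in for whatever SVD routine an actual implementation uses. It is not the code from the YingyuLiang/SIF repository.

```python
import numpy as np

def sif_embeddings(counts, E, probs, alpha=1e-3):
    """Toy SIF: weighted average of word vectors, minus the first principal component."""
    counts = np.asarray(counts, dtype=float)
    # smoothed inverse frequency weight per vocabulary word, shape (V,)
    weights = alpha / (alpha + np.asarray(probs, dtype=float))
    sent_lens = counts.sum(axis=1, keepdims=True)
    sent_lens[sent_lens == 0] = 1.0  # empty sentences stay as zero vectors
    # weighted sum of word vectors per sentence, divided by sentence length
    Xs = (counts * weights) @ E / sent_lens
    # first right singular vector of the (uncentered) sentence matrix
    _, _, vt = np.linalg.svd(Xs, full_matrices=False)
    pc = vt[:1]  # shape (1, d), unit length
    # remove each sentence's projection onto that direction
    return Xs - (Xs @ pc.T) @ pc, pc

# toy inputs: 3 sentences, 4 vocabulary words, 5-dimensional embeddings
counts = np.array([[2, 1, 0, 0],
                   [0, 1, 1, 0],
                   [1, 0, 0, 3]])
rng = np.random.default_rng(0)
E = rng.normal(size=(4, 5))
probs = np.array([0.5, 0.3, 0.15, 0.05])

X_sif, pc = sif_embeddings(counts, E, probs)
print(X_sif.shape)  # (3, 5)
```

After the last step every row of X_sif has (numerically) zero projection onto pc, which is what removing the first principal component means here.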

For my experiment, I decided to compare BoW and SIF vectors by how effective they are when used for text classification. My task is to classify images as compound (i.e, composed of multiple sub-images) versus non-compound (single image, no sub-images) using only the captions. The data comes from the ImageCLEF 2016 (Medical) competition, where Compound Figure Detection is the very first task in the task pipeline. The provided dataset has 21,000 training captions, each about 92 words long on average, and split roughly equally between the two classes. The dataset also contains 3,456 test captions (labels provided for validation purposes).

The label and captions are provided as two separate files, for both training and test datasets. Here is an example of what the labels file looks like:


and the captions files look like this:

12178_2007_9003_Fig1_HTML       An 64-year-old female with symptoms of bilateral lower limb neurogenic claudication with symptomatic improvement with a caudal epidural steroid injection. An interlaminar approach could have been considered appropriate, as well. ( a ) Sagittal view of a T2-weighted MRI of the lumbar spine. Note the grade I spondylolisthesis of L4 on L5 with severe central canal stenosis. ( b ) and ( c ) Axial views of a T2-weighted MRI through L4 รข<80><93> 5. Note the diffuse disc bulge in ( b ) and the marked ligamentum flavum hypertophy in ( c ), both contributing to the severe central stenosis. ( d ) The L5-S1 level showing no evidence of stenosis
12178_2007_9003_Fig3_HTML       Fluoroscopic images of an L3-4 interlaminar approach. ( a ) AP view, pre-contrast, ( b ) Lateral view, pre-contrast, and ( c ) Lateral view, post-contrast
12178_2007_9003_Fig5_HTML       Fluoroscopic images of a right L5-S1 transforaminal approach targeting the right L5 nerve root. ( a ) AP view, pre-contrast and ( b ) AP view, post-contrast

I built BoW and SIF vectors for the entire dataset, using GloVe word vectors. I then used these vectors as inputs to stock Scikit-Learn Naive Bayes and Support Vector Machine classifiers, and measured the test accuracy for various vocabulary sizes. For the word probabilities, I used both native probabilities (i.e., computed from the combined caption dataset) and outside probabilities (computed from Wikipedia, and available in the YingyuLiang/SIF GitHub repository). I then built vocabularies out of the most common N words, computed BoW sentence embeddings, SIF sentence embeddings with native word frequencies, and SIF sentence embeddings with external probabilities (SIF+EP), and recorded the accuracy reported for the two class classification task from the Naive Bayes and Support Vector Machine (SVM) classifiers. Below I provide a breakdown of the steps with code.

The first step is to parse the files and generate a list of training and test captions with their labels.

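def parse_caption_and_label(caption_file, label_file, sep=" "):
    # map each figure filename to its numeric class label
    filename2label = {}
    with open(label_file, "r", encoding="utf-8") as flabel:
        for line in flabel:
            filename, label = line.strip().split(sep)
            filename2label[filename] = LABEL2ID[label]
    # collect captions and their labels in file order
    captions, labels = [], []
    with open(caption_file, "r", encoding="utf-8") as fcaption:
        for line in fcaption:
            filename, caption = line.strip().split("\t", 1)
            captions.append(caption)
            labels.append(filename2label[filename])
    return captions, labels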

TRAIN_CAPTIONS = "/path/to/training-captions.tsv"
TRAIN_LABELS = "/path/to/training-labels.csv"
TEST_CAPTIONS = "/path/to/test-captions.tsv"
TEST_LABELS = "/path/to/test-labels.csv"
LABEL2ID = {"COMP": 0, "NOCOMP": 1}

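# assumes the default separator matches the labels files; pass sep otherwise
captions_train, labels_train = parse_caption_and_label(
    TRAIN_CAPTIONS, TRAIN_LABELS)
captions_test, labels_test = parse_caption_and_label(
    TEST_CAPTIONS, TEST_LABELS)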

Next I build the word count matrix using the captions. For this we use the Scikit-Learn CountVectorizer to do the heavy lifting. We remove stopwords from the counting using the stop_words parameter. At this point Xc is a matrix of word counts of shape (number of training records + number of test records, VOCAB_SIZE). The VOCAB_SIZE is a hyperparameter which we will vary during our experiments.

from sklearn.feature_extraction.text import CountVectorizer

VOCAB_SIZE = 10000
counter = CountVectorizer(strip_accents="unicode",
                          stop_words="english",
                          max_features=VOCAB_SIZE)
caption_texts = captions_train + captions_test
Xc = counter.fit_transform(caption_texts).todense().astype("float")

At this point, we can capture the sentence length vector S (see the formula for v_s) by summing each row of this matrix across its columns.

import numpy as np

sent_lens = np.sum(Xc, axis=1).astype("float")
sent_lens[sent_lens == 0] = 1e-14  # prevent divide by zero

Next we read the pretrained word vectors from the provided GloVe embedding file. We use the version built with Wikipedia 2014 + Gigaword 5 (6B tokens, 400K words and dimensionality 300). The following snippet extracts the vectors for the words in our vocabulary and collects them into the embedding matrix E, one row per vocabulary word.

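# set this to your local copy of the 300-dimensional GloVe 6B file
GLOVE_EMBEDDINGS = "/path/to/glove.6B.300d.txt"

E = np.zeros((VOCAB_SIZE, 300))
with open(GLOVE_EMBEDDINGS, "r", encoding="utf-8") as fglove:
    for line in fglove:
        cols = line.strip().split(" ")
        word = cols[0]
        try:
            i = counter.vocabulary_[word]
            E[i] = np.array([float(x) for x in cols[1:]])
        except KeyError:
            # word is not in our vocabulary, skip it
            pass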

We are now ready to build our BoW vectors. Replacing the term counts with the appropriate vector is just a matrix multiplication, and averaging by sentence length means an element-wise divide by the S vector. Finally we split our BoW sentence embeddings into training and test splits.

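# counts x embeddings gives summed word vectors; divide by sentence length to average
# (np.asarray converts the np.matrix result to a plain array for Scikit-Learn)
Xb = np.asarray(np.divide(np.dot(Xc, E), sent_lens))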

Xtrain, Xtest = Xb[0:len(captions_train)], Xb[-len(captions_test):]
ytrain, ytest = np.array(labels_train), np.array(labels_test)
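
To see why the matrix multiplication does the right thing, here is a tiny sanity check on made-up numbers (a hypothetical 3-word vocabulary with 2-dimensional embeddings, not the GloVe data). It compares the matrix form against averaging the word vectors by hand.

```python
import numpy as np

# hypothetical 3-word vocabulary with 2-dimensional embeddings
E = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])
# two "sentences", given as word counts per vocabulary entry
Xc = np.array([[1.0, 1.0, 0.0],
               [0.0, 2.0, 2.0]])

# matrix form: summed word vectors divided by sentence length
sent_lens = Xc.sum(axis=1, keepdims=True)
Xb = (Xc @ E) / sent_lens

# the same averages computed word by word
manual = np.array([
    (E[0] + E[1]) / 2,
    (2 * E[1] + 2 * E[2]) / 4,
])
print(np.allclose(Xb, manual))  # True
```

Each row of the count matrix picks out (and repeats) the embedding rows for the words in that sentence, so the product is the sum of word vectors and the divide turns it into a mean.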

The regularity of the Scikit-Learn API means that we can build some functions that can be used to cross-validate our classifier during training and evaluate it with the test data.

from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

def cross_val(Xtrain, ytrain, clf):
    best_clf = None
    best_score = 0.0
    num_folds = 0
    cv_scores = []
    kfold = KFold(n_splits=10)
    for train, val in kfold.split(Xtrain):
        Xctrain, Xctest = Xtrain[train], Xtrain[val]
        yctrain, yctest = ytrain[train], ytrain[val]
        # fit a fresh copy per fold so best_clf is not overwritten by later folds
        fold_clf = clone(clf)
        fold_clf.fit(Xctrain, yctrain)
        score = fold_clf.score(Xctest, yctest)
        cv_scores.append(score)
        if score > best_score:
            best_score = score
            best_clf = fold_clf
        print("fold {:d}, score: {:.3f}".format(num_folds, score))
        num_folds += 1
    return best_clf, cv_scores

def test_eval(Xtest, ytest, clf):
    print("Test set results")
    ytest_ = clf.predict(Xtest)
    accuracy = accuracy_score(ytest, ytest_)
    print("Accuracy: {:.3f}".format(accuracy))

We now invoke these functions to instantiate a Naive Bayes and SVM classifier, train it with 10-fold cross validation on the training split, and evaluate it with the test data to produce, among other things, a test accuracy. The following code shows the call for doing this with a Naive Bayes classifier. The code for doing this with an SVM classifier is similar.

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
best_clf, scores_nb = cross_val(Xtrain, ytrain, clf)
test_eval(Xtest, ytest, best_clf)

The SIF sentence embeddings also start with the count matrix generated by the CountVectorizer. In addition, we need to compute the word probabilities. If we want to use the word probabilities from the dataset, we can do so by summing each column of the count matrix (one column per word) and normalizing, as follows:

# compute word probabilities from corpus
freqs = np.sum(Xc, axis=0).astype("float")
probs = freqs / np.sum(freqs)

We could also get these word probabilities from some external source such as a file. So given the probs vector, we can create a vector representing the coefficient for each word. Something like this:

ALPHA = 1e-3
coeff = ALPHA / (ALPHA + probs)
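
To get a feel for what this coefficient does, here is a small sketch with made-up corpus counts for a very common, a moderately common and a rare word (the counts are hypothetical, chosen only so the arithmetic is easy to follow).

```python
import numpy as np

ALPHA = 1e-3
# hypothetical corpus-wide counts: very common, moderately common, rare
freqs = np.array([9000.0, 990.0, 10.0])
probs = freqs / freqs.sum()  # 0.9, 0.099, 0.001
coeff = ALPHA / (ALPHA + probs)
for p, c in zip(probs, coeff):
    print("p(w) = {:.3f} -> weight = {:.4f}".format(p, c))
```

The very common word ends up with a weight close to 0.001, while the rare word (whose probability equals ALPHA) gets a weight of 0.5, so frequent words contribute much less to the sentence vector than rare ones.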

We can then compute the raw sentence embedding matrix in a manner similar to the BoW matrix.

Xw = np.multiply(Xc, coeff)
# weighted counts x embeddings, averaged by sentence length
Xs = np.asarray(np.divide(np.dot(Xw, E), sent_lens))

In order to remove the first principal component, we first compute it using the TruncatedSVD class from Scikit-Learn, and then subtract each sentence vector's projection onto it from the raw SIF embedding Xs.

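from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=1, n_iter=20, random_state=0)
svd.fit(np.asarray(Xs))
pc = svd.components_
# subtract each row's projection onto the first principal component
Xr = Xs - np.dot(np.dot(Xs, pc.T), pc)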

As with the BoW sentence embeddings, we split it back to a training and test set, and train the two classifiers and evaluate them.

The full code used for this post is available in this GitHub gist as a Jupyter notebook. The results of running the three sentence embeddings - BoW, SIF and SIF with External Word Probabilities (SIF+EP) - through the two stock Scikit-Learn classifiers for different vocabulary sizes are shown below.

As you can see, I get conflicting results for the two classifiers. For the Naive Bayes classifier, SIF sentence embeddings with native word probabilities narrowly beat the BoW embeddings, whereas in the case of SVM, the SIF embeddings with external word probabilities are slightly better than the BoW results for some vocabulary sizes. Also, accuracies from the other SIF embedding trail the ones from BoW in both cases. Finally, the differences are really minor - if you look at the y-axis on the charts, you will see that the difference is in the third decimal place. So at least based on my experiment, there does not seem to be a significant benefit to using the SIF embeddings over BoW.

My use case does differ from the ideal case in that my captions can be long (longer than a typical sentence) and/or multi-sentence. Also, for the embedding I used the GloVe vectors computed against the 6B corpus, whereas the YingyuLiang/SIF implementation used vectors generated from the 840B corpus. I don't believe these should make too much difference, but I may be wrong. I have tried to follow the paper recommendations as closely as possible when replicating this experiment, but it is possible I may have made a mistake somewhere - in case you spot it please let me know. The code is included, both in this post as well as in the GitHub gist, if you want to verify that it works like I described. As a user of word and sentence embeddings, my primary use case is to use them to encode text input to classifiers. If you have gotten results that indicate SIF sentence embeddings are significantly better than BoW sentence embeddings for this or a similar use case, please let me know.