Saturday, July 22, 2017

The Benefits of Attention for Document Classification

A couple of weeks ago, I presented Embed, Encode, Attend, Predict - applying the 4 step NLP recipe for text classification and similarity at PyData Seattle 2017. The talk itself was inspired by the Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models blog post by Matthew Honnibal, creator of the spaCy Natural Language Processing (NLP) Python toolkit. In it, he posits that any NLP pipeline can be constructed from these 4 basic operations and provides examples from two of his use cases. In my presentation, I use his recipe to construct deep learning pipelines for two other processes - document classification and text similarity.

Now I realize that it might seem a bit pathetic to write a blog post about a presentation about someone else's blog post. But the reason I even came up with the idea for the presentation was because Honnibal's idea of using these higher level building blocks struck me as being so insightful and generalizable that I figured that it would be interesting to use it on my own use cases. And I decided to do the blog post because I thought that the general idea of abstracting a pipeline using these 4 steps would be useful to people beyond those who attended my talk. I also hope to provide a more in-depth look at the Attend step here than I could during the talk due to time constraints.

Today, I cover only my first use case of document classification. As those of you who attended my talk would recall, I did not get very good results for the second and third use cases around document and text similarity. I have a few ideas that I am exploring at the moment. If they are successful, I will talk about them in a future post.

The 4 step recipe

For those of you who are not aware of the 4-step recipe, I refer you to Honnibal's original blog post for the details. But if you would rather just get a quick refresher, the 4 steps are as follows:

  • Embed - converts an integer into a vector. For example, a sequence of words can be transformed through vocabulary lookup to a sequence of integers, each of which could be transformed into a fixed size vector represented by the word embedding looked up from third party embeddings such as word2vec or GloVe.
  • Encode - converts a sequence of vectors into a matrix. For example, a sequence of vectors representing some sequence of words such as a sentence, could be sent through a bi-directional LSTM to produce a sentence matrix.
  • Attend - reduces the matrix to a vector. This can be done by passing the matrix into an Attention mechanism that captures the most salient features of the matrix, thus minimizing the information loss during reduction.
  • Predict - reduces a vector to an integer label. This would correspond to a fully connected prediction layer that takes a vector as input and returns a single classification label.
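To make the shapes at each step concrete, here is a minimal NumPy sketch of the four steps. It is only an illustration under stated assumptions: the weights are random rather than learned, the encoder is a plain tanh recurrence standing in for a Bidirectional LSTM, the Attend step is the naive average, and all names and sizes (VOCAB_SIZE, run_rnn, and so on) are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EMBED_SIZE, HIDDEN_SIZE, NUM_CLASSES = 50, 8, 6, 20

# Randomly initialised parameters (assumption: a real model would learn these).
embedding = rng.normal(size=(VOCAB_SIZE, EMBED_SIZE))
W_in = rng.normal(size=(EMBED_SIZE, HIDDEN_SIZE)) * 0.1
W_rec = rng.normal(size=(HIDDEN_SIZE, HIDDEN_SIZE)) * 0.1
W_out = rng.normal(size=(2 * HIDDEN_SIZE, NUM_CLASSES)) * 0.1


def embed(word_ids):
    """Embed: sequence of integers -> sequence of vectors, shape (T, EMBED_SIZE)."""
    return embedding[word_ids]


def run_rnn(vectors):
    """Plain tanh recurrence over the sequence, returning every hidden state."""
    state = np.zeros(HIDDEN_SIZE)
    states = []
    for vec in vectors:
        state = np.tanh(vec @ W_in + state @ W_rec)
        states.append(state)
    return np.stack(states)


def encode(vectors):
    """Encode: sequence of vectors -> matrix, shape (T, 2 * HIDDEN_SIZE).

    Simplification: both directions share the same weights here.
    """
    forward = run_rnn(vectors)
    backward = run_rnn(vectors[::-1])[::-1]
    return np.concatenate([forward, backward], axis=1)


def attend(matrix):
    """Attend: matrix -> vector. Naive version: average over the time axis."""
    return matrix.mean(axis=0)


def predict(vector):
    """Predict: vector -> integer label via a linear layer and argmax."""
    return int(np.argmax(vector @ W_out))


word_ids = np.array([4, 17, 23, 8, 42])   # a 5-word "sentence" after vocabulary lookup
vectors = embed(word_ids)                 # (5, 8)
matrix = encode(vectors)                  # (5, 12)
summary = attend(matrix)                  # (12,)
label = predict(summary)                  # integer in [0, 20)
print(vectors.shape, matrix.shape, summary.shape, label)
```

The point is the progression of shapes: integers, then a sequence of vectors, then a matrix, then a single vector, then a label.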

Of these steps, all but the Attend step are adequately implemented by most Deep Learning toolkits. My examples use Keras, a Python deep learning library. In Keras, the Embed step is represented by the Embedding layer where you initialize the weights from an external embedding; the Encode step can be implemented using an LSTM layer wrapped in a Bidirectional wrapper; and the Predict step is implemented with a Dense layer.

Experiment: Document Classification

These steps can be thought of as large logical building blocks for our NLP pipeline. A pipeline can be composed of zero or more of these steps. It is also important to realize that each of these steps has a naive, non-deep-learning equivalent. For example, the Embed step can be done using one-hot vectors instead of third party word embeddings; the Encode step can be done by just concatenating the vectors along their short axis; the Attend step can be done by averaging the component word vectors; and the Predict step can use an algorithm other than deep learning. Since I wanted to see the effect of each of these steps separately, I conducted the following set of experiments - the links lead out to Jupyter notebooks on GitHub.

The data for this experiment comes from the 20 newsgroups dataset, which is available through scikit-learn's datasets package. It is a collection of about 18,000 newsgroup postings pre-categorized into one of 20 newsgroups. Our objective is to build a classifier (or classifiers) that can predict the document's newsgroup category from its text.

  • Embed and Predict (EP) - Here I treat a sentence as a bag of words and a document as a bag of sentences. So a word vector is created by looking it up against a GloVe embedding, a sentence vector is created by averaging its word vectors, and a document vector is created by averaging its sentence vectors. The resulting document vector is fed into a 2 layer Dense network to produce a prediction of one of 20 classes.
  • Embed, Encode and Predict (EEP) - We use a document classification hierarchy as described in this paper by Yang, et al.[1]. Specifically, a sentence encoder is created that transforms integer sequences (from words in sentences) into a sequence of word vectors by looking up GloVe embeddings, then converts the sequence of word vectors to a sentence vector by passing it through a Bidirectional LSTM and capturing the context vector. This sentence encoder is embedded into the document network, which takes in a sequence of sequences of integers (representing a sequence of sentences, that is, a document). The sentence vectors are passed into a Bidirectional LSTM encoder that outputs a document vector, again by returning only the context vector. This document vector is fed into a 2 layer Dense network to produce a category prediction.
  • Embed, Encode, Attend and Predict #1 (EEAP#1) - In this network, we add an Attention layer in the sentence encoder as well as in the Document classification network. Unlike the previous network, the Bidirectional LSTM in either network returns the full sequences, which are then reduced by the Attention layer. This layer is of the first type as described below. Output of the document encoding is a document vector as before, so as before it is fed into a 2 layer Dense network to produce a category prediction.
  • Embed, Encode, Attend and Predict #2 (EEAP#2) - The only difference between this network and the previous one is the use of the second type of Attention mechanism as described in more detail below.
  • Embed, Encode, Attend and Predict #3 (EEAP#3) - The only difference between this network and the previous one is the use of the third type of Attention mechanism. Here the Attention layer is fed with the output of the Bidirectional LSTM as well as the output of a max pool operation on the sequence to capture the most important parts of the encoding output.
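The averaging used by the Embed and Predict (EP) baseline can be sketched in a few lines of NumPy. This is only an illustration: the tiny embedding table below is invented and stands in for GloVe, and the helper names are hypothetical.

```python
import numpy as np

EMBED_SIZE = 4

# Hypothetical toy embedding table standing in for GloVe vectors.
toy_embeddings = {
    "the":  np.array([0.1, 0.0, 0.2, 0.1]),
    "cat":  np.array([0.9, 0.3, 0.0, 0.5]),
    "sat":  np.array([0.2, 0.8, 0.1, 0.0]),
    "dogs": np.array([0.7, 0.1, 0.4, 0.6]),
    "bark": np.array([0.3, 0.9, 0.2, 0.2]),
}


def sentence_vector(words):
    """Bag of words: average the vectors of the words we have embeddings for."""
    vecs = [toy_embeddings[w] for w in words if w in toy_embeddings]
    if not vecs:
        return np.zeros(EMBED_SIZE)
    return np.mean(vecs, axis=0)


def document_vector(sentences):
    """Bag of sentences: average the sentence vectors."""
    return np.mean([sentence_vector(s) for s in sentences], axis=0)


document = [["the", "cat", "sat"], ["dogs", "bark", "loudly"]]  # "loudly" is out of vocabulary
doc_vec = document_vector(document)
print(doc_vec.shape)
```

The resulting fixed-size vector is what would be handed to the Dense layers. Word order and sentence order are discarded entirely, which is the information the Encode and Attend steps are meant to recover.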

The results of the experiment are as follows. The interesting values are the blue bars, which represent the accuracy reported by each trained model on the 30% held-out test set. As you would expect, the Bag of Words (EP) approach yields the worst results, around 71.4%, which goes up to 77% once we replace the naive encoding with a Bidirectional LSTM (EEP). All the models with Attention outperform these two models, and the best result is around 82.4% accuracy with the first Attend layer (EEAP#1).

Attention Mechanisms

I think one reason Keras doesn't provide an implementation of Attention is that different researchers have proposed slightly different variations. The only toolkit I know of that offers Attention implementations is TensorFlow (LuongAttention and BahdanauAttention), but both are in the narrower context of seq2seq models. Perhaps a generalized Attention layer is just not worth the trouble given all the variations, and maybe it is preferable to build custom one-offs yourself. In any case, I ended up spending quite a bit of time understanding how Attention worked and how to implement it myself, which I hope to also share with you in this post.

Honnibal's blog post also offers a taxonomy of different kinds of attention. Recall that the Attend step is a reduce operation, converting a matrix to a vector, so the following configurations are possible.

  • Matrix to Vector - proposed by Raffel, et al.[2]
  • Matrix to Vector (with implicit context) - proposed by Lin, et al.[3]
  • Matrix + Vector to Vector - proposed by Cho, et al.[4]
  • Matrix + Matrix to Vector - proposed by Parikh, et al.[5]

Of these, I will cover the first three here since they were used for the document classification example. References to the papers where these were proposed are provided at the end of the post. I have tried to normalize the notation across these papers so it is easier to talk about them in relation to each other.

I ended up implementing them as custom layers, although in hindsight, I could probably have used Keras layers to compose them as well. However, that approach can be complex if your attention mechanism is complicated. If you want an example of how to do that, take a look at spaCy's implementation of decomposable attention used for sentence entailment.

There are many blog posts and articles that talk about how Attention works. By far the best one I have seen is this one from Heuritech. Essentially, the Attention process involves combining the input signal (a matrix) with some other signal (a vector) to find an alignment that tells us which parts of the input signal we should pay attention to. The alignment is then combined with the input signal to produce the attended output. Personally, I have found that it helps to look at a flow diagram to see how the signals are combined, and the equations to figure out how to implement the layer.

Matrix to Vector (Raffel)

This mechanism is a pure reduction operation. The input signal is passed through a tanh and a softmax to produce an alignment vector, with one weight per timestep. The dot product of the alignment and the input signal is the attended output.

One thing to note here is the presence of the learnable weights W and b. The idea is that the component will learn these values so as to align the input based on the task it is being trained for.

The code for this layer can be found in class AttentionM in the custom layer code.
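As a framework-free illustration of the same idea, here is a minimal NumPy sketch for a single sequence. It assumes the general form e_t = tanh(h_t · W + b), a = softmax(e), output = Σ a_t h_t. The function name, the scalar bias and the shapes are my assumptions for the example and are not taken from the AttentionM class.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()


def raffel_style_attention(H, W, b):
    """Reduce H (T, d) to a vector (d,) using weights W (d,) and a scalar bias b.

    Returns the attended vector and the alignment weights (T,).
    """
    scores = np.tanh(H @ W + b)      # one score per timestep
    alignment = softmax(scores)      # non-negative, sums to 1
    return alignment @ H, alignment  # weighted sum of the rows of H


rng = np.random.default_rng(42)
T, d = 5, 3
H = rng.normal(size=(T, d))          # stand-in for the encoder output
W = rng.normal(size=d)               # would be learned during training
context, alignment = raffel_style_attention(H, W, b=0.0)
print(context.shape, alignment.sum())
```

With W set to all zeros every timestep gets the same score, so the layer degenerates to plain averaging. That is a handy sanity check, and it shows that averaging is a special case of this kind of attention.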

Matrix to Vector (Lin)

This mechanism is also a pure reduction operation, since the input to the layer is a matrix and the output is a vector. However, unlike the previous mechanism, it learns an implicit context vector u, in addition to W and b, as part of the training process. You can see this by the presence of a u vector entering the softmax and in the formula for αt.

Code for this Attention class can be found in the AttentionMC class in the custom layer code.
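A minimal NumPy sketch of attention with a learned context vector, for a single sequence, might look like the following. It assumes the form e_t = tanh(h_t · W + b), a = softmax(e · u), output = Σ a_t h_t. The function name, the projection size k and the weight shapes are assumptions made for illustration and may differ from the AttentionMC class.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()


def learned_context_attention(H, W, b, u):
    """Reduce H (T, d) to a vector (d,).

    W: (d, k) projection, b: (k,) bias, u: (k,) learned context vector.
    Returns the attended vector and the alignment weights (T,).
    """
    hidden = np.tanh(H @ W + b)   # (T, k) hidden representation per timestep
    scores = hidden @ u           # (T,) similarity of each timestep to u
    alignment = softmax(scores)
    return alignment @ H, alignment


rng = np.random.default_rng(7)
T, d, k = 6, 4, 3
H = rng.normal(size=(T, d))       # stand-in for the encoder output
W = rng.normal(size=(d, k))       # W, b and u would all be learned
b = np.zeros(k)
u = rng.normal(size=k)
context, alignment = learned_context_attention(H, W, b, u)
print(context.shape, alignment.round(3))
```

The only structural difference from the previous mechanism is the extra vector u: each timestep is scored by how well its projected representation lines up with u, so u plays the role of a context that the network discovers for itself.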

Matrix + Vector to Vector (Cho)

Unlike the previous two mechanisms, this takes an additional context vector that is explicitly provided along with the input signal matrix from the Encode step. This can be a vector that is generated by some external means that is somehow representative of the input. In our case, I just took the max pool of the input matrix along the time dimension. The process of creating the alignment vector is the same as the first mechanism. However, there is now an additional weight that learns how much weight to give to the provided context vector, in addition to the weights W and b.

Code for this Attention class can be found in the AttentionMV class in the code for the custom layers.
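For comparison, here is a hedged NumPy sketch of attention with an explicitly supplied context vector, using the max pool over time as the context, as described above. It uses a generic additive scoring form, e_t = v · tanh(h_t · W + c · U + b), which is one common formulation and not necessarily what the AttentionMV class does. The names W, U, b, v and the sizes are assumptions for the example.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()


def explicit_context_attention(H, c, W, U, b, v):
    """Reduce H (T, d) to a vector (d,) given an external context c (d,).

    W: (d, k) weights for the input, U: (d, k) weights for the context,
    b: (k,) bias, v: (k,) scoring vector.
    """
    hidden = np.tanh(H @ W + c @ U + b)   # c @ U is broadcast across all timesteps
    scores = hidden @ v                   # (T,)
    alignment = softmax(scores)
    return alignment @ H, alignment


rng = np.random.default_rng(3)
T, d, k = 5, 4, 3
H = rng.normal(size=(T, d))               # stand-in for the encoder output
c = H.max(axis=0)                         # max pool along the time dimension
W = rng.normal(size=(d, k))
U = rng.normal(size=(d, k))               # how much weight the context gets
b = np.zeros(k)
v = rng.normal(size=k)
context, alignment = explicit_context_attention(H, c, W, U, b, v)
print(c.shape, context.shape, alignment.round(3))
```

Setting U to zeros removes the influence of the supplied context entirely, which makes it easy to check how much the extra signal actually changes the alignment.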

As you may have noticed, the code for the various custom layers is fairly repetitive. We declare the weights in the build() method and the computations with the weights and signals in the call() method. In addition, we support input masking via the presence of the compute_mask() method. The get_config() method is needed when trying to save and load the model. Keras provides some guidance on building custom layers, but a lot of the information is scattered around in Keras issues and various blog posts. The Keras website is notable, among other things, for the quality of its documentation, but somehow custom layers haven't received the same kind of love and attention. I am guessing that perhaps it is because this is closer to the internals and hence more changeable, so harder to maintain, and also once you are doing custom layers, you are expected to be able to read the code yourself.
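The division of responsibilities between those four methods can be mocked up without any framework. The class below is plain NumPy and is not a Keras layer. It only mirrors the method names mentioned above so you can see what each one is for, and everything in it (the class name, the seed argument, the single-sequence input) is an assumption made for illustration.

```python
import numpy as np


class ToyAttentionLayer:
    """NumPy mock-up of a custom attention layer (not a Keras class)."""

    def __init__(self, seed=0):
        self.seed = seed
        self.W = None
        self.b = None

    def build(self, input_shape):
        # Declare the weights once the input shape (timesteps, features) is known.
        _, features = input_shape
        rng = np.random.default_rng(self.seed)
        self.W = rng.normal(scale=0.1, size=features)
        self.b = 0.0

    def call(self, x, mask=None):
        # Combine weights and signal: score, normalise, then reduce over time.
        scores = np.tanh(x @ self.W + self.b)
        if mask is not None:
            # Padded timesteps get a score of -inf and therefore zero weight.
            # Assumes at least one timestep is unmasked.
            scores = np.where(mask, scores, -np.inf)
        weights = np.exp(scores - scores.max())
        weights = weights / weights.sum()
        return weights @ x

    def compute_mask(self, x, mask=None):
        # The time axis is gone after the reduction, so no mask is passed on.
        return None

    def get_config(self):
        # Constructor arguments needed to re-create the layer when loading.
        return {"seed": self.seed}


x = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]])   # last row is padding
mask = np.array([True, True, False])

layer = ToyAttentionLayer(seed=1)
layer.build(x.shape)
attended = layer.call(x, mask=mask)

# Round trip through the config, as happens when a model is saved and loaded.
restored = ToyAttentionLayer(**layer.get_config())
restored.build(x.shape)
print(attended, restored.call(x, mask=mask))
```

In this mock-up, masking the padded row gives exactly the same result as running the layer on the unpadded sequence, which is the behavior you want from mask support in a reduction layer.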

So there you have it. This is Honnibal's 4-step recipe for deep learning NLP pipelines, and how I used it for one of the use cases I talked about at PyData. I hope you found the information about Attention and how to create your own Attention implementations useful.


  1. Yang, Z., et al. (2016). Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT (pp. 1480-1489).
  2. Raffel, C., & Ellis, D. P. (2015). Feed-forward networks with attention can solve some long term memory problems. arXiv preprint arXiv:1512.08756.
  3. Lin, Z., et al. (2017). A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
  4. Cho, K., et al. (2015). Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11), 1875-1886.
  5. Parikh, A. P., et al. (2016). A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.

Wednesday, July 12, 2017

Trip Report: PyData Seattle 2017

Last week I attended (and presented at) PyData Seattle 2017. Over time, Python has morphed from a scripting language, to a library for scientific computing, and lately pretty much a standard language for most aspects of Machine Learning (ML) and Artificial Intelligence (AI), including Deep Learning (DL). PyData conferences cater mostly to the last demographic. Even though it's not really a place where you go to learn about the state of the art, it's still a great place to catch up with what others in industry are doing with Python and ML/AI/DL. PyData conferences are usually 2-3 day affairs, and they happen multiple times a year, at different places all over the world, organized by local groups of Python/ML enthusiasts.

The conference was 3 days long, one day of tutorials followed by 2 days of presentations. It was held at the Microsoft Campus - along with the conference center, Microsoft also sponsored the food. I stayed at the Hyatt Regency Bellevue, their "preferred" hotel - initially I thought that it meant they would have a shuttle service to and from the conference, but it was because of lower negotiated room rates for conference attendees. But thanks to ridesharing services such as Lyft, I had no problems getting around.

So anyway, here is my trip report... there were 4 simultaneous tracks so there are things I missed because I wanted to attend something even better. Fortunately, there are videos of the talks and the organizers are collecting the slides from the speakers, and all of this will be made available in 2-3 weeks. I will update the links to slides and video when that happens.

Day 1 (July 5, 2017)

pomegranate: fast and flexible probabilistic modeling in python - Maxwell W Libbrecht

I first came across the pomegranate library at PyData Amsterdam last year, where it was mentioned as a package containing several probabilistic graphical models (PGMs), and specifically to build Bayesian Networks. I read the docs in advance, and it turns out that it also contains Hidden Markov Models, General Mixture Models and Factor Graphs. The talk itself was mostly a walkthrough of its capabilities using this notebook. Like many other ML packages in the Python (and other) ecosystem, pomegranate's API is modeled after that of scikit-learn. The examples were very cool, and left me itching for a problem that I might be able to solve using pomegranate :-).

Vocabulary Analysis of Job Descriptions - Alex Thomas

Alex Thomas led us through analyzing the vocabulary of job descriptions. The objective is to extract attributes from the text of job descriptions, which might be used as structured features for these descriptions for downstream tasks. He started with basic ideas like TF-IDF, finding multi-word candidates, using stopwords and extending them, stemming and lemmatizing. Evaluation is done by manually segmenting the job description dataset into different levels of experience required, and building word clouds of the analyzed vocabulary to see if they line up with expectations. All in all, a very useful refresher for things we tend to take for granted with readily available text processing toolkits. What I liked most about this tutorial is that I came away with a subset of tools and ideas that I could use to analyze a vocabulary end-to-end. The GitHub repository for the tutorial is available in case you want to follow along.

Day 2 (July 6, 2017)

Morning Keynote - Katrina Riehl

Katrina Riehl gave the keynote. As an experienced data scientist, she recounted her time in defense and other industries before she arrived at HomeAway, and the ML challenges she works on there. One thing she touched upon was the problem of deploying ML solutions - their main problem is that Python programs are generally not as performant as Java or C/C++ based solutions. Initially they would build and train a Python model, then hand convert to Java or C/C++. Later they looked at PMML - the idea was to train a Python model, then specify its weights and structure and use PMML to instantiate the identical model in Java for production. But this didn't work because of limited availability of PMML aware models and because models in different toolkits have minor differences which break the interop. So finally they settled on microservices - build models in Python, wrap them in microservices, load balance multiple microservices, and consume them from Java based production code. They use protocol buffers for high performance RPC.

Using Scattertext and the Python NLP Ecosystem for Text Visualization - Jason Kessler

Jason Kessler talked about his Scattertext package, which is designed to visualize how different words are used in different classes in a dataset. The visualization he showed was for words or phrases mentioned by Democrats or Republicans during the 2012 elections. He uses a measure called scaled F-score, which achieves a very nice separation of words. He also showed other ways you can drill down deeper into the word associations using the Scattertext API. Overall quite an interesting way to visualize word associations. Jason will also present Scattertext at ACL 2017; here is a link to his paper.

Automatic Citation generation with Natural Language Processing - Claire Kelley and Sarah Kelley

The presenters described two methods for finding similar patents using the US Patent database. The first approach vectorizes the documents using TF-IDF and uses cosine similarity as the similarity metric to find the 20 most similar patents for each patent. The results are compared with the patents already cited, and on average, 10 of the 20 suggested similar patents are already cited in the original patent. The second approach tries to build a recommender by factorizing a patent/citation co-occurrence matrix using the Alternating Least Squares (ALS) method for Collaborative Filtering to generate a ranked list of patent recommendations for each patent. The number of latent factors used was 60. Because recommendations are not necessarily similar documents, an objective evaluation using cited patents is not possible, but the recommendations were found to be quite good when spot checked for a small set of patents. Both approaches work on subsets of the patent dataset that are optimized for the category under investigation. Most of the data ingestion and pre-processing was done using Google BigQuery and the ML work was done using Spark ML.
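The first approach is easy to sketch on a toy scale. The snippet below is not the presenters' code (they used BigQuery and Spark ML). It is a small NumPy illustration of TF-IDF vectors plus cosine similarity, with four invented one-line "patents", whitespace tokenization, and an idf of log(N / df) + 1, all of which are my assumptions.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(docs):
    """Build a dense (n_docs, vocab_size) TF-IDF matrix from raw strings."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    n_docs = len(docs)
    matrix = np.zeros((n_docs, len(vocab)))
    for row, toks in enumerate(tokenized):
        for tok, count in Counter(toks).items():
            idf = math.log(n_docs / doc_freq[tok]) + 1.0
            matrix[row, index[tok]] = count * idf
    return matrix


def most_similar(matrix, doc_idx, k):
    """Indices of the k documents with the highest cosine similarity to doc_idx."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit[doc_idx]
    sims[doc_idx] = -np.inf          # never recommend the document itself
    return [int(i) for i in np.argsort(-sims)[:k]]


# Invented toy documents, not real patent text.
patents = [
    "battery charging circuit for electric vehicle",
    "electric vehicle battery cooling system",
    "method for brewing coffee with pressure",
    "coffee grinder with adjustable burr",
]
neighbours = most_similar(tfidf_matrix(patents), doc_idx=0, k=2)
print(neighbours)
```

The evaluation described in the talk would then amount to intersecting each patent's neighbour list with the set of patents it already cites.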

Scan Statistics with Spark Streaming: Distribution based Real-time Anomaly Detection - Michal Monselise

The presenter describes a method that she used to find temperature anomalies in a stream of temperature data. The idea is to look at a fixed size window and fit a distribution to it. An anomaly is detected when a window is encountered whose distribution does not have the same parameters as the fitted distribution of the previous window.
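Here is a rough batch-mode sketch of that idea, without Spark Streaming. It assumes each window is summarized by a Gaussian fit (mean and standard deviation) and flags a window when its mean is more than a chosen number of standard errors away from the previous window's fit. The threshold, the function names and the synthetic temperature series are all assumptions for illustration and not the presenter's actual method.

```python
import numpy as np


def window_params(window):
    """Fit a Gaussian to a window: return (mean, sample standard deviation)."""
    return float(np.mean(window)), float(np.std(window, ddof=1))


def find_anomalous_windows(stream, window_size, threshold=3.0):
    """Return start indices of windows whose mean drifts from the previous window's fit."""
    anomalies = []
    n_windows = len(stream) // window_size
    prev_mean, prev_std = window_params(stream[:window_size])
    for w in range(1, n_windows):
        start = w * window_size
        mean, std = window_params(stream[start:start + window_size])
        std_err = max(prev_std, 1e-9) / np.sqrt(window_size)
        if abs(mean - prev_mean) / std_err > threshold:
            anomalies.append(start)
        prev_mean, prev_std = mean, std
    return anomalies


# Synthetic temperatures: 40 readings around 20 degrees, then a jump to around 30.
steady = 20.0 + 0.5 * np.sin(np.arange(40))
jump = 30.0 + 0.5 * np.sin(np.arange(20))
temperatures = np.concatenate([steady, jump])

anomalies = find_anomalous_windows(temperatures, window_size=20)
print(anomalies)
```

Only the window that starts at the jump is flagged; the window after a jump would be compared against the new level, so a sustained shift is reported once rather than repeatedly.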

Afternoon Keynote - Jake Vanderplas

Jake Vanderplas is well known in the open source Python community, and his keynote covered the evolution of Python from a scripting language (replacement for bash), to a platform for scientific computing (replacement for MATLAB), and now to a platform for data science (replacement for R). He also covered the tools that someone who wants to do data science in Python should look at. Many of these were familiar to me - numpy, pandas, scikit-learn, numba, cython, etc. - and some were not, for example dask. He also briefly touched upon Python as the de-facto language for most deep learning toolkits. I thought this talk was both interesting and useful. Even though I was familiar with many of the packages he listed, I came away having learned about a couple I didn't know and that I think might be good for me to check out.

In-database Machine Learning with Python in SQL Server - Sumit Kumar

Sumit Kumar of Microsoft showed off new functionality in Microsoft SQL Server that allows you to embed a trained Python machine learning model inside a stored procedure. Unlike the traditional model of pulling data out of the database and then running your trained model on it and writing back the predictions, this approach allows you to run the model on the same server as the database, minimizing network traffic. Microsoft also has tooling in its IDE that loads/reloads the model code automatically during development. The Python code is run in its own virtual machine separate from the database, so problems with the model will not crash the server.

Applying the four step "Embed, Encode, Attend, Predict" framework for text classification and similarity - Sujit Pal

This was my presentation. I spoke about the 4-step recipe for Natural Language Processing (NLP) proposed by Matthew Honnibal, creator of the spaCy NLP toolkit, and described three applications around document classification, document similarity and sentence similarity, where I used this recipe. I also covered Attention in some depth. You can find the code and the slides for the talk at these links. I used the Keras deep learning toolkit for my models, so of the four steps, only the Attend step does not correspond directly to a Keras provided layer. I plan to write in more detail about my Attention implementations in a subsequent blog post.

I was pleasantly surprised at the many insightful questions I got from the audience during and after the talk. I also made a few friends and had detailed conversations around transfer learning, among other things. I also got a very nice demo of a system which automatically learns taxonomies from text which I thought was very interesting.

Chatbots - Past, Present and Future - Dr Rutu Mulkar-Mehta

This was a fairly high level talk but very interesting to me, since I know almost nothing about chatbots and because it may be one of the most obvious places to use NLP. Typically chatbot designers tend to outsource the NLP analysis and concentrate on the domain expertise, so a number of chatbot platforms have come up that cater to this need, with varying degrees of sophistication. Some examples are Chatterbot, API.AI, etc. She talked about the need to extract features from the incoming text in order to feed machine learning classifiers at each stage in the chatbot pipeline for it to decide how to respond. In all, a nice introduction to chatbots, seen as a pipeline of NLP and ML components.

PyData "Pub" Quiz - James Powell

To end the day, we had a 6-part quiz on Python, conducted by the inimitable James Powell. Many of the questions had to do with Python 3 features, Monty Python, and esoteric aspects of the Python language, so not surprisingly, I did not do too well. About the only things I could answer were the two features of Python 3 that I always import from __future__ - print_function and Python 3 style division - and some calls in matplotlib and scikit-learn. But I did learn a lot of things that I didn't know before, always a good thing.

There was a social event after this with food and drink. I hung around for a while and had some interesting conversations, then decided to call it a day.

Day 3 (July 7, 2017)

Morning Keynote - Joseph Sirosh

Joseph Sirosh of Microsoft talked about all the cool things that Microsoft is doing with Python and Machine Learning. He brought in various Microsoft employees to do brief demos of their work.

Medical Image Processing using Microsoft Deep Learning Framework (CNTK) - Naoto Usuyama and Jessica Lundin

Jessica started the presentation off by talking about a newly created Health division inside Microsoft that works as a startup within the parent company. This startup is doing many cool things in the Health and Medical spaces. After that Naoto talked about how he used CNTK to train models for the Diabetic Retinopathy and Lung Cancer challenges from Kaggle. His notebooks for both challenges are available on his pydata-medical-image repository on GitHub. I had been curious about CNTK but had never seen it in action, so this was interesting to me. I also found the approach to preprocessing the Lung Cancer dataset (DICOM) images interesting.

Learn to be a painter using Neural Style Painting - Pramit Choudhary

I find the whole idea of using neural networks to produce images that look like LSD induced hallucinations quite fascinating. However, while building models that generate these images, there have been certain aspects which I have kind of glossed over. One of the things the presenter briefly touched upon was the transformations applied to the content and style images before they are merged - this was one of those things I had glossed over earlier. I figured I would ask him, but there was no time, so I decided to look it up myself, and this video of Leon Gatys's presentation at CVPR 2016 provides the clearest description I have seen so far. The presenter went on to explain how he used Spark to optimize the hyperparameters for the style transfer. He also gave me a jacket from his employer for answering a very simple question correctly.

Scaling Scikit-Learn - Stephen Hoover

The presenter described the data science toolbox his company Civis Analytics markets and the reasoning behind various features. The toolbox is based on the AWS components Redshift, S3 and EC2. They have made some custom GridSearch components that leverage these different components. He also briefly described how one can build their own joblib based custom implementation of parallel_backend. The solution appears to be optimized for parallel batch operation against trained models during prediction. I thought it might be somewhat narrow in its scope, and based on a question from the audience at the end of the talk, I think it may be worth taking a look at Dask to see if some scaling issues may be solved more generally using it.

There were some more talks after lunch which I had to miss because I needed to catch my flight home, since I had already committed (before the talk proposal was accepted) to my family that I would be somewhere else starting Friday evening and through the weekend. So I will watch the videos for the ones I missed once the organizers release them. I will also update the links on this post once that happens.