Sunday, April 15, 2018

Trip Report: Haystack 2018


Earlier this week (April 10 and 11), I was at the Haystack Search Relevance conference in Charlottesville, VA. The conference was organized by Doug Turnbull and Eric Pugh of OpenSource Connections (o19s). Doug Turnbull (and his co-author John Berryman, who was also at the conference) introduced me many years ago to the world of principled search relevance evaluation through their book Relevant Search. The stated objective of the conference was to bring together people interested in search relevance, so we can collectively advance the state of the art in this area. To that end, a Slack channel was created in advance of the conference, allowing people to start forming alliances even before they arrived, and to build upon ideas that came out of actual meetings during the conference.

Although I knew of Doug through his book before I started at my current job, it was during my outreach efforts at our Search Guild (a loose internal group of search engineers, product managers, information retrieval specialists, UI/UX professionals, etc., interested in search) that I got to know him personally. He contacted me to ask if I might be interested in presenting at Haystack, and if so, to submit an abstract. Now, even though my background is in search, lately my work has had more to do with Machine Learning (ML) than with search. And while I wasn't looking, Lucene/Solr/Elasticsearch (the search tools I am familiar with) all jumped a couple or more major versions, so my search skills are dated as well. Almost on a whim, and without much hope of it being accepted, I submitted a proposal around Image Search describing some work I had done over the last year, even though I wasn't confident it would be a good fit for the conference.

In retrospect, I am glad I did. The proposal was accepted, and the resulting talk was quite well received. It also wasn't the complete outlier I had feared it would be. There was at least one other talk about image search (granted, it was more about image recognition and had better outcomes than mine), one about word embeddings (tangentially related, if you consider the vector representation I was using to be an image embedding), and one about a platform that supports tensor-based search (arguably a better platform for image search than text-based indexes such as Lucene).

Most importantly, I got to meet many people whom I had previously known only through their work on various open source projects. Hopefully, this will lead to me being more active on the Slack channel and actually learning something from other participants. As a home-based employee, an accidental benefit was meeting several Elsevier and LexisNexis colleagues who were attending or presenting as well. I also learned that Salmon Run (this blog) is far more popular in the search community than I had imagined, which was very gratifying; I am happy so many people find it useful.

In this post, I provide summaries of the talks I attended. There were two parallel tracks, so there were also a few talks (happening concurrently on the other track) that I wish I could have attended, and I call those out here as well. Slides were made available for most talks, so I have linked to them wherever they were available by the time I published this post.

DAY 1 (Tuesday, April 10)


Keynote - by Eric Pugh and Doug Turnbull
Eric started off the keynote by welcoming everyone and talking about the motivation for starting the conference. Doug then spoke of how, as a search engineer, he felt that conferences on search were either too academic or too focused on a single company or product. He wanted something focused on search relevance techniques regardless of the product you use. It was also a call to arms for the target audience, i.e., search experts involved in open source, to come together and build components that can be composed by others, making it easier and faster to build high quality search applications. He compared this to the standardization of tools such as the plunger in the plumbing industry. Interestingly, there is a golden plunger prominently displayed in the o19s lobby, so I guess he has used the analogy before.

Facets and Similarity - Exploring the Meta-Informational Hyperspace - by Ted Sullivan
Ted spoke about using facets not only to filter records, as they are normally used, but as real metadata. He brought up how, historically, facets were called parameters by Verity, dimensions by Endeca, navigators by FAST, and refiners by Microsoft FAST, showing how the notion of dimensionality is built into facets. He proposed the facet similarity theorem, which states that facets can be used to find similar things. Thus, a collection of facets can be used to compose feature vectors, which can then be compared to other feature vectors using notions of similarity in vector space. In addition, facets can be used as metadata to construct knowledge bases using entity and fact extraction, find paths in category space using pivot facets, build multi-dimensional query suggesters, provide a way to increase precision as the user supplies more data, enable applications such as dynamic boosting of suggestions based on previous queries, and allow much more precise clustering than standard LDA. He also proposed facet ratios as a way to find keyword clusters: topic maps and clusters built from keywords generated through facet ratios (also known as facet-based clustering) end up cleaner than ones derived from raw TF/IDF. Facet-based clustering is going to be incorporated into the LucidWorks Fusion product.
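
To make the feature-vector idea concrete, here is a minimal sketch (mine, not Ted's) that one-hot encodes facet/value pairs into vectors and compares documents with cosine similarity; the facet names and values are hypothetical.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    {"brand": "acme", "color": "red", "category": "shoes"},
    {"brand": "acme", "color": "blue", "category": "shoes"},
    {"brand": "globex", "color": "red", "category": "hats"},
]

# one-hot encode each facet=value pair into a feature vector
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(docs)

print(cosine_similarity(X[0:1], X[1:2]))  # shares brand and category
print(cosine_similarity(X[0:1], X[2:3]))  # shares only color
```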

Algorithmic Extraction of Keywords, Concepts and Vocabularies - by Max Irwin
Max described various approaches he has used to extract keywords, some of which I knew about but others of which were new to me (at least I hadn't tried them myself). Some tools for keyword extraction are gensim's implementation of Latent Dirichlet Allocation (LDA), the RAKE algorithm, and Maui 2.0. He also described concept extraction techniques using POS tagging and edge labeling with SpaCy, Topia TermExtract, and his own SkipChunk system, and mentioned a system called TAXI (Taxonomy Induction) for taxonomy generation. He applied these techniques to the o19s blog corpus and produced results from each approach for comparison. Lots of good information about tools here, especially interesting to me since I am currently doing something quite similar.
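
As an illustration of the kind of tooling Max surveyed, here is a minimal sketch of topic extraction with gensim's LDA implementation; the toy corpus is hypothetical, and real use would need far more text.

```python
from gensim import corpora, models

texts = [
    ["search", "relevance", "ranking", "query"],
    ["image", "vector", "similarity", "search"],
    ["ranking", "model", "click", "query"],
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, terms in lda.print_topics(num_words=4):
    print(topic_id, terms)
```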

From clicks to models, the Wikimedia LTR pipeline - by Erik Bernhardson
Erik talked about the Wikimedia Learning To Rank (LTR) pipeline. Most of the talk focused on the engineering aspects of the pipeline. He described MjoLniR, a library written in PySpark and Scala that transforms click logs into ML ranking models for Elasticsearch (ES). For a baseline, they developed a model that learns the existing ranking function. They use click models, a principled way to translate implicit preferences into unbiased labels, specifically the DbnModel from the Python clickmodels library, which operates on groups of sessions with the same intent. This allows them to optimize query results by adding specific rewrites based on predicted intent.
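
The DBN click model itself is considerably more involved, but here is a hedged sketch of the underlying idea of turning click logs into labels: aggregate sessions with the same intent, and normalize clicks by an examination probability that falls with rank. The log tuples are hypothetical.

```python
from collections import defaultdict

# (query, doc_id, position, clicked) tuples from hypothetical session logs
log = [
    ("cats", "d1", 1, True), ("cats", "d2", 2, False), ("cats", "d3", 3, True),
    ("cats", "d1", 1, True), ("cats", "d3", 2, True),
]

clicks, exams = defaultdict(float), defaultdict(float)
for query, doc, pos, clicked in log:
    exams[(query, doc)] += 1.0 / pos   # crude examination model: falls with rank
    if clicked:
        clicks[(query, doc)] += 1.0

# examination-normalized click rate as a crude graded relevance label
labels = {key: clicks[key] / exams[key] for key in exams}
print(sorted(labels.items(), key=lambda kv: -kv[1]))
```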

Expert Customers: A Hybrid Approach to Clickstream Analytics - by Elizabeth Haubert
Elizabeth talked about her experience building a clickstream analytics platform for search. She covered the features that are normally considered: query features such as click position, number of clicks, and query length; session features such as number of queries per session, number of no-click queries, session time, number of reformulations, and the URLs visited during the session; and user features such as number of clicks, number of queries, the user's dwell time, and the similarity of this user to other users in the system. She stressed the importance of a labeled test set in addition to these captured features, since once a test set becomes available, we can, depending on the amount of human judgment information available, go from simple set differences to increasingly richer metrics such as Precision/Recall, Mean/Expected Reciprocal Rank (MRR/ERR), and Discounted Cumulative Gain (DCG and nDCG). She talked about using TREC data (the Cranfield model) to validate her results. She also covered some common-sense techniques to increase your chances of getting good data from human testers, such as reducing task ambiguity by building stories and guidelines, and using a scale that does not overwhelm users. This was also an interesting talk for me, since I am looking at some of these ideas myself.
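
For reference, here is a minimal sketch of the DCG/nDCG computation mentioned above, using the standard exponential-gain formulation over hypothetical graded judgments.

```python
import math

def dcg(gains):
    """Discounted cumulative gain for judgments in ranked order."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

ranked_gains = [3, 2, 3, 0, 1]  # judgments in the order the engine returned them
print(ndcg(ranked_gains))       # 1.0 only if the ordering were already ideal
```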

In addition, one talk I wish I could have attended was Embracing Diversity: Searching over multiple languages by Jeff Zemerick and Suneel Marthi. Both speakers are seasoned Apache committers on various projects, including ones I have benefited from in the past, such as Mahout and OpenNLP. Their talk covered the need for multi-lingual search, the basics of Machine Translation (MT) such as alignment and phrase models, and the evaluation of MT using BiLingual Evaluation Understudy (BLEU) scores. They also introduced Apache Joshua, an Apache Incubator project that supports statistical MT for phrase-based, hierarchical, and syntax-based translations, written in Java.
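
For the curious, here is a minimal sketch (not the speakers' code) of computing a BLEU score with NLTK over hypothetical reference and candidate sentences.

```python
from nltk.translate.bleu_score import sentence_bleu

reference = ["the", "cat", "sat", "on", "the", "mat"]
candidate = ["the", "cat", "sat", "on", "the", "rug"]

# sentence_bleu takes a list of reference token lists plus one candidate
print(sentence_bleu([reference], candidate))
```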

The first day ended with a series of 5-minute lightning talks from participants. The lightning talks (as well as all Track 2 talks) were held at Random Row, a brewery and pub in the same compound as the o19s offices. The beer was awesome, and I definitely recommend it if you happen to be in the area, but next time I will remember to take notes as well, because it is so hard to remember all the great ideas that came up in these talks. Here are a few that I do remember.

  • Representative Nouns by David Smiley - this is in the context of product search. The idea is that there is usually a noun that uniquely represents a product, and the goal is to identify it and make it available as a searchable facet.
  • Concept Indexing by Shyamsundar Mutcha - my colleague Shyam gave a high level overview of our concept indexing and search algorithm and explained its benefits.
  • Solr Concordancer plugin by Tim Allison - Tim described a Solr plugin to generate concordances, which is useful for exploring your data. Tim and I got to talking after his talk about some use cases I had handled earlier with a concordancer of my own, and it turned out that my concordancer code served as inspiration for his plugin (which he maintains and keeps aligned with various Lucene versions).
  • Solr explain plan visualizer by Tom Burgmans - a Solr plugin that parses the JSON output from Solr's explain plan and produces a nice visualization that is easier to understand and use than the text version.
  • Querqy Solr Query Rewriter plugin by Rene Kriegler - a very useful Solr plugin that allows you to set up pattern matches to queries and associated rewrite rules.
  • Search Metrics by Doug Rosenoff - my colleague Doug described a family of search metrics that provide greater expressivity as more and more training data is provided. Doug also did a full-length presentation, LexisNexis Learning to Rank Case Study, with Doug Heitkamp and Tito Sierra, which I did not attend since I had already seen an internal presentation they did on the same subject.


DAY 2 (Wednesday, April 11)


Learning to Rank in an Hourly Job marketplace - by Xun Wang and Jason Kowaleswski
Xun and Jason are from Snag, the largest online marketplace for hourly workers. Their objective is to match up a job seeker with multiple jobs, and to recommend the best candidate for a given job; in that sense, the problem is one of limited supply and unlimited access. A peculiarity of the hourly job marketplace is that schedule and location are often more important than the actual job content, and the queries reflect that, so the challenge is often to determine intent and context. They set out to migrate from their legacy rule-based search system to a more modern one using the ES Learning to Rank (LTR) plugin built by Doug Turnbull and the team at o19s. Relevancy signals are collected from multiple levels of interaction, such as clicks, intent classification, completed applications, interviews, and actual hires, so the problem is as much recommendation as it is search. The model takes these relevancy features and composes them using LambdaMART, producing a ranking model for the ES LTR plugin. Their migration runs through a new parallel system, which will gradually take on more and more of the existing workload as its relevancy performance approaches that of the existing system over time.
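
Here is a hedged sketch of training a LambdaMART-style ranker with XGBoost, standing in for whatever Snag actually uses; the features, labels, and query groupings are all made up.

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(8, 3)                # relevancy features per (query, job) pair
y = np.array([2, 1, 0, 0, 3, 1, 0, 2])  # graded labels derived from click signals
groups = [4, 4]                         # first 4 rows = query 1, next 4 = query 2

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group(groups)                # rankers need query group boundaries

params = {"objective": "rank:pairwise", "eta": 0.1, "max_depth": 4}
model = xgb.train(params, dtrain, num_boost_round=20)
print(model.predict(xgb.DMatrix(X)))    # per-document ranking scores
```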

A picture is worth a thousand words - approaches to search relevance scoring based on product data, including image recognition - by Rene Kriegler
Rene described the various stages of eCommerce search (problem recognition, information search, alternative evaluation, purchase, and post-purchase) and how each stage informs search intent. He also pointed out that in eCommerce search, each document is a proxy for the thing being searched for, and that consumer interest becomes part of the relevance criteria. Other parts of the relevance criteria are the seller's perspective, personalization and individualization, and topicality. This last part, topicality, is where image recognition comes in. The idea is that you can compute the likelihood of a query given an image recognition vector subspace, i.e., relevancy is a function of the Jaccard similarity between image vectors for products that match a query. Image vectors in this model are generated from a pre-trained Inception network similar to this one. Clustering is done using random projections to produce binary vectors for each image, which can then be clustered; results corresponding to larger clusters are boosted and rescored. He humorously concluded that while a picture may or may not be worth a thousand words, it is definitely worth a language model. One other interesting point he brought up is that the binary vectors resulting from the random projections can be stored in Lucene bit vectors.
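
Here is a minimal sketch (mine, not Rene's) of the random projection step: taking the sign of dot products against random hyperplanes turns each image vector into a short binary code, and images sharing a code fall into the same cluster. The dimensions are hypothetical.

```python
from collections import Counter
import numpy as np

rng = np.random.RandomState(42)
image_vectors = rng.randn(1000, 2048)   # e.g. pre-trained Inception features
planes = rng.randn(2048, 8)             # 8 random hyperplanes

bits = (image_vectors @ planes) > 0     # sign of each projection, (1000, 8)
codes = np.packbits(bits, axis=1)       # one byte code per image

# images that share a code form a candidate cluster
print(Counter(bytes(row) for row in codes).most_common(3))
```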

A Vespa Tour - by Matt Overstreet
Vespa bills itself as an open big data serving engine. It provides a hybrid indexing platform that supports both text search and similarity search in vector space. You can use it to build standard text-based search applications and personalized recommendations, as well as machine-learning-oriented similarity engines. You can also build navigation pages computed on demand, and realtime data displays such as tag clouds, maps, and graphs. Matt took us through configuring Vespa, using it for linguistics use cases (mainly text search), and the flexibility of Vespa's ranking. You can provide a middleware component that implements some notion of similarity, and Vespa will support that similarity. Tensorflow (TF) models can also be embedded as part of this middleware, which means you can dynamically compare records based on the notion of similarity embodied by the TF model. Vespa provides a query language called YQL (Yahoo Query Language), and both Python and Java are supported for developing middleware. I had been putting off looking at Vespa because of the complexity and breadth of the software, but it looks like it might be worth checking out in connection with some of the work I have been doing around image similarity.
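
As a flavor of the API, here is a hedged sketch of issuing a YQL query to Vespa's HTTP search endpoint from Python; the host, field names, and query are hypothetical and assume a locally running Vespa instance.

```python
import requests

yql = 'select * from sources * where default contains "relevance";'
resp = requests.get("http://localhost:8080/search/", params={"yql": yql})

for hit in resp.json().get("root", {}).get("children", []):
    # each hit carries a relevance score and the stored fields
    print(hit.get("relevance"), hit.get("fields", {}).get("title"))
```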

The Solr Synonyms Maze: Pros, Cons and Pitfalls of Various Synonym Usage Patterns - by Bertrand Rigaldies
Bertrand talked about the difference in the way Lucene handles multi-term synonyms on the query side versus the index side, and the problems that can arise as a result. Specifically, the index side handles offsets for multi-term synonyms incorrectly, leading to flattening and weird results. Mike McCandless called this behavior sausagization and wrote this blog post in 2012 describing the problem. Bertrand used the Solr JSON API and the Python networkx library to build visualizations of the query paths to demonstrate the problem, and suggested several solutions. The best solution, if it is feasible for your installation, is to reduce all synonyms to single-term semantic tokens. In many instances it is not feasible to do so, however, and the talk described strategies to prevent problems with synonym generation in those cases.
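
Here is a minimal sketch (not Bertrand's code) of modeling such a token graph with networkx: a multi-term synonym creates a parallel path across the same positions, which is exactly the structure that index-time flattening destroys.

```python
import networkx as nx

g = nx.DiGraph()
g.add_edge(0, 1, token="usa")        # single-token path
g.add_edge(0, 2, token="united")     # multi-token synonym spans the
g.add_edge(2, 1, token="states")     # same start and end positions

for u, v, data in g.edges(data=True):
    print(u, "->", v, data["token"])
# nx.draw_networkx(g) would render the graph via matplotlib
```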

Evolving a Medical Image Similarity Search - by Sujit Pal
This was my talk. I talked about our (still incomplete) Medical Image Similarity project. I covered the various strategies for feature extraction, starting with somewhat naive features such as color and texture, moving to local features such as edges and corners, and ending up with deep learning features such as vectors generated from pre-trained image classification models. I then covered various indexing strategies I had considered, some of which allow you to represent image vectors as text-based postings lists, and others which depend on platforms that compare image vectors natively. I also covered how we evaluated the various search algorithms and indexing strategies using human ratings on a 4-point scale; while the results so far are not impressive in absolute terms, they do represent progress compared to our baselines, showing that we are at least on the right track. Finally, I talked about various ideas we would like to try, which will hopefully give us better results. Interestingly, some of these ideas were also mentioned by other speakers, so it was good to get corroboration. I was quite impressed by the quality of the questions and suggestions I got; they were very well thought out and insightful.
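
As an illustration of the deep-learning feature extraction step, here is a minimal sketch that uses a pre-trained InceptionV3 in Keras as a fixed image vectorizer (one of several possible model choices); the image path is hypothetical.

```python
import numpy as np
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.preprocessing import image

# include_top=False with average pooling yields a fixed-length feature vector
model = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

img = image.load_img("some-image.png", target_size=(299, 299))  # hypothetical path
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

vector = model.predict(x)[0]   # 2048-dim vector, ready for indexing
print(vector.shape)
```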

The Relevance of Solr's Semantic Knowledge Graph - by Trey Grainger
Trey described the Semantic Knowledge Graph plugin for Solr, contributed by LucidWorks and based on his research at CareerBuilder. It can be used to discover and rank relationships between arbitrary queries and terms in the index. Other uses include discovering related terms and concepts, disambiguating different meanings of terms given the context, cleaning up noise in datasets, discovering unknown relationships between documents and fields, summarizing documents, etc. It does this by maintaining a so-called forward index, in addition to Lucene's existing inverted index, that maps from documents to terms. Traversals happen by alternately walking the forward and inverted indexes. Weights for each node are assigned as a ratio of foreground vs. background weights, which are in turn derived from other metrics. The code is open source and available on GitHub at careerbuilder/semantic-knowledge-graph. Trey also covered various applications of the knowledge graph, such as data cleaning, predictive analytics, intelligent search expansion, and document summarization and enrichment.
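
The plugin's actual relatedness formula is more sophisticated, but here is a toy sketch of the foreground/background weighting idea described above; the counts are hypothetical.

```python
def fg_bg_weight(term_fg_docs, fg_size, term_bg_docs, bg_size):
    """Enrichment of a term in the foreground (query result) set
    relative to the background (whole corpus)."""
    fg_rate = term_fg_docs / fg_size
    bg_rate = term_bg_docs / bg_size
    return fg_rate / bg_rate if bg_rate > 0 else 0.0

# "java" appears in 40% of docs matching a query, but only 2% of the corpus
print(fg_bg_weight(400, 1000, 20000, 1000000))  # -> 20.0, a strong relationship
```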

Catch my Drift? Building bridges with Word Embeddings - by Peter Dixon-Moses
Peter described an interesting strategy to increase recall using word2vec. Essentially, you look up neighbors for your query word using the word2vec find_synonyms() call and then rewrite your query to look for them as well. Another approach could be to generate a thesaurus. He also pointed out an Elasticsearch plugin called elasticsearch-vector that allows you to do vector arithmetic. However, the problem with this approach is that the similarity is applied to the entire corpus, so the strategy when using vector matching should be to use it as a rescorer. Other applications could be to use embeddings representing history to rescore current results. He also suggested using analogies to do queries; for example, in a real estate search scenario, you could do cityname -professionals to find similar cities but without so many professionals (and so, hopefully, cheaper). An interesting insight Peter provided concerned training word2vec models on your own data: just as with fine-tuning in deep learning, it should be possible to start with a pre-trained embedding model and fine-tune it on your own data.
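
Here is a minimal sketch of the recall-expansion idea using gensim, whose most_similar() call plays the role of the find_synonyms() call mentioned above; the pre-trained vector file is hypothetical.

```python
from gensim.models import KeyedVectors

# hypothetical pre-trained vectors, e.g. trained on your own query logs
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

query_term = "laptop"
expansions = [w for w, _ in vectors.most_similar(query_term, topn=5)]

# OR the neighbors into the query before sending it to Solr/ES
rewritten = " OR ".join([query_term] + expansions)
print(rewritten)
```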

Day 2 had many interesting talks, and once again I was forced to make choices. The talks I missed because there was an equally (or, at least to me, more) interesting talk on the other track are listed below. Very short summaries are provided based on a quick read of the slides, where available.

  • Understanding Queries with NER - by Ryan Pedela
  • The gentle art of incorporating "business rules" into results - by Scott Stults. Slides not available at the moment.
  • Realtime Entity Resolution with ElasticSearch - by Dave Moore. Dave talked about an ES plugin called Zentity that uses facets to identify and extract features for named entities in a search index.
  • Interleaving: from evaluation to self learning - by John T Kane. This is a very nice introduction to the idea of LTR. LTR models are currently trained using a batch process; the idea here is to transform this into an online learning setup using Reinforcement Learning (RL) and continuous competition, by interleaving results from competing engines instead of running A/B tests.
  • Bad Text, Bad Search: Evaluating Text Extraction with Apache Tika's tika-eval Module - by Tim Allison. Tim Allison is a long-time committer on Apache Tika, a toolkit to extract text and metadata from over a thousand different file types. He describes at a high level what Tika can do for you, and then focuses on the tika-eval module, which allows you to compare extraction results.

Overall, I thought this conference was very useful. Even though search is no longer my primary focus, most applications I am part of building rely on search to some extent. In addition, search itself is expanding to become more than just efficient information retrieval. Many innovations in search depend on Natural Language Processing (NLP), ML, and increasingly RL techniques. In a sense, search relevance is tied to all of these innovations, since each of them serves to push the relevance envelope a bit further, leading to better and more relatable results for human users. I think Haystack has placed itself in a very interesting position. I learned of many such innovations in my two days at Haystack, and look forward to applying these ideas in my own work.

