Friday, May 01, 2009

HTTP Search Interface to Lucene using Mule

Introduction

For quite a while now, I've been thinking, off and on, about centralizing our search functionality. Currently, our indexes are deployed locally with the application, which is something of an operations nightmare. As we scale out by increasing the number of machines in our tiers, and introduce brand new tiers with new products, the situation can only get worse. Some time ago, I had built a simple RMI server which would act as a central repository for all our indexes (perhaps scaled out horizontally behind a load balancer), but that would have needed quite a bit of change to our codebase to perform reasonably, so I abandoned the idea. Other things came up and I forgot about this - from the looks of it, reports of operations nightmares seem to have been grossly exaggerated :-).

Why not Solr?

At this point, most of you are probably thinking about Solr, and wondering why I am attempting to reinvent the wheel. Well, for a couple of reasons, actually:

  1. Solr is very customizable, but it offers no customization hook for the one place I need it most. Our search is really a meta-search, aggregating results from multiple internal sources, each of which can be backed by multiple indexes, each of which is built using radically different analyzers. Solr follows the one IndexSearcher per instance model, which is unlikely to change, since its update strategy is based on this assumption. We could probably use Solr's distributed search to get around that, but the performance penalty would be too high.
  2. Unlike Solr, our model of updating indexes is to simply replace them with freshly built ones. Logic to detect the availability of a new index is built into the code, so no application restarts are necessary (a minimal sketch of this kind of reopen check follows this list). I could probably implement this in Solr with a custom RequestHandler, and much more simply than it is currently implemented in our code.
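
To give an idea of what this looks like, here is a minimal sketch of such a reopen check, assuming the batch build simply replaces the index directory on disk. The class and its use of the directory timestamp are illustrative only, not the actual application code:

// Hypothetical sketch (not the actual application code): swap in a new
// IndexSearcher when the index directory has been replaced by a batch
// build, so no application restart is needed. Names are illustrative.
package com.mycompany.searchservice;

import java.io.File;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Searcher;

public class ReloadingSearcherHolder {

  private final String indexPath;
  private long lastModified = -1L;
  private Searcher searcher;

  public ReloadingSearcherHolder(String indexPath) {
    this.indexPath = indexPath;
  }

  /**
   * Returns a Searcher over the current index, closing the old one and
   * opening a new one if the index directory has changed since the
   * last call.
   */
  public synchronized Searcher getSearcher() throws Exception {
    long currentModified = new File(indexPath).lastModified();
    if (searcher == null || currentModified != lastModified) {
      if (searcher != null) {
        searcher.close();
      }
      searcher = new IndexSearcher(indexPath); // Lucene 2.4 style ctor
      lastModified = currentModified;
    }
    return searcher;
  }
}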

Why Mule?

I started thinking about this again after I attended a talk by Ken Yagen on Mule ESB at the EBig Java SIG some weeks ago. Instead of using RMI, this time I decided to build something along the lines of Solr, i.e., an HTTP interface to the indexes, and it seemed like a good way to get familiar with Mule. So here it is...

POM changes

I used the Maven archetype from here, and changed the version parameter to reflect the current Mule version. In addition, I added dependencies on the Mule HTTP transport and Lucene 2.4. The differences are shown below:

<?xml version="1.0" encoding="UTF-8"?>
<project ...>
  ...
  <properties>
    <mule.version>2.2.1</mule.version>
  </properties>

  <dependencies>
    ...
    <!-- Add support for http -->
    <dependency>
      <groupId>org.mule.transports</groupId>
      <artifactId>mule-transport-http</artifactId>
      <version>${mule.version}</version>
      <scope>provided</scope>
    </dependency>
    <!-- Add Support for Lucene -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>2.4.0</version>
      <scope>compile</scope>
    </dependency>

  </dependencies>
  ...
</project>

The Configuration

Mule uses its own XML configuration. The configuration file shown below contains the details of the entire Mule service. You can find more details about configuring a Mule service here.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/resources/mule-config-spring.xml -->
<mule xmlns="http://www.mulesource.org/schema/mule/core/2.2"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:spring="http://www.springframework.org/schema/beans"
       xmlns:http="http://www.mulesource.org/schema/mule/http/2.2"
       xsi:schemaLocation="
       http://www.springframework.org/schema/beans 
       http://www.springframework.org/schema/beans/spring-beans-2.5.xsd
       http://www.mulesource.org/schema/mule/core/2.2 
       http://www.mulesource.org/schema/mule/core/2.2/mule.xsd
       http://www.mulesource.org/schema/mule/http/2.2 
       http://www.mulesource.org/schema/mule/http/2.2/mule-http.xsd">
       
  <!-- Application specific beans -->
  <spring:beans>
    <spring:import resource="classpath:components-spring.xml"/>
    <spring:bean id="propertyConfigurer"
        class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
      <spring:property name="location" 
        value="classpath:mule-config-spring.properties"/>
    </spring:bean>
  </spring:beans>       

  <!-- Connectors -->
  <http:connector name="httpConnector" enableCookies="false" keepAlive="true"/>
  
  <!-- Transformers -->
  <custom-transformer name="requestTransformer" 
    class="org.mule.transport.http.transformers.HttpRequestBodyToParamMap"/>

  <!-- Model -->
  <model name="main">
    <service name="searchService">
      <inbound>
        <http:inbound-endpoint address="http://localhost:8888/search" 
          synchronous="true" contentType="text/xml" 
          transformer-refs="requestTransformer"/>
      </inbound>
      <component>
        <spring-object bean="searchServiceUmo"/>
      </component>
    </service>
  </model>
</mule>

Mule's configuration integrates very nicely with Spring. The only application code in the service is the SearchServiceUmo, which is defined in the components-spring.xml file below using standard Spring semantics. This file is referenced from the main configuration file using an import.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/resources/components-spring.xml -->
<beans xmlns="http://www.springframework.org/schema/beans"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.springframework.org/schema/beans 
  http://www.springframework.org/schema/beans/spring-beans.xsd">
  
  <bean id="searchServiceUmo" 
      class="com.mycompany.searchservice.SearchServiceUmo"
      init-method="init" destroy-method="destroy">
    <property name="indexPaths">
      <list>
        <value>/path/to/my/index</value>
      </list>
    </property>
  </bean>
</beans>

The Code

The workhorse class is the SearchServiceUmo. It takes in a Map of request parameters representing a query from a remote client, and executes a Lucene search against a local index. It then returns a List of result beans (a POJO, shown further below), converted into an XML stream. One important thing to note is that there is no mention of any Mule API or classes, i.e., the coupling between application code and Mule is only via the XML wiring.

// Source: src/main/java/com/mycompany/searchservice/SearchServiceUmo.java
package com.mycompany.searchservice;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.BooleanClause.Occur;

import com.thoughtworks.xstream.XStream;

/**
 * User defined search service.
 */
public class SearchServiceUmo {

  private final Log log = LogFactory.getLog(getClass());

  private static final Analyzer ANALYZER = new StandardAnalyzer();
  
  private List<String> indexPaths;
  private Searcher searcher;
  
  public void setIndexPaths(List<String> indexPaths) {
    this.indexPaths = indexPaths;
  }
  
  protected void init() throws Exception {
    if (indexPaths.size() == 0) {
      throw new IllegalArgumentException(
        "At least one index must be specified");
    } else if (indexPaths.size() == 1) {
      this.searcher = new IndexSearcher(indexPaths.get(0));
    } else {
      Searcher[] searchers = new Searcher[indexPaths.size()];
      for (int i = 0; i < searchers.length; i++) {
        searchers[i] = new IndexSearcher(indexPaths.get(i));
      }
      this.searcher = new MultiSearcher(searchers);
    }
  }
  
  protected void destroy() {
    if (searcher != null) {
      try {
        searcher.close();
        searcher = null;
      } catch (Exception e) {
        log.warn("Searcher at " + indexPaths + 
          " could not be closed", e);
      }
    }
  }

  /**
   * For synchronous services, there does not seem to be a way to 
   * apply a transformation on the results returned from a component,
   * so we are doing this in code...its probably not that big a deal, 
   * since its only 2 lines of code, but it would be nice if we could 
   * do this using an available Mule component (ObjectToXml is available,
   * but cannot be used without complex shenanigans, as far as I can see).
   * @param params the request parameters as a Map of name-value pairs.
   * @return the response XML string.
   * @throws Exception if thrown.
   */
  public String search(Map<String,Object> params) throws Exception {
    List<SearchResultBean> beans = searchInternal(params);
    String result = new XStream().toXML(beans);
    return result;
  }
  
  private List<SearchResultBean> searchInternal
      (Map<String,Object> params) throws Exception {
    // we could probably write a custom transformer here to get a 
    // parameter object as our argument, which would allow for 
    // multiple params with the same name, and other good stuff, 
    // but we are lazy, so...
    if (params.containsKey("reopen")) {
      // this is for the batch update script
      destroy();
      init();
      return Collections.emptyList();
    } else {
      Query query = buildQuery((String) params.get("query"));
      Filter filter = buildFilter((String) params.get("filter"));
      Sort sort = buildSort((String) params.get("sort"));
      int startIndex = Integer.valueOf((String) params.get("start"));
      int endIndex = Integer.valueOf((String) params.get("end"));
      TopDocs td = searcher.search(query, filter, endIndex, sort);
      ScoreDoc[] sds = td.scoreDocs;
      List<SearchResultBean> results = 
        new ArrayList<SearchResultBean>();
      // don't run past the actual number of hits returned
      int upperBound = Math.min(endIndex, sds.length);
      for (int i = startIndex; i < upperBound; i++) {
        Document doc = searcher.doc(sds[i].doc);
        results.add(new SearchResultBean(doc, sds[i].score));
      }
      return results;
    }
  }

  /**
   * Build up the query object using some standard rules. In this case,
   * our rules are (body:${q}) OR (title:${q})^4.0. We used Standard
   * Analyzer to tokenize both body and title at index time, so we must 
   * also use it for query building.
   * @param q the query term.
   * @return the Query object.
   */
  private Query buildQuery(String q) throws Exception {
    BooleanQuery query = new BooleanQuery();
    Query titleQuery = new QueryParser("title", ANALYZER).parse(q);
    titleQuery.setBoost(4.0F);
    query.add(titleQuery, Occur.SHOULD);
    Query bodyQuery = new QueryParser("body", ANALYZER).parse(q);
    query.add(bodyQuery, Occur.SHOULD);
    return query;
  }
  
  /**
   * Some filtering criteria. In our case, we know that our tags contain
   * our filtering criteria, so we use that. The parameter is a comma-
   * separated list of tags. The tags are indexed without tokenizing, so
   * we use a plain TermQuery here.
   * @param tags the tags to filter on.
   * @return a Filter object.
   */
  private Filter buildFilter(String tags) {
    if (StringUtils.isEmpty(tags)) {
      return null;
    }
    BooleanQuery query = new BooleanQuery();
    String[] tagArray = StringUtils.split(tags, ",");
    for (int i = 0; i < tagArray.length; i++) {
      TermQuery tquery = new TermQuery(new Term("tag", tagArray[i]));
      query.add(tquery, Occur.MUST);
    }
    return new CachingWrapperFilter(new QueryWrapperFilter(query));
  }

  /**
   * We always sort by the natural order of the sort fields specified.
   * If a field is prefixed with a '-', then we reverse the natural
   * sort order for that field. 
   * @param sortFields a comma-separated list of sort fields to sort by.
   * @return a Sort object for this search.
   */
  private Sort buildSort(String sortFields) {
    if (StringUtils.isEmpty(sortFields)) {
      return Sort.RELEVANCE;
    }
    String[] sortFieldArray = StringUtils.split(sortFields, ",");
    SortField[] sfs = new SortField[sortFieldArray.length];
    for (int i = 0; i < sortFieldArray.length; i++) {
      if (sortFieldArray[i].startsWith("-")) {
        sfs[i] = new SortField(sortFieldArray[i].substring(1), true);
      } else {
        sfs[i] = new SortField(sortFieldArray[i]);
      }
    }
    return new Sort(sfs);
  }
}

The SearchResultBean is a simple POJO. I have removed the getters and setters to keep the listing short; use your IDE to generate them. Note that if you want to deserialize the XML back into this bean on the client side, the bean must exist on the client's CLASSPATH as well.

// Source: src/main/java/com/mycompany/searchservice/SearchResultBean.java
package com.mycompany.searchservice;

import java.io.Serializable;

import org.apache.commons.lang.builder.ReflectionToStringBuilder;
import org.apache.lucene.document.Document;

/**
 * Simple POJO to hold the contents of a search result. This should 
 * be available on the client side as well, in order for XStream to 
 * be able to deserialize this into a SearchResultBean.
 */
public class SearchResultBean implements Serializable {
  
  private static final long serialVersionUID = -2701792004759978895L;
  
  private String id;
  private String title;
  private String summary;
  private String[] tags;
  private String url;
  private float score;
  
  public SearchResultBean(Document doc, float score) {
    this.id = doc.get("id");
    this.title = doc.get("title");
    this.summary = doc.get("summary");
    this.tags = doc.getValues("tags");
    this.url = doc.get("url");
    this.score = score;
  }

  // ... getters and setters removed for brevity

  @Override
  public String toString() {
    return ReflectionToStringBuilder.toString(this);
  }
}

The Main class is not needed if you are running under a standalone Mule installation. In my case, it allows me to start up the Mule service from within my IDE; it is adapted from the template in the archetype.

// Source: src/main/java/com/mycompany/searchservice/Main.java
package com.mycompany.searchservice;

import org.apache.log4j.BasicConfigurator;
import org.mule.api.MuleContext;
import org.mule.api.MuleException;
import org.mule.api.config.ConfigurationBuilder;
import org.mule.api.config.ConfigurationException;
import org.mule.api.context.MuleContextFactory;
import org.mule.api.lifecycle.InitialisationException;
import org.mule.config.spring.SpringXmlConfigurationBuilder;
import org.mule.context.DefaultMuleContextFactory;

/**
 * Launcher for the Mule based search service.
 */
public class Main {

  public static void main(String[] args) {
    BasicConfigurator.configure();
    MuleContext context = null;
    String[] resources = {"mule-config-spring.xml"};
    try {
      MuleContextFactory factory = new DefaultMuleContextFactory();
      ConfigurationBuilder builder = 
        new SpringXmlConfigurationBuilder(resources);
      context = factory.createMuleContext(builder);
      context.start();
      System.out.println("Starting Mule Instance");
    } catch (ConfigurationException e) {
      e.printStackTrace();
    } catch (InitialisationException e) {
      e.printStackTrace();
    } catch (MuleException e) {
      e.printStackTrace();
    }
  }
}

The Obligatory Screenshot

...just to show you how it all works. Typing in the URL:

http://localhost:8888/search?query=maven&start=0&end=10

Returns a screenful of XML as shown below:
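
Since the response is simply the XStream serialization of the List of SearchResultBean objects, it is structured roughly like the snippet below; the values shown are placeholders, and the element names follow XStream's default conventions:

<list>
  <com.mycompany.searchservice.SearchResultBean>
    <id>1</id>
    <title>Maven in five minutes</title>
    <summary>A placeholder summary for this result...</summary>
    <tags>
      <string>maven</string>
      <string>build</string>
    </tags>
    <url>http://www.example.com/some/page</url>
    <score>1.4142135</score>
  </com.mycompany.searchservice.SearchResultBean>
  <!-- ...more SearchResultBean elements... -->
</list>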

Client code could call this using a simple HTTP client, deserializing the XML (perhaps using XStream) back into a List of SearchResultBean objects and using it as required in the application; a minimal client sketch follows. In a more Mule-aware organization, the client would probably also be a Mule service.
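
Here is such a client sketch under those assumptions. The class name and URL handling are illustrative, and SearchResultBean must be available on the client CLASSPATH for XStream to deserialize into it:

// Hypothetical client sketch; not part of the Mule service itself.
package com.mycompany.searchclient;

import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.List;

import com.mycompany.searchservice.SearchResultBean;
import com.thoughtworks.xstream.XStream;

public class SearchServiceClient {

  private static final String SERVICE_URL = "http://localhost:8888/search";

  @SuppressWarnings("unchecked")
  public List<SearchResultBean> search(String query, int start, int end)
      throws Exception {
    String url = SERVICE_URL 
      + "?query=" + URLEncoder.encode(query, "UTF-8")
      + "&start=" + start 
      + "&end=" + end;
    HttpURLConnection conn = 
      (HttpURLConnection) new URL(url).openConnection();
    InputStream in = conn.getInputStream();
    try {
      Reader reader = new InputStreamReader(in, "UTF-8");
      // SearchResultBean must be on the client CLASSPATH for this cast
      return (List<SearchResultBean>) new XStream().fromXML(reader);
    } finally {
      in.close();
      conn.disconnect();
    }
  }

  public static void main(String[] args) throws Exception {
    List<SearchResultBean> results = 
      new SearchServiceClient().search("maven", 0, 10);
    for (SearchResultBean result : results) {
      System.out.println(result);
    }
  }
}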

Conclusion

As you can see, Mule provides a lot of components and XML wiring features that make it easy for the application developer to concentrate on the business logic and leave the integration details to Mule. However, while I was building this application, I realized that it would be more pragmatic (and easier) to just build a simple web application wrapper.

Obviously, this is not a reflection on the quality of the Mule software. In a shop that is using Mule more heavily, this would probably be an ideal approach. However, swallowing the Mule elephant (sorry, mixing metaphors here) just to take advantage of its HTTP connector and a couple of transformers seems like overkill. Developers here are very familiar with Spring and Lucene, so building a simple web application is far easier than learning the Mule architecture and all its components.

Another thing I noticed is that there seems to be more emphasis on asynchronous messaging in Mule, and perhaps rightly so. In my case, I would have liked to wire up a transformer to run after my component, perhaps on an outbound endpoint, but since my service is synchronous I can only configure the inbound endpoint, which does not allow transformers to run after the component. I ultimately ended up doing the post-transformation in the service code itself. Of course, Mule is a work in progress, so I am sure the functionality will show up in a later version if it doesn't exist already. If you know how to achieve this, please let me know.

References

I found the following sites helpful during development; if you want to try something along similar lines, you will probably find them helpful too.

  • Maven Archetype for Mule Projects from the Morning Java blog. The archetype provides a simple example of a Mule service, which is very helpful. It is based on Mule 2.0.0-RC2, but a simple version change in the POM got me set up with the current Mule version (2.2.1 at the time of this writing). In addition, I had to add dependencies on Lucene and the Mule HTTP connector (see the POM snippets above).
  • Mule Instance Configuration from the Mule documentation.
  • This page provides some information about modeling synchronous request-response style messaging in Mule.
  • A very informative article from InfoQ, written by Jackie Wheeler.
  • This discussion thread provided me with insight about how to handle HTTP connectors in the application.
  • I've been meaning to look at Solr for a while now, and I finally did it before starting on this application, to see if I could use Solr. The Solr Getting Started Guide was very helpful to set up a simple Solr instance which I could experiment with while going through the Solr code and documentation.

4 comments (moderated to prevent spam):

abhirama said...

Why don't you use an embedded HTTP server like jetty?

Sujit Pal said...

Hi abhirama, yes, that's a good idea. I ended up making it a simple webapp (with a war file) - the application I finally built used JSON serialization (instead of XML using XStream) and also provides a single JSP page for fine-tuning the query manually before putting it into code, similar to the Solr admin page. Without the admin page, it would have made a lot of sense to use embedded Jetty, thank you for the suggestion.

Ilango said...

Great article.
Do you have the finished webapp? I would love to test it.
A few years ago I played with Solr and XForms. I would love to see how Solr plays with Mule.

Sujit Pal said...

Thanks Ilango. This post is actually about using Mule to build a webservice wrapper over Lucene (not Solr). And no, I don't have this webapp anymore; I abandoned this idea in favor of a plain Spring app instead, although that approach isn't live either, since I haven't yet had a chance to convert all our legacy code to work against a webservice.