Monday, September 03, 2007

More Jackrabbit - using XPath queries

Last week, I posted my initial experiences with Apache Jackrabbit. I have learned a little more since then, although it's still not enough for me to consider moving all our content over to it. In my last post, I built up a content repository with the following design:

 rootnode
    |
    +-- ${contentType}
            |
            +-- ${contentId} {
                  properties {title:"foo", etc}
                }

This design works for getting back content by contentId only, since we can simply navigate down to the contentId node directly using code such as this:

    Node contentNode = session.getRootNode().getNode(contentSource).getNode(contentId);
    DataHolder dataHolder = new DataHolder();
    PropertyIterator pi = contentNode.getProperties();
    while (pi.hasNext()) {
      Property prop = pi.nextProperty();
      dataHolder.setProperty(prop.getName(), prop.getValue().getString());
    }
    return dataHolder;
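As an aside, the DataHolder class is our own and is not shown in this post; here is a minimal sketch of what it might look like, assuming it is nothing more than a string property bag (the real class may carry more behavior):

```java
import java.util.HashMap;
import java.util.Map;

// hypothetical minimal DataHolder: a flat bag of string properties
// copied out of a JCR node
public class DataHolder {

    private final Map<String, String> props = new HashMap<String, String>();

    public void setProperty(String name, String value) {
        props.put(name, value);
    }

    public String getProperty(String name) {
        return props.get(name);
    }
}
```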

However, this approach breaks down completely when we want to search by some other attribute, such as URL, which is a fairly common occurrence in content based applications. But not to worry ... all JCR compliant repositories, including Jackrabbit, provide an interface to query nodes using XPath, which is described in some detail in this article. In addition, Jackrabbit also provides an SQL interface, which is described here.

To allow for this kind of searching, I had to change the way I had set up my content in the repository, something like this:

 rootnode
    |
    +-- ${contentType}
            |
            +-- content {
                  properties {contentId:1234, title:"foo", etc}
                }
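With this layout, a lookup by contentId (or any other property) becomes an XPath predicate on the content node. A tiny helper to assemble the expression (the class and method names here are mine, not part of the JCR API):

```java
// hypothetical helper: builds the XPath expression for a contentId
// lookup under the redesigned tree: //${contentType}/content[@contentId='...']
public class QueryStrings {

    public static String contentIdQuery(String contentSource, String contentId) {
        return "//" + contentSource + "/content[@contentId='" + contentId + "']";
    }
}
```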

In addition, I had to set up two SearchIndex XML blocks in the repository.xml: one nested within the Workspace element and one directly under Repository. This is because the QueryManager uses the built-in Lucene index internally to pull information out of the repository. I did not include this configuration initially because I thought it was for searching through content (such as on a content search page), rather than for doing lookups within content. The delta to repository.xml to include the SearchIndex is shown below:

<Repository>
  ...
  <Workspace name="${wsp.name}">
    ...
    <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
      <param name="path" value="${wsp.home}/index"/>
      <param name="useCompoundFile" value="true"/>
      <param name="minMergeDocs" value="100"/>
      <param name="volatileIdleTime" value="3"/>
      <param name="maxMergeDocs" value="100000"/>
      <param name="mergeFactor" value="10"/>
      <param name="maxFieldLength" value="10000"/>
      <param name="bufferSize" value="10"/>
      <param name="cacheSize" value="1000"/>
      <param name="forceConsistencyCheck" value="false"/>
      <param name="autoRepair" value="true"/>
      <param name="analyzer" value="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
      <param name="queryClass" value="org.apache.jackrabbit.core.query.QueryImpl"/>
      <param name="respectDocumentOrder" value="true"/>
      <param name="resultFetchSize" value="2147483647"/>
      <param name="extractorPoolSize" value="0"/>
      <param name="extractorTimeout" value="100"/>
      <param name="extractorBackLogSize" value="100"/>
    </SearchIndex>
  </Workspace>

  ...
  <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
    <param name="path" value="${rep.home}/repository/index"/>
  </SearchIndex>
  
</Repository>

With the new repository design, the loader changed slightly: it now hardcodes the node name for the content node and stores the contentId as a property of that node. I added a getContentByUrl(String contentSource, String url) method in addition to the original getContent(String contentSource, String contentId). In fact, the getContent() method also changed to use the XPath approach. Both are shown below:

  public DataHolder getContent(String contentSource, String contentId) throws Exception {
    if (session == null) {
      session = repository.login(new SimpleCredentials("user", "pass".toCharArray()));
    }
    DataHolder dataHolder = new DataHolder();
    Workspace workspace = session.getWorkspace();
    QueryManager queryManager = workspace.getQueryManager();
    Query query = queryManager.createQuery("//" + contentSource + 
      "/content[@contentId='" + contentId + "']", Query.XPATH);
    QueryResult result = query.execute();
    NodeIterator ni = result.getNodes();
    while (ni.hasNext()) {
      Node contentNode = ni.nextNode();
      PropertyIterator pi = contentNode.getProperties();
      dataHolder.setProperty("contentId", contentId);
      dataHolder.setProperty("contentSource", contentSource);
      while (pi.hasNext()) {
        Property prop = pi.nextProperty();
        dataHolder.setProperty(prop.getName(), prop.getValue().getString());
      }
      break;
    }
    return dataHolder;
  }
  
  public List<DataHolder> getContentByUrl(String contentSource, String url) 
      throws Exception {
    List<DataHolder> contents = new ArrayList<DataHolder>();
    if (session == null) {
      session = repository.login(new SimpleCredentials("user", "pass".toCharArray()));
    }
    Workspace workspace = session.getWorkspace();
    QueryManager queryManager = workspace.getQueryManager();
    Query query = queryManager.createQuery("//" + contentSource + 
      "/content[@url='" + url + "']", Query.XPATH);
    QueryResult result = query.execute();
    NodeIterator ni = result.getNodes();
    while (ni.hasNext()) {
      Node childNode = ni.nextNode();
      // under the new design every node is named "content", so pass the
      // contentId property, not the node name, on to getContent()
      contents.add(getContent(contentSource, 
          childNode.getProperty("contentId").getString()));
    }
    return contents;
  }
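One caveat with building these XPath strings by plain concatenation: a URL containing a quote character will break the predicate. A hypothetical escaping helper (not part of the JCR API; the standard XPath 1.0 trick of switching quote styles, falling back to concat() when both quote kinds appear) could wrap each value as a safe string literal:

```java
// hypothetical helper: wraps a value as an XPath 1.0 string literal so
// that embedded quotes cannot break (or inject into) the query
public class XPathEscape {

    public static String xpathLiteral(String value) {
        if (!value.contains("'")) {
            // no single quotes: a single-quoted literal is safe
            return "'" + value + "'";
        }
        if (!value.contains("\"")) {
            // single quotes but no double quotes: double-quote the literal
            return "\"" + value + "\"";
        }
        // both quote kinds present: stitch the pieces together with concat()
        StringBuilder sb = new StringBuilder("concat(");
        String[] parts = value.split("'", -1);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(", \"'\", ");
            }
            sb.append("'").append(parts[i]).append("'");
        }
        sb.append(")");
        return sb.toString();
    }
}
```

The predicate in getContentByUrl() would then become "/content[@url=" + XPathEscape.xpathLiteral(url) + "]".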

I haven't actually tried the SQL interface, but from the linked mail, it should not be too hard to use either. However, the SQL interface is Jackrabbit specific, so it's probably not such a good idea if one wants to migrate to a different JCR compliant repository in the future.
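For reference, and with the caveat that I have not run this myself, the equivalent lookup in the SQL dialect would presumably look something like the following (the nt:unstructured node type and the jcr:path pattern are my assumptions, and such a string would be passed to queryManager.createQuery(sql, Query.SQL)):

```java
// hypothetical: JCR SQL selects from a node type and constrains on
// properties; path scoping is done with a LIKE on the jcr:path pseudo-column
public class SqlQueryStrings {

    public static String contentIdSql(String contentSource, String contentId) {
        return "SELECT * FROM nt:unstructured"
            + " WHERE jcr:path LIKE '/" + contentSource + "/content%'"
            + " AND contentId = '" + contentId + "'";
    }
}
```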

I do plan on doing some more reading and experimenting with Jackrabbit, from the list of articles on this Jackrabbit Wiki page. If I find anything interesting, I will write about it.

2 comments (moderated to prevent spam):

Anonymous said...

Hey Sujit,
Could you please explain once more why the repository.xml contains two SearchIndex XML blocks?
For what do we need the SearchIndex in Workspace and in Repository? I'm really new to Jackrabbit.
I will really appreciate your help.
Thanks
Den

Sujit Pal said...

Hi Den, from what I could figure out (and from what I can remember, since it's been a while since I temporarily shelved Jackrabbit because of its internal storage of data as blobs), Jackrabbit uses the index located at wsp.home for its internal purposes, and it must be there. If you want to access your content by means other than its originally configured primary key, then you need to use the other index in rep.home. I think I initially tried to use the wsp.home index, thinking it would work for what I needed, but the error messages led me to set up the second index.