Saturday, June 04, 2011

Java Data Structure: Insertable StringBuilder

This post describes a data structure that was inspired by my younger son's laziness. In an attempt to keep me involved in my children's education, I have been assigned the chore of checking their homework each night when I get back from work. One of these (for my younger son) is a mini book-report (3-4 sentences) for something he has read that day. Invariably, in an attempt to complete his homework as quickly as possible and get back to more important things (like video games), he will construct sentences with missing words in them. When I point them out, he will do something like this:

1
2
3
          is                      up a                      the
This story about Jack. Jack climbed beanstalk. He then killed giant.
          ^                        ^                         ^

Well, perhaps not this bad, but you get the idea.

I have been working recently on a custom Solr component using the Lucene FastVectorHighlighter (some preliminary info here), involving a multi-pass highlighting strategy, where the first group of terms need to be highlighted in one color and the second group in a different color.

On the second pass (as in the first), I need the original source string, so I hit upon the idea of storing the inserts (the "is", "up a" and "the" in the example above) in a separate data structure and merge them in at the end of the processing. Not a hugely complex idea, of course, but one that could be useful to other people with similar needs, so I decided to share it here.

Here is the code for my InsertableStringBuilder class. This exposes the single StringBuilder method I often use (append), and an additional insert method that writes positional inserts into a SortedMap. The toString() method merges the contents of the StringBuilder and the contents of the inserts into a single string.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
// Source: src/main/java/com/mycompany/hilite/InsertableStringBuilder.java
package com.mycompany.hilite;

import java.util.SortedMap;
import java.util.TreeMap;

/**
 * A simple data structure that behaves like a StringBuilder,
 * except you can add inserts into it at specific character
 * positions (0-based) containing strings to be inserted at 
 * these positions. The toString() method will merge the inserts
 * and the string and return the full buffer.
 */
public class InsertableStringBuilder {

  private StringBuilder buf;
  private SortedMap<Integer,String> inserts;
  
  /**
   * Default ctor.
   */
  public InsertableStringBuilder() {
    this.buf = new StringBuilder();
    this.inserts = new TreeMap<Integer,String>();
  }

  /**
   * Ctor to instantiate this object with an input String.
   * @param s the input String to append.
   */
  public InsertableStringBuilder(String s) {
    this();
    this.buf.append(s);
  }

  /**
   * Similar to StringBuilder.append(String). Allows appending
   * strings to the main input String.
   * @param s the String to append.
   */
  public void append(String s) {
    this.buf.append(s);
  }
  
  /**
   * Add an insert string at the specified position. If
   * an attempt is made to insert past the end of the current
   * input String an ArrayIndexOutOfBoundsException will be
   * thrown. If an insert already exists at the requested 
   * position, the replace parameter controls whether the
   * new insert overwrites the older one or is ignored. 
   * @param pos the position to insert into.
   * @param s the insert string.
   * @param replace if true, older insert at this position,
   *        if present, will be replaced by this newer one.
   */
  public void insert(int pos, String s, boolean replace) {
    if (pos > buf.length()) {
      throw new IndexOutOfBoundsException(
        "Can't insert past end of string (pos=" + 
        pos + ", len=" + buf.length() + ")");
    }
    if (! replace) {
      if (! this.inserts.containsKey(pos)) {
        this.inserts.put(pos, s);
      }
    } else {
      this.inserts.put(pos, s);
    }
  }
  
  /**
   * Merges the input String and all the insert Strings
   * to create the merged string.
   * @return the merged string.
   */
  @Override
  public String toString() {
    StringBuilder obuf = new StringBuilder();
    int pos = 0;
    for (int ipos : inserts.keySet()) {
      if (pos < ipos) {
        obuf.append(buf.subSequence(pos, ipos));
      }
      obuf.append(inserts.get(ipos));
      pos = ipos;
    }
    if (pos < buf.length()) {
      obuf.append(buf.subSequence(pos, buf.length()));
    }
    return obuf.toString();
  }
}

Currently I have chosen to only implement the StringBuilder methods I am interested in. I would have preferred to just extend StringBuilder but sadly, it is marked final, so I had to take the containment approach shown above. Of course, if you do need more methods exposed from StringBuilder, it is trivial to expose them by just delegating to the method in the underlying StringBuilder.

Using this class, the merging code in lines 226-245 in the CustomFastVectorHighlighterTest class described in my earlier post changes to:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
      InsertableStringBuilder isb = new InsertableStringBuilder();
      isb.append(StringUtils.substring(
        source, startFragPosition, endFragPosition));
      for (Range termPosition : termPositions) {
        isb.insert(termPosition.getMinimumInteger() - startFragPosition, 
          pretag, false);
        isb.insert(termPosition.getMaximumInteger() - startFragPosition, 
          posttag, false);
      }
      return isb.toString();

There are two things I find rather nice about this class. First, of course, is that it hides the actual (fairly tedious) code to merge inserts in its toString() method. Second, with this class it is now possible to merge inserts from multiple sources into a single string without too much effort.

Be the first to comment. Comments are moderated to prevent spam.