Sunday, May 29, 2016

Elasticsearch based Image Search using RGB Signatures


In my previous post, I described some experiments I was doing to reduce images to a Bag of Visual Words (BOVW). My goal is to build a Content Based Image Retrieval (CBIR) system, i.e., a system that searches images based on their pixel content rather than on the text captions or tags associated with them. Furthermore, I would like to use a standard text search engine to do this - a lot of effort has gone into making these engines robust and scalable. So if I model the image search as a text search over a BOVW and deploy it to one of these engines, I get the robustness and scalability for free.

Just like text search, image search is a balance between precision and recall. You want the "right" result to appear on top, but you also want to see other results "like" the one you asked for. My attempt to model this fuzziness is to bin the pixels into coarser buckets along each channel; that way the post-processed image looks a little more like other images (improving recall), but still looks more similar to similar images than to dissimilar ones (so precision does not suffer too much).
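
To make this concrete, here is a minimal standalone sketch (with a few made-up pixel values) showing how coarse binning maps nearby intensities to the same bin:

import numpy as np

# 25 bin edges between 0 and 256, as in the vectorization code below
bins = np.linspace(0, 256, 25)
for px in [100, 105, 110, 115]:
    print("%d -> bin %d" % (px, np.searchsorted(bins, px)))
# 100 and 105 fall into the same bin, and so map to the same pixel-word;
# likewise 110 and 115, so slightly different images still end up with
# overlapping vocabularies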

In my previous post, I had experimented with KMeans clustering for binning the pixels along each channel, but it turned out to be too computationally intensive when run across my entire butterfly corpus of 200 images from Photorack. In any case, I read later that KMeans is not very effective in a single dimension, so I switched to binning the pixels along each channel into 25 equal-sized bins. At the end of this process, each image becomes a document composed of a vocabulary of 75 unique "words". Here is the code to read the images and write the corresponding data for loading into a search index.

# -*- coding: utf-8 -*-
# Source: es_build_rgb.py
import matplotlib.pyplot as plt
import os

import image_search_utils

INPUT_DIR = "../data/butterflies"
OUTPUT_FILE = "../data/butterflies_rgb.txt"

fout = open(OUTPUT_FILE, 'wb')
for fname in os.listdir(INPUT_DIR):
    print("Processing file: %s" % (fname))
    img = plt.imread(os.path.join(INPUT_DIR, fname))
    # reduce the image to a list of (pixel-word, frequency) tuples
    words = image_search_utils.get_vector_rgb(img)
    # serialize as space-separated word|frequency pairs
    words_str = " ".join([w[0] + "|" + ("%.3f" % (w[1])) for w in words])
    fout.write("%s\t%s\n" % (fname, words_str))
fout.close()

For convenience, I factored some of the common functionality into a utils package. One piece of this is the functionality to convert an image of shape (640, 480, 3) into a vector as described above. We need to apply the same transformation to an incoming query image, so this is also called by the search code. The code for the package is shown below.

# -*- coding: utf-8 -*-
# Source: image_search_utils.py
from __future__ import division, print_function
import collections
import numpy as np
import operator

def get_vector_rgb(img):
    """ Vectorizes an RGB image. Each channel's pixel intensities are
        binned separately into 25 equal size bins, and the frequencies
        counted. The end result is a vector of bin "pixel-words" that 
        are formed by one of ["r", "g", "b"] and the mid point of the
        bin. Each such pixel-word is associated with its frequency,
        arbitarily normalized to 100. Output is a list of (pixel-word,
        normalized-frequency) tuples ordered by highest to lowest 
        normalized frequency.
    """
    colors = ["r", "g", "b"]
    bins = np.linspace(0, 256, 25)
    means = 0.5 * (bins[1:] + bins[:-1]).astype("uint8")
    words = []
    for i in range(len(colors)):
        px_orig = img[:, :, i].flatten()
        labels = np.searchsorted(bins, px_orig)
        px_reduced = np.choose(labels.ravel(), means, 
                               mode="clip").astype("uint8").tolist()
        counter = collections.Counter(px_reduced)
        words.extend([(colors[i] + str(x[0]), x[1]) for x in counter.items()])
    words_sorted = sorted(words, key=operator.itemgetter(1), reverse=True)
    max_freq = words_sorted[0][1]
    words_sorted = [(x[0], 100.0 * x[1] / max_freq) for x in words_sorted]
    return words_sorted
    
def search(vec, es, index_name, doc_type, top_n=35, start=0, size=10):
    """ Does a payload search on the underly Elasticsearch index.
        The top N features of the image vector are used to search.
        The function query called implements cosine similarity so
        the output scores are between 1 and 0. Output is a list of
        file names and scores tuples, ordered by descending order 
        of score.
    """
    top_vec = vec[0:top_n]
    top_vec_as_text = " ".join([x[0] for x in top_vec])
    top_vec_as_param = ",".join(["\""+x[0]+"|"+str(x[1])+"\"" for x in top_vec])
    query = """
{
    "from": %d,
    "size": %d,
    "query": {
        "function_score": {
            "query": {
                "match": {
                    "imgsig": "%s"
                }
            },
            "script_score": {
                "script": {
                    "lang": "groovy",
                    "file": "payload",
                    "params": {
                        "params": [ %s ]
                    }
                }
            }
        }
    }
}
    """ % (start, size, top_vec_as_text, top_vec_as_param)
    resp = es.search(index=index_name, doc_type=doc_type, body=query)
    hits = resp["hits"]["hits"]
    return [(hit["_source"]["filename"], hit["_score"]) for hit in hits]

The output of the vectorization step looks something like this. Each line represents a single image. The record consists of the file name, followed by a tab, followed by up to 75 pixel-words, each separated from its frequency by a pipe. Each pixel-word starts with r, g, or b to indicate the channel, and the number following it is the mid point of one of the bins. The frequency has been somewhat arbitrarily normalized to a top value of 100; in retrospect I don't think I needed to do this.

1132868439-121.jpg     g122|100.000 r122|81.176 b122|71.557 r112|31.578 ...
1132868439-1210.jpg    r122|100.000 g122|94.113 b122|92.171 b37|24.449 ...
1132868439-12100.jpg   r122|100.000 g122|93.931 b122|92.702 b37|16.903 ...
1132868439-12101.jpg   r122|100.000 g122|96.838 b122|95.064 g26|34.409 ...
1132868439-12102.jpg   b122|100.000 g122|95.313 r122|94.820 r69|17.510 ...
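
For reference, here is a minimal sketch (with one record inlined) showing how such a line parses back into (pixel-word, frequency) tuples:

# parse one record of the signature file back into its parts
line = "1132868439-121.jpg\tg122|100.000 r122|81.176 b122|71.557"
filename, sig = line.strip().split("\t")
words = [(w.split("|")[0], float(w.split("|")[1])) for w in sig.split()]
print(words[:2])   # [('g122', 100.0), ('r122', 81.176)]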

For the index, I used Elasticsearch (ES) 2.3.3. Looking at the data format above, you probably guessed that I plan to use Lucene's Payloads feature. I have used Payloads before with Solr, but I am using ES more these days, so I figured it would be good to explore how to use Payloads with ES as well.

With the help of the O'Reilly ES book (a freebie from Elasticon 2016) and some awesome advice on Stack Overflow, I was finally able to get this data loaded. The code for creating the index schema and loading the data is shown below. If you look past the boilerplate, I am basically just declaring the analyzer chain for payload data, and declaring that my index has two fields - filename, a non-analyzed string, and imgsig, a payload field. Then I read the file I just generated and write the records into the index.

# -*- coding: utf-8 -*-
# Source: es_create_rgb.py + es_load_rgb.py
import elasticsearch

es = elasticsearch.Elasticsearch(hosts=[{
    "host": "localhost",
    "port": "9200"
}])
create_index = """
{
    "settings": {
        "analysis": {
            "analyzer": {
                "payloads": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase",
                        "delimited_payload_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "rgb": {
            "properties": {
                "filename": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "imgsig": {
                    "type": "string",
                    "analyzer": "payloads",
                    "term_vector": "with_positions_offsets_payloads"
                }
            }
        }
    }
}
"""
resp = es.indices.create(index="butterflies", ignore=400, body=create_index)
print(resp)

fin = open("../data/butterflies_rgb.txt", 'rb')
line_nbr = 1
for line in fin:
    filename, imgsig = line.strip().split("\t")
    es.index(index="butterflies", doc_type="rgb", id=line_nbr, body={
        "filename": "%s" % (filename),
        "imgsig": "%s" % (imgsig)
    })
    line_nbr += 1
fin.close()
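
As a sanity check (not part of the loading code above, and the exact method name can vary across client versions), the term vectors API can confirm that the pixel-words and their payloads made it into the index, since imgsig is mapped with term_vector set to with_positions_offsets_payloads:

# fetch term vectors for document 1 and list a few indexed pixel-words
resp = es.termvectors(index="butterflies", doc_type="rgb", id=1,
                      fields=["imgsig"])
terms = resp["term_vectors"]["imgsig"]["terms"]
print(sorted(terms.keys())[:5])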

For querying the data, however, the advice on Stack Overflow did not work for me. Specifically, ES complained that it couldn't compile the Groovy script. I ended up moving the Groovy script out of the request and into the config/scripts directory of the ES server, per the ES Scripting docs. I also ended up modifying the script a little to make it emit cosine similarity scores instead of the sum of the payload scores of the matched result, as it was doing before. My initial objective was to make the score reflect the importance of the query image's vector elements as well, but then I realized I could just normalize the number to get the cosine similarity. Here is the Groovy script for payload scoring. Some of the functions used are explained in more detail in the ES Advanced Scripting docs.

// Source: payload.groovy (deployed to the ES config/scripts directory)
numer = 0.0;
denom_l = 0.0;
denom_r = 0.0;
for (param in params) {
    // each param looks like "r122|100.0": a pixel-word and its weight
    def (word, weight) = param.tokenize('|');
    weight_l = weight.toFloat();
    denom_l = denom_l + (weight_l * weight_l);
    // positions (and payloads) for this term in the document's imgsig field
    termInfo = _index["imgsig"].get(word, _PAYLOADS);
    for (pos in termInfo) {
        weight_r = pos.payloadAsFloat(0);
        numer = numer + (weight_l * weight_r);
        denom_r = denom_r + (weight_r * weight_r);
    }
}
return numer / Math.sqrt(denom_l * denom_r);
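
In plain Python, the computation this script performs looks something like the sketch below (illustrative only - in the real query the right-hand weights come from the payloads stored in the index):

import math

def payload_cosine(query, doc):
    # query and doc map pixel-word -> weight; like the Groovy script,
    # the doc norm only covers words that also appear in the query
    numer = sum(w * doc.get(t, 0.0) for t, w in query.items())
    denom_l = sum(w * w for w in query.values())
    denom_r = sum(doc[t] * doc[t] for t in query if t in doc)
    return numer / math.sqrt(denom_l * denom_r) if denom_r > 0 else 0.0

print(payload_cosine({"r80": 10.0, "r90": 10.0},
                     {"r80": 10.0, "r90": 5.0, "b37": 2.0}))   # ~0.949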

The image_search_utils package above contains the query that calls this scoring function. A typical query looks like this. The query part matches records that contain the words r80 and r90, and the params provide the source words and frequencies to match against. Remember that each image can have a maximum of 75 features, ordered by importance, so we can increase precision by increasing the number of features from the query image, and increase recall by decreasing the number of features.

{
    "query": {
        "function_score": {
            "query": {
                "match": {
                    "imgsig": "r80 r90"
                }
            },
            "script_score": {
                "script": {
                    "lang": "groovy",
                    "file": "payload",
                    "params": {
                        "params": [ "r80|10.0", "r90|10.0" ]
                    }
                }
            }
        }
    }
}

For data exploration, I built a little web application with CherryPy. The main page shows me thumbnails of all the images in my corpus. Clicking on an image does a search using that image as the query. Here is the code for the web application.

# -*- coding: utf-8 -*-
# Source: image_search_web.py
from __future__ import print_function
import cherrypy
import elasticsearch
import matplotlib.pyplot as plt
import os

import image_search_utils

IMAGE_DIR = "/full/path/to/image-search/data/butterflies"
ES_HOST = "127.0.0.1"
ES_PORT = "9200"
ES_INDEXNAME = "butterflies"
ES_DOCTYPE = "rgb"

class ImageSearchService(object):
    
    def __init__(self):
        self.img_fnames = self._load_images()
        self.es = elasticsearch.Elasticsearch(hosts=[{
            "host": ES_HOST,
            "port": ES_PORT
        }])
        self.index_name = ES_INDEXNAME
        self.doc_type = ES_DOCTYPE

    def _load_images(self):
        img_fnames = []
        for fname in os.listdir(IMAGE_DIR):        
            img_fnames.append(fname)
        return img_fnames

    @cherrypy.expose
    def index(self):
        html = ("""<table cellspacing="0" cellpadding="0" border="1" width="100%">""")
        # 20 rows of 10 columns covers the 200-image corpus
        for i in range(20):
            html += ("""<tr>""")
            for j in range(10):
                curr_fname = self.img_fnames[i * 10 + j]
                html += ("""<td><a href="/search?q=%s"><img src="images/%s" width="100" height="75"/></a></td>""" % 
                    (curr_fname, curr_fname))
            html += ("""</tr>""")
        html += ("""</table>""")
        return html

    @cherrypy.expose    
    def search(self, q):
        img = plt.imread(os.path.join(IMAGE_DIR, q))
        term_vec = image_search_utils.get_vector_rgb(img)
        results = image_search_utils.search(term_vec, self.es, self.index_name, self.doc_type)
        html = """<h3>Query:</h3><br/>"""
        html += """<b>Filename:</b> %s<br/>""" % (q)
        html += """<b>Image:</b><br/>"""
        html += ("""<img src="images/%s" height="75" width="100"/><br/><hr/>"""
            % (q))
        html += """<h3>Results</h3><hr/>"""
        html += """<table cellspacing="0" cellpadding="0" border="1" width="100%">"""
        html += """<tr><th>Rank</th><th>Filename</th><th>Image</th><th>Score</th></tr>"""
        rank = 1
        for result in results:
            html += """<tr valign="top">"""
            html += """<td align="center">%d</td>""" % (rank)
            html += """<td align="center">%s</td>""" % (result[0])
            html += """<td align="center"><img src="images/%s" height="75" width="100"/></td>""" % (result[0])
            html += """<td align="center">%.5f</td>""" % (result[1])
            html += """</tr>"""
            rank += 1
        html += """</table>"""
        return html
    
if __name__ == "__main__":
    cherrypy.config.update({
        "server.socket_host": "127.0.0.1",
        "server.socket_port": 8080
    })
    conf = {
        "/images": {
            "tools.staticdir.on": True,
            "tools.staticdir.dir": IMAGE_DIR
        }
    }
    cherrypy.quickstart(ImageSearchService(), config=conf)
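
To try it out, I start the application with "python image_search_web.py" and point a browser at http://127.0.0.1:8080/ - the /images path serves the raw image files via CherryPy's staticdir tool.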

The screenshots below show the search query page (also the home page) and a results page. Notice that the query image and first result image on the results page are identical. This is true for most searches I ran, although in a few cases the query result appeared in the second position, and in fewer cases, at lower positions.

[Screenshot: the home page, a grid of butterfly image thumbnails]

[Screenshot: a search results page for one of the query images]

In order to figure out how good the search was overall, I ran an evaluation to measure the change in Mean Reciprocal Rank (MRR), for a random set of 50 images, as I varied the number of query image features. A quick illustration of MRR follows, then the code to do the evaluation.
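
MRR is the mean, over all queries, of the reciprocal of the rank at which the correct result appears. A tiny worked example with hypothetical ranks:

# MRR over three queries whose correct image came back at ranks 1, 2 and 5
ranks = [1, 2, 5]
mrr = sum(1.0 / r for r in ranks) / len(ranks)
print(mrr)   # (1.0 + 0.5 + 0.2) / 3, about 0.567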

# -*- coding: utf-8 -*-
# Source: eval_feats_rgb.py
from __future__ import division, print_function
import elasticsearch
import matplotlib.pyplot as plt
import numpy as np
import os

import image_search_utils

IMAGE_DIR = "../data/butterflies"
ES_HOST = "127.0.0.1"
ES_PORT = "9200"
ES_INDEXNAME = "butterflies"
ES_DOCTYPE = "rgb"

# gather 50 random unique images from collection
np.random.seed(42)
test_image_ids = set(np.random.choice(200, 50, replace=False).tolist())
test_images = []
curr_idx = 0
for fname in os.listdir(IMAGE_DIR):
    if curr_idx in test_image_ids:
        test_images.append(fname)
    curr_idx += 1

es = elasticsearch.Elasticsearch(hosts = [{
    "host": ES_HOST,
    "port": ES_PORT
}])
mrrs = []
# number of features to try: 1, 5, 10, ..., 75 (in Python 2, range
# returns a list, so the first element can be overwritten)
top_ns = range(0, 80, 5)
top_ns[0] = 1
for top_n in top_ns:
    print("Now running with top %d features..." % (top_n))
    mrr = 0.0
    for test_image in test_images:
        print("... querying with %s" % (test_image))
        img = plt.imread(os.path.join(IMAGE_DIR, test_image))
        img_vec = image_search_utils.get_vector_rgb(img)
        results = image_search_utils.search(img_vec, es, ES_INDEXNAME, 
                                            ES_DOCTYPE, top_n)
        result_images = [result[0] for result in results]                                            
        # if the query image is not in the top 10 results, credit it with
        # a reciprocal rank of 0.1 (i.e., as if it were at position 10)
        result_mrr = 0.1
        for rank in range(len(result_images)):
            if result_images[rank] == test_image:
                result_mrr = 1.0 / (rank + 1)
                break
        mrr += result_mrr
    mrrs.append(mrr / len(test_images))

plt.plot(top_ns, mrrs)
plt.xlabel("Number of features (Top N)")
plt.ylabel("Mean Reciprocal Rank (MRR)")
plt.grid(True)
plt.show()        

The output of this code is the chart below, which shows that we get the best MRR overall with around 35 features, and that the curve plateaus thereafter. The MRR at the plateau is around 0.8, so most of our query images are returned within the top 2 positions in the search results, which I think is quite good for such a simple model.

[Chart: MRR versus number of query image features]

The flip side of this analysis is to see how the scores decrease by position. Understanding this behavior would let us set a threshold for "good" matches. Keeping the top_n parameter set to 35, we run a small subset of images and chart their scores by position. The code to do this is shown below.

# -*- coding: utf-8 -*-
# Source: eval_scores_rgb.py
from __future__ import division, print_function
import elasticsearch
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import os

import image_search_utils

IMAGE_DIR = "../data/butterflies"
ES_HOST = "127.0.0.1"
ES_PORT = "9200"
ES_INDEXNAME = "butterflies"
ES_DOCTYPE = "rgb"
NUM_TESTS = 5

# gather 5 random unique images from collection
np.random.seed(42)
test_image_ids = set(np.random.choice(200, NUM_TESTS, replace=False).tolist())
test_images = []
curr_idx = 0
for fname in os.listdir(IMAGE_DIR):
    if curr_idx in test_image_ids:
        test_images.append(fname)
    curr_idx += 1

es = elasticsearch.Elasticsearch(hosts = [{
    "host": ES_HOST,
    "port": ES_PORT
}])

cmap = matplotlib.cm.get_cmap("Spectral")
color = np.linspace(0, 1, NUM_TESTS)

for i in range(NUM_TESTS):
    img = plt.imread(os.path.join(IMAGE_DIR, test_images[i]))
    img_vec = image_search_utils.get_vector_rgb(img)
    results = image_search_utils.search(img_vec, es, ES_INDEXNAME, ES_DOCTYPE,
                                        top_n=35, start=0, size=50)
    scores = [result[1] for result in results]
    plt.plot(range(1, len(scores) + 1), scores, color=cmap(color[i]))
plt.axhline(0.9, 0, 50, color='r', linewidth=1.5, linestyle="dotted")
plt.xlabel("Result Position (Rank)")
plt.ylabel("Score")
plt.grid(True)
plt.show()

And the chart generated by the above program is shown below. As can be seen, each query yields 5-10 images above the 0.9 mark (the red dotted line), which might be a good threshold to use if we were trying to find very similar images.

[Chart: score by result position for each of the 5 query images]

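Applying such a threshold to the search results is a one-liner; a sketch using the names from the evaluation script above:

# keep only hits scoring above the threshold suggested by the chart
SCORE_THRESHOLD = 0.9
good_matches = [(fname, score) for fname, score in results
                if score >= SCORE_THRESHOLD]
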
That's all I have for today. In this post, I described how ES handles Payloads and function queries. Looking back at the work I did with Solr payloads, I found the ES implementation quite intuitive and easy to use. I was also able to build an index where I store pre-built vectors (as opposed to having the index create vectors out of the text) and compute cosine similarities using function queries. I am actually quite pleasantly surprised at how well the functionality works. In the coming weeks I plan on making some improvements, and will have more to share once I do.

10 comments (moderated to prevent spam):

Sujit Pal said...

Thanks for the kind words psytek, and thanks for the link, I will look at it. Going by your description, one important difference between your approach and mine is that you probably relied on the Lucene similarity scoring, whereas I used the function query and payloads feature to compute a cosine similarity. It probably doesn't matter since we are just comparing relative positions, but it provides a little more control. Also by varying the number of features you want to compare with, you can tune the amount of recall you want.

Sujit Pal said...

Actually, having read your paper, I see what you mean, and agree with your original statement about the main difference. Thanks for the pointer to your paper, it was very interesting and useful.

richard Mao said...

Thanks for so many helpful blog posts and detailed explanations!!! I am using your blog posts to learn new things. You make everything easier to understand.

Sujit Pal said...

Thanks Richard, glad it helped. Actually the blog helps me to learn new things myself :-).

amit said...

From which package is this "import image_search_utils"? Can you please tell me which package I need to install for this?

Sujit Pal said...

Hi Amit, image_search_utils is just another Python file (image_search_utils.py) in the same directory; I put it into a separate file for convenience. The code for it is provided in this post - just search for "Source: image_search_utils" and you will find it.

Unknown said...

Hi Sujit, it looks like your Groovy script is not doing a true cosine similarity, because the document's norm (denom_r) won't include terms that weren't part of the query. Unfortunately, I haven't found a way to iterate over all of a field's terms in ES scripts.

Sujit Pal said...

Thanks for pointing that out, I didn't realize that. But thinking some more about it, I feel it is still quite usable. Since words not in the query image contribute nothing to the numerator, it makes no difference there whether these words are considered in the index image or not. The denominator will be lower for the index image, as you pointed out, but because these words don't appear in the numerator either, the score will still be below 1, so it satisfies my need for a bounded score that is easy to understand. Second, the missing words are ignored uniformly across images, so the ranking should be unaffected. So definitely incorrect, but still quite usable :-).

Anonymous said...

[Quotes the Groovy payload scoring script shown above in full.] Where do we add this code?

Sujit Pal said...

This is the Groovy script; it goes into the config/scripts directory of the Elasticsearch server. See the ES scripting documentation page.