Published: 2011-07-15
Tagged: ajax, couchdb, full text search, javascript, lucene, mapreduce

I will show how to exploit CouchDB's MapReduce mechanism to provide a minimalistic full text search on the documents in the database. In the end, I will use the map-phase (without reduce) and some client logic based on Ajax and JavaScript.

Native Full Text Search with CouchDB

Lucene - The Default

CouchDB doesn't provide a build-in full text search mechanism. The recommended procedure is to set-up the JVM based Lucene-indexer for CouchDB. This is certainly the right way to go in the general case, i.e. for getting good results. I tested it, and I was rather pleased with the quality of the search results. Unfortunately, there are some drawbacks too:

  • Installation and set-up of a JVM is required.

  • A python environment is required too, and finally

  • all components must be configured to work together.

Setting up Lucene becomes too much of a hassle if replication between various computers and maybe even mobile devices are involved.

Two Phase MapReduce - Theory

We discuss an approach to provide full-text-search on the basis of MapReduce queries. A two phase strategy presents itself immediately.

Phase One

In phase one, each word in a document and its origin (precisely the document-id when we consider CouchDB) is emitted. The result is a lexicographically sorted key/value list of words and IDs (refering to documents) which is sorted by the words. Such a result could look like the following:

{ "ruby" : "databases" }
              { "ruby" : "iso8601_javascript_ruby" }
              { "ruby" : "ot6" }
              { "ruby" : "ruby_rvm" }

The following listing (given in CoffeeScript syntax) shows a map-function for CouchDB that actually does this. It goes slightly beyond what we discussed so far; e.g. by excluding stop-words and providing the frequency of a word which will be used to rank documents later on:

(doc)->
                stopwords_en = {"a":true, "able":true, "about":true, "across":true}
                freq = {}
                ((doc.title + doc.content_source + doc.tags).toLowerCase().match(/\w+/g))
                  .forEach (word) ->
                    word = word.replace(/^[_]+/,"").replace(/[_]+$/,"")
                    if (not stopwords_en[word]?) # not a stopword
                      if (word.length>=3) # at least 3 characters
                        if (not (word.match /^\d+$/)? ) # not just a number
                          freq[word] = (freq[word] ? 0) + 1
                for own word,count of freq
                  emit(word,
                    {id:doc._id
                    ,count:count
                    })
              

Phase Two

In the second phase, the same word is aggregated with the various documents that contain the word. Following the previous example we would get a result similar to

{ "ruby" : [ "databases", "iso8601_javascript_ruby", "ot6", "ruby_rvm" ]}

The rest from here is straight forward. We combine the result-set of several words by the desired logic (AND in most cases) and possibly combine the frequencies to rank the results.

There is one particular catch when it comes to CouchDB: the right way to arrive at the result would be using something between Map and Reduce. I.e. a second map phase that accumulates all keys of the same word. Combining several phases is often used in Hadoop MapReduce, for example.

MapReduce in CouchDB is more restrictive: there is exactly one MapPhase which is optionally followed by exactly one reduce-phase. CouchDB expects the size of the output of the reduce-phase to shrink asymptotically fast, and it will throw and reduceoverflow_error_ otherwise. Unfortunately, our use case is unlikely to shrink fast. The data might shrink slightly in many cases, but it will not shrink at all in the worst case. There is a way to force CouchDB to perform the computation anyways. However, I'm not going to discuss it here since this restriction is very well founded.

Client-Side Aggregation

In a 2-tier scenario (that is CouchDB and a browser but nothing between them), the client can do the aggregation. We will take a conceptual look at this procedure.

  1. The client gets the complete list for one word (see phase one) by queering with ?key="word".

  2. It then reduces to the result as indicated in phase two.

The client does the same for other queried words and performs the desired logic with the list of ids, e.g. set-intersection for the AND relationship. The implementation shown in the ScreenCast following this paragraph uses asynchronous transport and caching. The JQuery and underscore libraries are used to that end.

As seen, this demo features also autocompletion for the search strings. The client side uses a slightly tweaked autocomplete plug-in from JQuery-UI.

Final Words

This approach will not work very well in all scenarios. Essentially, the limitations in CouchDB's MapReduce implementation force us to transport a considerable amount of data to the client. However, native full text search with CouchDB is feasible. It has some limitations, but it might be just "good enough". The overall set-up is drastically simpler compared to the standard procedure.