Cloud9: Exercises: Boolean retrieval

by Jimmy Lin

(Page first created: 18 Aug 2008; last updated:

Write a Boolean retrieval engine that uses the index created in the inverted index construction exercise. Your retrieval engine should be able to handle complex Boolean queries of the following form:

( slings AND arrows ) OR ( mortal AND coil )
( et AND tu AND brute )

Your system will need to parse nested Boolean queries. To make query processing simpler, you can choose the notation in which the queries are expressed: infix notation (as above) or prefix notation, e.g., "(AND slings arrows)". If you want to be particularly clever, write a query parser for reverse polish notation—it has many advantages, one of which is that all expressions can be unambiguously interpreted without parentheses.

As with previous exercises, use the sample input included in Cloud9, the Bible and the complete works of Shakespeare. As a bare minimum, your Boolean retrieval engine must be able to:

  • Process arbitrarily complex, nested queries, implementing both the "AND" and "OR" operators.
  • Return the docno and full text of the lines from the sample collection that matches the query.

For example, in response to the query "outrageous AND fortune", your retrieval engine should return something like:

4442172  the slings and arrows of outrageous fortune

To give another example, in response to the query "white AND rose", your retrieval engine should return something like:

7841087  from off this brier pluck a white rose with me
7841354  i pluck this white rose with plantagenet
7841879  giving my verdict on the white rose side
7841972  lest bleeding you do paint the white rose red
7842315  in sign whereof i pluck a white rose too
7842458  shall dye your white rose in a bloody red
7845524  shall send between the red rose and the white
8237199  until the white rose that i wear be dyed
8275306  the red rose and the white are on his face
9067070  we will unite the white rose and the red

Once you have the Boolean search engine working, try out these sample queries:

means AND deceit
(white OR red ) AND rose AND pluck
(unhappy OR outrageous OR (good AND your)) AND fortune

Practical Tips

This exercise is meant to be completed outside of Hadoop. I would recommend the following steps:

  1. Build your inverted index in Hadoop, as per the inverted index construction exercise. If you set the number of reducers to one, all the postings will be written to a single file. This will make the index easier to manipulate. Obviously, this isn't a scalable solution, but it will suffice for this exercise.
  2. Retrieve your index from HDFS with the hadoop dfs -get command. For more details, see this guide on Hadoop Shell commands.
  3. Scp your inverted index from the Hadoop cluster onto your local machine.

If you haven't already, this exercise may be a good opportunity to learn about different output formats. An OutputFormat (see Hadoop API) describes how output key-value pairs are written to HDFS. By default, Hadoop uses TextOutputFormat, which writes out the key-value pairs in human-readable text format. This is good for you, but can be annoying if you want to further manipulate the output programmatically—since you'll have the read in the text file and parse the key-value pairs back into Java objects. As an alternative, you might want to consider SequenceFileOutputFormat; you can specify that format with a method in JobConf:

conf.setOutputFormat(SequenceFileOutputFormat.class);

If you do this, the output of your MapReduce jobs will be stored in one or more SequenceFiles. The advantage of SequenceFiles is that they store key-value pairs in a machine-readable format, i.e., as serialized byte representations of the Java objects (not human readable, but can be programmatically manipulated quite easily). The SequenceFile API provides methods for reading key-value pairs—saving you the effort of having to manually parse plain text files.

Along the same lines, you might also want to take a look at MapFileOutputFormat, which writes the output key-value pairs to... as you've guessed, MapFiles! These files have the additional advantage of supporting random access to the keys. Once again, see the API for details.

Extensions

Some optional features you might want to consider implementing are:

  • Support for the boolean "NOT" operator. Or try the "AND-NOT" operator, which is a bit easier. The query "(a AND-NOT b)" retrieves documents that contain a, but not b.
  • Ordering of retrieved documents based on the number of matching terms (or something similarly sensical). For example, given the query "(a AND b)", lines that have four occurrences of the query terms should be place ahead of lines that have three occurrences of the query terms.

Back to main page

Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States Valid XHTML 1.0! Valid CSS!