In this exercise, you'll be creating an inverted index. An inverted index is a data structure common to nearly all information retrieval systems. Let us consider the following famous lines from Shakespeare's Merchants of Venice:
1: if you prick us do we not bleed 2: if you tickle us do we not laugh 3: if you poison us do we not die and 4: if you wrong us shall we not revenge
An inverted index consists of a collection of postings lists, one associated with each unique term in the collection. Each postings list consists of a number of individual postings. Each posting holds a document identifier and the frequency (i.e., count) of the term in that document.
Let's treat each line in the above sample data as if it were a "document". The complete inverted index would look like:
and : 1 : (3, 1) bleed : 1 : (1, 1) die : 1 : (3, 1) do : 3 : (1, 1), (2, 1), (3, 1) if : 4 : (1, 1), (2, 1), (3, 1), (4, 1) laugh : 1 : (2, 1) not : 4 : (1, 1), (2, 1), (3, 1), (4, 1) poison : 1 : (3, 1) prick : 1 : (1, 1) revenge : 1 : (4, 1) shall : 1 : (4, 1) tickle : 1 : (2, 1) us : 4 : (1, 1), (2, 1), (3, 1), (4, 1) we : 4 : (1, 1), (2, 1), (3, 1), (4, 1) wrong : 1 : (4, 1) you : 4 : (1, 1), (2, 1), (3, 1), (4, 1)
As you can see, we have a postings list for each word that appears in the collection. Let us look at the postings list corresponding to the term if in a bit more detail:
if : 4 : (1, 1), (2, 1), (3, 1), (4, 1)
The number directly after the term is its document frequency or df for short. The df specifies the number of documents that contain this term. Since if appears in all four documents, its df is 4. Although the df can be easily reconstructed by counting the number of postings, it is often explicitly stored in the inverted index. The postings list contains a number of postings, each of which is a (docno, tf) tuple. The docno is simply a unique identifier for the document (one through four, in this case). The tf, which stands for term frequency, is the number of times the term appears in the document. The term if appears once in every document.
Your task: write a MapReduce program that builds an inverted index. You can choose to store the df explicitly if you wish, but that's optional. Your postings, however, should be exactly as described above.
Run the inverted indexer on the sample input included in Cloud9, the Bible and the complete works of Shakespeare. As with the above case, treat each line as if it were an individual "document". When you map over a plain text file, the key passed to the mapper contains the byte offset of the line from the beginning of the file, while the value contains the text of the line. Use this offset value as the unique docno. As part of this exercise you'll also need to write a utility that takes a given docno (i.e., the byte offset) and returns the associated line.
Questions to answer:
- Look up the postings corresponding to the term "starcross'd". There should only be one line in the entire collection that contains the term. What is that line? What's its docno (byte offset)?
- Look up the postings corresponding to the term "gold". Generate a histogram of tf values. That is, in how many lines does "gold" appear once, twice, three times, etc.?
- Do the same for the terms "silver" and "bronze".