Cloud9: Exercises: Bigram counts

by Jimmy Lin

(Page first created: 13 Aug 2008; last updated:

Before embarking on this exercise, read the following two papers (if you haven't already):

  • Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System. Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP-03), pages 29-43. Bolton Landing, New York.
  • Dean, Jeffrey and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pages 137-150. San Francisco, California.

Also, make sure you've gone through the downloading Cloud9 and getting started with EC2 tutorial.

Part I: Count the bigrams

This exercise is quite straightforward: take the word count example edu.umd.cloud9.demo.DemoWordCount and extend it to count bigrams. Bigrams are simply sequences of two consecutive words. For example, the previous sentence contains the following bigrams: "Bigrams are", "are simply", "simply sequences", "sequence of", etc. Work with the sample collection included in Cloud9, the Bible and the complete works of Shakespeare.

Questions to answer:

  1. How many unique bigrams are there?
  2. List the top ten most frequent bigrams and their counts.
  3. What fraction of all bigrams occurrences does the top ten bigrams account for? That is, what is the cumulative frequency of the top ten bigrams?
  4. How many bigrams appear only once?

Part II: From counts to conditional probabilities

Extend your program to compute the conditional probability of a word given the preceding word. That is, the output of the code should be a table of values for P(Wn|Wn-1). Hint: to compute P(B|A), count up the number of occurrences of the bigram "A B", and then divide by the number of occurrences of all the bigrams that start with "A".

Questions to answer:

  1. What are the five most frequent words following the word "light"? What is the probability of observing each word?
  2. Same question, except for the word "contain".
  3. If there are a total of N words in your vocabulary, then there are a total of N2 possible values for P(Wn|Wn-1)—in theory, every word can follow every other word (including itself). What fraction of these values are non-zero? In other words, what proportion of all possible events are actually observed? To give a concrete example, let's say that following the word "happy", you only observe 100 different words in the text collection. This means that N-100 words are never seen after "happy" (perhaps the distribution of happiness is quite limited?).

Practical Tips

Here are some practical tips for doing this exercise with EC2. The easiest way to get started is to write a new class directly inside the Cloud9 project in Eclipse. Once you have some code you want to run, pull up a shell and change directory to umd-hadoop-core/build/ (this is where Eclipse automatically puts compiled class files). Then create a jar with the following command:

jar cvf cloud9.jar *

You can then scp the jar over to your EC2 cluster and run the job, per Step 6 in the downloading Cloud9 and getting started with EC2 tutorial.

Back to main page

Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States Valid XHTML 1.0! Valid CSS!