Cloud9: Exercises: PageRank exercise

by Jimmy Lin

(Page first created: 05 Feb 2008; last updated:

In this exercise, you are going to implement PageRank in MapReduce. Here are three separate graphs to work with:

The files are tab-delimited adjacency list representations of the graphs. The first token on each line represents the unique id of the source node, and the rest of the tokens represent the target nodes (i.e., outlinks from the source node). If a node does not have any outlinks, its corresponding line will contain only one token (the source node id).

To make your lives easier, I've created a small demo program that computes PageRank using the JUNG package (2.0 alpha1). The relevant class is edu.umd.cloud9.demo.SequentialPageRank, which takes two command-line arguments: the first is a file containing the graph (one of the ones above), and the second is the random jump factor (a typical setting is 0.15).

If you run the program on the small network, with a random jump factor of 0.15, these will be your command-line arguments (relative to umd-hadoop-core):

../umd-hadoop-dist/cloud9-docs/exercises/sample-small.txt 0.15

And here's the sample output:

Number of components: 1
Number of edges: 195
Number of nodes: 93
Random jump factor: 0.15

PageRank of nodes, in descending order:
8709207 0.04821883130281773
11287582 0.03471311923002198
9650960 0.034389246806747764
12610128 0.03394520471428961
8553535 0.03237815039961933
12518224 0.030345338942444594
11044077 0.028403133546904703
...

To help in your implementation, I've captured the complete output of edu.umd.cloud9.demo.SequentialPageRank for each of the above graphs:

Hints

If you're stuck, you might want to take a look at the source code of the JUNG implementation.

In the networks above, there are a relatively large number of dangling nodes, i.e., nodes with no outlinks. This frequently happens in the Web context also; for example, pages the crawler hasn't downloaded yet would appear as "dangling". For these nodes, you'll need to figure out how to "spread" its PageRank score to the other nodes.

Here's the a good order in which to tackle the various issues:

  1. Build an implementation that does not handle the random jump factor and does not handling dangling nodes.
  2. Add in support for dangling nodes.
  3. Add in support for the random jump factor.

Postscript

If you're curious: the nodes in these graphs represent MEDLINE records (abstracts in the life sciences domain). The links represent content-similarity links, i.e., pairs of abstracts that are similar in the words they contain. For example, consider pmid (unique identifier in the MEDLINE collection) 8709207. See the "Related Links" panel on the right hand side of the browser? The data provided above represent instances of graphs defined by such links.

Back to main page

Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States Valid XHTML 1.0! Valid CSS!