In this exercise, you are going to implement PageRank in MapReduce. Here are three separate graphs to work with:
- a small one (93 nodes, 195 edges)
- a medium one (316 nodes, 430 edges)
- a large one (1458 nodes, 3545 edges)
The files are tab-delimited adjacency list representations of the graphs. The first token on each line represents the unique id of the source node, and the rest of the tokens represent the target nodes (i.e., outlinks from the source node). If a node does not have any outlinks, its corresponding line will contain only one token (the source node id).
To make your lives easier, I've created a small demo program that
computes PageRank using
the JUNG package (2.0
alpha1). The relevant class
is edu.umd.cloud9.demo.SequentialPageRank, which takes
two command-line arguments: the first is a file containing the graph
(one of the ones above), and the second is the random jump factor (a
typical setting is 0.15).
If you run the program on the small network, with a random jump
factor of 0.15, these will be your command-line
arguments (relative to umd-hadoop-core):
../umd-hadoop-dist/cloud9-docs/exercises/sample-small.txt 0.15
And here's the sample output:
Number of components: 1 Number of edges: 195 Number of nodes: 93 Random jump factor: 0.15 PageRank of nodes, in descending order: 8709207 0.04821883130281773 11287582 0.03471311923002198 9650960 0.034389246806747764 12610128 0.03394520471428961 8553535 0.03237815039961933 12518224 0.030345338942444594 11044077 0.028403133546904703 ...
To help in your implementation, I've captured the complete output
of edu.umd.cloud9.demo.SequentialPageRank for each of the
above graphs:
- PageRank values for the small graph
- PageRank values for the medium graph
- PageRank values for the large graph
Hints
If you're stuck, you might want to take a look at the source code of the JUNG implementation.
In the networks above, there are a relatively large number of dangling nodes, i.e., nodes with no outlinks. This frequently happens in the Web context also; for example, pages the crawler hasn't downloaded yet would appear as "dangling". For these nodes, you'll need to figure out how to "spread" its PageRank score to the other nodes.
Here's the a good order in which to tackle the various issues:
- Build an implementation that does not handle the random jump factor and does not handling dangling nodes.
- Add in support for dangling nodes.
- Add in support for the random jump factor.
Postscript
If you're curious: the nodes in these graphs represent MEDLINE records (abstracts in the life sciences domain). The links represent content-similarity links, i.e., pairs of abstracts that are similar in the words they contain. For example, consider pmid (unique identifier in the MEDLINE collection) 8709207. See the "Related Links" panel on the right hand side of the browser? The data provided above represent instances of graphs defined by such links.