Cloud9: Accessing the Google/IBM CLuE cluster

by Jimmy Lin

(Page first created: 04 Sep 2008)

Introduction

These instructions are intended to help you get onto the Google/IBM Hadoop cluster, which is available via the NSF CLuE program and the Google/IBM Academic Cloud Computing Initiative (ACCI) to University of Maryland students working in the context of those two projects. Note that the access method described on this page is experimental and subject to change at any time as best practices are refined.

In these instructions, I will refer to the following machines:

  • GATEWAY: the gateway machine
  • JOBTRACKER: the node that runs the jobtracker
  • NAMENODE: the HDFS name node

In all cases, replace the references with the actual IP addresses of the machines (which you should have received separately). I am not publishing the IP addresses of these machines as a security measure.

1. Unpack Hadoop

Unpack Hadoop on your local machine. The current stable release is 0.17.2, which is located in umd-hadoop-dist/hadoop/. Unpack the tarball hadoop-0.17.2.tar.gz.
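
Here's a minimal sketch of the unpacking step, assuming a Unix-like shell with tar available and umd-hadoop-dist/ in your current directory. The function name unpack_hadoop is made up for illustration.

```shell
# Hypothetical helper: unpack a Hadoop tarball into a target directory.
# Usage: unpack_hadoop umd-hadoop-dist/hadoop/hadoop-0.17.2.tar.gz ~
unpack_hadoop() {
  tarball="$1"      # path to hadoop-0.17.2.tar.gz
  dest="${2:-.}"    # where to unpack; defaults to the current directory
  mkdir -p "$dest" && tar -xzf "$tarball" -C "$dest"
}
```

Afterwards you should find a hadoop-0.17.2/ directory, containing conf/ and bin/, under the destination you chose.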

2. Set up configuration files

On your local machine, open up hadoop-site.xml in hadoop-0.17.2/conf and add the following properties inside the <configuration> element:

<property>
<name>fs.default.name</name>
<value>hdfs://NAMENODE:9000</value>
</property>

<property>
<name>mapred.job.tracker</name>
<value>JOBTRACKER:9001</value>
</property>

<property>
<name>hadoop.rpc.socket.factory.class.default</name>
<value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>

<property>
<name>hadoop.socks.server</name>
<value>localhost:6789</value>
</property>

Make sure you substitute the actual IP addresses for NAMENODE and JOBTRACKER.
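
If you pasted the properties with the placeholders still in them, something like the following sketch can do the substitution for you. The function name fill_placeholders is hypothetical, and the 192.0.2.x addresses in the usage line are documentation-only stand-ins, not the real cluster addresses.

```shell
# Hypothetical helper: replace the NAMENODE and JOBTRACKER placeholders
# in a hadoop-site.xml with real addresses.
# Usage: fill_placeholders hadoop-0.17.2/conf/hadoop-site.xml 192.0.2.10 192.0.2.11
fill_placeholders() {
  file="$1"; namenode="$2"; jobtracker="$3"
  # Write to a temporary file first so a failed sed leaves the original intact.
  sed -e "s/NAMENODE/$namenode/g" -e "s/JOBTRACKER/$jobtracker/g" "$file" > "$file.tmp" &&
    mv "$file.tmp" "$file"
}
```

The first address goes into fs.default.name and the second into mapred.job.tracker; the ports (9000 and 9001) are left as they are.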

3. Open up a SOCKS proxy

On your local machine, open up a shell and invoke the following command:

ssh -D 6789 username@GATEWAY

Make sure you substitute your real username and the actual IP address for GATEWAY. Enter your password and log in when prompted.

Here, I'm assuming that you have ssh installed. On Windows, your best bet is to install Cygwin; ssh isn't installed by default, so you'll have to select the openssh package manually in the Cygwin installer.
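
The port you give to ssh -D has to match the one in hadoop.socks.server, so one option is to read it out of the config file instead of typing it twice. The following is only a sketch: socks_port and open_tunnel are hypothetical names, and the -N flag (which tells ssh not to start a remote shell) is my own variation on the command above.

```shell
# Print the port configured under hadoop.socks.server in a hadoop-site.xml.
# Assumes <name> and <value> sit on consecutive lines, as in the listing in step 2.
socks_port() {
  sed -n '/<name>hadoop\.socks\.server<\/name>/{n;s/.*<value>[^:]*:\([0-9]*\)<\/value>.*/\1/p;}' "$1"
}

# Hypothetical wrapper: open the SOCKS tunnel on whatever port the config names.
# With -N the session carries only the proxy and stays open until you press Ctrl-C.
open_tunnel() {
  ssh -N -D "$(socks_port "$1")" "$2"
}
# Usage: open_tunnel hadoop-0.17.2/conf/hadoop-site.xml username@GATEWAY
```

You will still be prompted for your password; the only difference is that no login shell appears afterwards.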

4. Test drive the cluster

With the above steps and a bit of luck, you should be able to access the Hadoop cluster from your local machine! Basically, you've set up a direct connection to the Hadoop cluster via a SOCKS proxy.

Try things out with a simple command:

hadoop dfs -ls /

If this works, then you're basically all set. You have access to all the hadoop commands from your local machine, e.g., hadoop jar... to submit a job, etc. With this setup, I find it unnecessary to use the Eclipse plug-in for Hadoop.
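
As a slightly fuller test drive, here's a sketch that lists the HDFS root, copies a small local file into HDFS, and lists the result. It assumes hadoop is on your PATH (the script lives in hadoop-0.17.2/bin) and that you have an HDFS directory you can write to; smoke_test and both arguments in the usage line are made-up names.

```shell
# Hypothetical smoke test: list the HDFS root, copy a local file into an
# HDFS directory, then list that directory. Stops at the first failure.
# Usage: smoke_test notes.txt /user/yourname
smoke_test() {
  local_file="$1"; hdfs_dir="$2"
  hadoop dfs -ls / &&
    hadoop dfs -put "$local_file" "$hdfs_dir/" &&
    hadoop dfs -ls "$hdfs_dir"
}
```

If the first command works but the put fails, the tunnel is probably fine and the problem is more likely permissions on the target directory.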

Note: Hadoop only works with SOCKS v5, so if you have an older version of ssh that only supports SOCKS v4, then you should upgrade (or get your admin to). This is a known problem with CLIP machines! This issue manifests in really unhelpful error messages.
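
One way to check the proxy independently of Hadoop is to ask curl to speak SOCKS v5 to it. This is just a sketch and assumes curl is installed and the tunnel from step 3 is listening on port 6789; check_socks5 is a hypothetical name.

```shell
# Hypothetical check: fetch a URL through the local proxy using SOCKS v5.
# --socks5 selects the protocol version, -s silences the progress meter,
# -o /dev/null discards the page, and -w prints only the HTTP status code.
check_socks5() {
  curl --socks5 localhost:6789 -s -o /dev/null -w '%{http_code}' "$1"
}
# Usage: check_socks5 http://JOBTRACKER:50030/
```

If curl fails here too, the problem most likely lies with the tunnel (or your ssh version) rather than with your Hadoop configuration.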

5. Connect to the Webapps

To access the jobtracker Webapp, you'll have to get your browser to use the SOCKS proxy. The preferred solution is FoxyProxy for Firefox. First, download and install the FoxyProxy extension. Then, take the following steps:

  1. Select menu item "Tools" > "FoxyProxy" > "Options".
  2. Click "Add New Proxy".
  3. In the "General" tab, make sure the proxy is enabled. Give it a name, e.g., "Google/IBM cluster".
  4. In the "Proxy Details" tab, select "Manual Proxy Configuration", enter "127.0.0.1" for "Host Name" and 6789 for "Port", check "SOCKS Proxy?", and make sure "SOCKS v5" is selected.
  5. In the "Patterns" tab, click "Add New Pattern". Give it a name (e.g., "default"), and enter the pattern "http://JOBTRACKER:*/*". Substitute the real IP address.
  6. You might want to add a similar pattern for accessing the HDFS Webapp on "http://NAMENODE:*/*", or generalize the two into a single pattern.
  7. Click "OK" to finish setting up the proxy.
  8. Back in FoxyProxy Options, change "Mode" to "Use proxies based on their pre-defined patterns and priorities". Click "Close" to finish up.

If you navigate to http://JOBTRACKER:50030/ in your browser, you should be able to access the jobtracker!
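
To get a feel for which URLs the patterns from steps 5 and 6 send through the proxy, here's a small sketch that mimics them with shell glob matching. It assumes that * in a FoxyProxy wildcard pattern means "any run of characters", as it does in a shell case pattern; the 192.0.2.x addresses are documentation-only stand-ins, and uses_proxy is a made-up name.

```shell
# Documentation-only addresses standing in for the real machines.
JOBTRACKER=192.0.2.11
NAMENODE=192.0.2.10

# Succeeds if the URL matches http://JOBTRACKER:*/* or http://NAMENODE:*/*.
# The quoted parts are matched literally; the unquoted */* is the wildcard part.
uses_proxy() {
  case "$1" in
    "http://$JOBTRACKER:"*/*|"http://$NAMENODE:"*/*) return 0 ;;
    *) return 1 ;;
  esac
}
# Usage: uses_proxy "http://192.0.2.11:50030/" && echo "goes through the proxy"
```

Under that reading, any port and any path on the two machines match, while every other site (and any https URL) bypasses the proxy and is fetched directly.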


Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States