Cloud9: Getting started with EC2

by Jimmy Lin

(Page first created: 01 Aug 2008; last updated: )

Introduction

This tutorial will get you started with Cloud9 on Amazon's EC2. By the end of this tutorial, you will have successful run a word count demo in Cloud9. Before we begin, a few notes:

  • For writing these instructions I used Hadoop 0.17.0 and Sun's Java JDK 1.6.0_06 on Windows XP (with Cygwin). However, these instructions should be applicable to other operating systems.
  • For Windows users: If you are using Windows, you must use Cygwin, as the EC2 tools will not work with the Windows command prompt. You will also need ssh, which is not installed as part of Cygwin by default. Once you've installed Cygwin, run the setup program and specifically install it.
  • Note that I'm showing commands as they apply to me: you'll have to change paths, name of machines, etc. as appropriate.
  • In capturing traces of commands running, I use the convention of [...] to indicate places where the output has been truncated.
  • You'll be typing a lot of commands on the command-line. What I've found helpful is to keep a text file open to keep track of the commands I've entered. This is useful for both fixing inevitable typos in command-line arguments and for retracing your steps later.
  • It is best to allocate an uninterrupted block of time to this tutorial, because once you start up an EC2 cluster, you're being charged by time.
  • There's a section at the end of this tutorial on common issues.

Just to give you an overview, here are the steps:

Let's get started!

Step 0: Download various software packages

Download and install the following software packages:

  • Java: (if you don't know what Java is you probably shouldn't be here...)
  • Eclipse: an IDE for Java
  • Subclipse: a Subversion client plug-in for Eclipse
  • Apache Ant: a build system for Java

Note that you technically do not need Ant for this tutorial to work, but it's a nice thing to have anyway for later.

Step 1: Check Out Subversion repositories

Start Eclipse. You'll to have tweak Subclipse options. Open Eclipse preferences (under Windows: Window > Preferences; under Mac: Eclipse > Preferences). Select option Team > SVN. Change SVN interface to "SVNKit".

Switch to repository exploring mode. To do so, select menu option: Window > Open Perspective > Other > SVN Repository Exploring.

Add repository by right clicking on left panel > New > Repository Location. The two repositories you want to check out are:

For each repository, expand the tree. You should see "branches", "tags", and "trunk". Right click on trunk > Checkout... Follow dialog to check out repository. When the checkout process is complete, open a shell and go to umd-hadoop-dist/hadoop/. Unpack the latest Hadoop 0.17.0 distribution:

$ tar xvfz hadoop-0.17.0.tar.gz

After Hadoop has been unpacked, go back to Eclipse, and switch back to the Java perspective. You should have two new projects: umd-hadoop-core and umd-hadoop-dist. If you're still getting errors, you might need to recompile the projects. Select menu option: Project > Clean...

You'll note that Hadoop itself is checked in the repository umd-hadoop-dist. This is intentional, to ensure everyone is using the same version of Hadoop (makes debugging easier). In the future, newer versions can be seamlessly rolled out by checking in a newer version in Subversion and having everyone update their sandboxes.

Step 2: Setting up Amazon EC2

Go to the Amazon AWS site to set up an AWS account. Follow the steps at Amazon's EC2 Getting Started Guide, up through the "Generating a Keypair" section of the "Running an Instance" page (page 5).

Helpful tip: when you generate an access key, try to avoid one that has a slash (/) in it—this will save you some effort later, because some tools cannot properly parse the slash and treats it as a path. If you get an access key that has a slash, just regenerate.

Step 3: Fire up a Hadoop cluster in the clouds!

You'll want to consult Running Hadoop on Amazon EC2 for reference, but I'll summarize the instructions below. To begin, make sure you're working with Hadoop 0.17.0 and have the EC2 environment variables properly set (see previous step). Note that all the Amazon tools begin with ec2-, which distinguishes them from tools bundle in the Hadoop distribution.

In case you are curious, here's how you find all available Hadoop images with the Amazon EC2 tool.

$ ec2-describe-images -x all | grep hadoop
IMAGE   ami-f73adf9e    cs345-hadoop-EC2-0.15.3/hadoop-0.15.3.manifest.xml [...]
IMAGE   ami-c55db8ac    fedora8-hypertable-hadoop-kfs/image.manifest.xml [...]
IMAGE   ami-64f6130d    hadoop-ec2-images/hadoop-0.14.1.manifest.xml [...]
IMAGE   ami-381df851    hadoop-ec2-images/hadoop-0.15.3.manifest.xml [...]
IMAGE   ami-461df82f    hadoop-ec2-images/hadoop-0.16.1.manifest.xml [...]
IMAGE   ami-ee53b687    hadoop-ec2-images/hadoop-0.17.0-i386.manifest.xml [...]
IMAGE   ami-f853b691    hadoop-ec2-images/hadoop-0.17.0-x86_64.manifest.xml [...]
IMAGE   ami-1220c57b    hadoop-ec2-images/hadoop-20071220.manifest.xml [...]
IMAGE   ami-fe7c9997    radlab-hadoop-4-large/image.manifest.xml [...]
IMAGE   ami-7f7f9a16    radlab-hadoop-4/image.manifest.xml [...]

Open the file umd-hadoop-dist/hadoop/hadoop-0.17.0/src/contrib/ec2/bin/hadoop-ec2-env.sh. Fill in the following variables with information from you own account:

  • AWS_ACCOUNT_ID (no dashes)
  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY

If you're using Windows, you may also want to tweak the following variables:

  • MASTER_PRIVATE_IP_PATH
  • MASTER_IP_PATH
  • MASTER_ZONE_PATH

Those variables have ~ as default, which in Windows will map to something like "C:\Documents and Settings\...". That's a path containing spaces, which breaks some of the Hadoop EC2 scripts.

Open a shell and go to umd-hadoop-dist/hadoop/hadoop-0.17.0/src/contrib/ec2. Launch a EC2 cluster and start Hadoop with the following command. You must supply a cluster name (test-cluster) and the number of slaves (2 in my case). At this point, it makes no sense to start up a large cluster (even if you can!)—one or two nodes is sufficient. After the cluster boots, the public DNS name will be printed to the console.

$ bin/hadoop-ec2 launch-cluster test-cluster 2
Testing for existing master in group: test-cluster
Starting master with AMI ami-ee53b687
Waiting for instance i-f6f22d9f to start
..........Started as ip-10-250-10-132.ec2.internal
[...]
Master is ec2-75-101-238-212.compute-1.amazonaws.com, ip is , zone is us-east-1b.
Adding test-cluster node(s) to cluster group test-cluster with AMI ami-ee53b687
i-cff22da6
i-cef22da7

Note: In Cygwin, the script may complain about not being able to find the dig DNS utility. There doesn't appear to be a standard Cygwin package that contains the utility, but not having it is okay (you'll notice that the actual IP address for the cluster is missing).

The meter has just started running... so you're being billed for usage starting now. After a little bit, you should be able to access the job track on port 50030 of the master, which in my case is:

http://ec2-75-101-238-212.compute-1.amazonaws.com:50030/

Congratulations, you've just started a Hadoop cluster in the clouds! If it doesn't work at first, the machines are probably booting up... wait a bit and then check again. You can use the EC2 tools to see the instances you're running:

$ ec2-describe-instances
RESERVATION     r-be3bf3d7      613871172339    test-cluster-master
INSTANCE        i-f6f22d9f      ami-ee53b687    ec2-75-101-238-212.compute-1.amazonaws.com [...]
RESERVATION     r-b73bf3de      613871172339    test-cluster
INSTANCE        i-cff22da6      ami-ee53b687    ec2-75-101-224-162.compute-1.amazonaws.com [...]
INSTANCE        i-cef22da7      ami-ee53b687    ec2-75-101-236-71.compute-1.amazonaws.com  [...]

Pretty cool, huh?

Step 4: Test drive the cluster

Let's log into the cluster and poke around. Open a shell on your local machine and go to umd-hadoop-dist/hadoop/hadoop-0.17.0/src/contrib/ec2. Log into the master:

$ bin/hadoop-ec2 login test-cluster

Now run the pi demo:

[root@ip-10-250-10-132 hadoop-0.17.0]# cd /usr/local/hadoop-0.17.0/
[root@ip-10-250-10-132 hadoop-0.17.0]# bin/hadoop jar hadoop-0.17.0-examples.jar pi 10 1000
Number of Maps = 10 Samples per Map = 1000
Wrote input for Map #0
[...]
Wrote input for Map #9
Starting Job
08/08/01 19:59:01 INFO mapred.FileInputFormat: Total input paths to process : 10
08/08/01 19:59:02 INFO mapred.JobClient: Running job: job_200808011938_0001
[...]
Job Finished in 32.679 seconds
Estimated value of PI is 3.1244

Okay, so the value of pi is a bit off... but you've successfully submitted your first Hadoop job! Go back to the job tracker and you should see the run.

Note that Hadoop 0.17.0 may complain about deprecated filesystem name. Don't worry about it.

Step 5: Transfer some data

Now we're getting ready to run the word count demo. Before doing that, you have to first transfer some data over to the cloud (those are the words we're counting). Make sure you're still logged into the master. The following command shows you what's in HDFS.

[root@ip-10-250-10-132 hadoop-0.17.0]# bin/hadoop dfs -ls
Found 2 items
/mnt    <dir>           2008-08-01 19:38        rwxr-xr-x       root    supergroup
/user   <dir>           2008-08-01 19:58        rwxr-xr-x       root    supergroup

Right now, not much, but we're going to put some stuff there. Open a shell on your local machine and go to umd-hadoop-dist/. Now scp the sample data over to the master.

$ scp -i YOUR_PATH/id_rsa-gsg-keypair sample-input/bible+shakes.* root@ec2-75-101-238-212.compute-1.amazonaws.com:/tmp

Of course, substitute YOUR_PATH and the name of your master accordingly. As a convention, I like to copy things over to /tmp on the master. Note that you're being charged for bandwidth usage in moving data into the clouds. Another note: you actually only need one of the files to run the word count demo, but it makes sense just to copy everything over anyway, in case you want to play with the other demos.

So now the data is on the local drive of the master. Next, we have to put the data into HDFS, i.e., distribute the data across all nodes in the cluster. Create the appropriate directories in HDFS:

[root@ip-10-250-10-132 hadoop-0.17.0]# bin/hadoop dfs -mkdir /shared
[root@ip-10-250-10-132 hadoop-0.17.0]# bin/hadoop dfs -mkdir /shared/sample-input

The dump the data into HDFS:

[root@ip-10-250-10-132 hadoop-0.17.0]# bin/hadoop dfs -put /tmp/bible+shakes.* /shared/sample-input

Now you should be able to see the data in HDFS.

[root@ip-10-250-10-132 hadoop-0.17.0]# bin/hadoop dfs -ls /shared/sample-input
Found 3 items
/shared/sample-input/bible+shakes.nopunc        <r 3>   9068074 [...]
/shared/sample-input/bible+shakes.nopunc.packed <r 3>   13117764 [...]
/shared/sample-input/bible+shakes.nopunc.packed2        <r 3>   24561594 [...]

Data's there... now for the code.

Step 6: Transfer some code and run the word count demo

Back on your machine, open a shell and go to umd-hadoop-core/build/ (which is where Eclipse automatically puts compiled class files). Jar up the class files:

$ jar cvf cloud9.jar edu/

If there's nothing in build/, go back to Step 1 and make sure the code compiled okay. Once you have created the jar, copy it over:

$ scp -i YOUR_PATH/id_rsa-gsg-keypair cloud9.jar root@ec2-75-101-238-212.compute-1.amazonaws.com:/tmp

And finally, submit the job to the cluster:

[root@ip-10-250-10-132 hadoop-0.17.0]# hadoop jar /tmp/cloud9.jar edu.umd.cloud9.demo.DemoWordCount
08/08/01 20:36:56 INFO mapred.FileInputFormat: Total input paths to process : 1
08/08/01 20:36:57 INFO mapred.JobClient: Running job: job_200808011938_0002
[...]
08/08/01 20:37:45 INFO mapred.JobClient:  map 100% reduce 100%
08/08/01 20:37:46 INFO mapred.JobClient: Job complete: job_200808011938_0002
08/08/01 20:37:46 INFO mapred.JobClient: Counters: 16
08/08/01 20:37:46 INFO mapred.JobClient:   File Systems
08/08/01 20:37:46 INFO mapred.JobClient:     Local bytes read=2703600
08/08/01 20:37:46 INFO mapred.JobClient:     Local bytes written=5443371
08/08/01 20:37:46 INFO mapred.JobClient:     HDFS bytes read=9091882
08/08/01 20:37:46 INFO mapred.JobClient:     HDFS bytes written=149157
08/08/01 20:37:46 INFO mapred.JobClient:   Job Counters
08/08/01 20:37:46 INFO mapred.JobClient:     Launched map tasks=20
08/08/01 20:37:46 INFO mapred.JobClient:     Launched reduce tasks=1
08/08/01 20:37:46 INFO mapred.JobClient:     Data-local map tasks=20
08/08/01 20:37:46 INFO mapred.JobClient:   Map-Reduce Framework
08/08/01 20:37:46 INFO mapred.JobClient:     Map input records=156215
08/08/01 20:37:46 INFO mapred.JobClient:     Map output records=1734298
08/08/01 20:37:46 INFO mapred.JobClient:     Map input bytes=9068074
08/08/01 20:37:46 INFO mapred.JobClient:     Map output bytes=15919397
08/08/01 20:37:46 INFO mapred.JobClient:     Combine input records=1734298
08/08/01 20:37:46 INFO mapred.JobClient:     Combine output records=135372
08/08/01 20:37:46 INFO mapred.JobClient:     Reduce input groups=41788
08/08/01 20:37:46 INFO mapred.JobClient:     Reduce input records=135372
08/08/01 20:37:46 INFO mapred.JobClient:     Reduce output records=41788

Congratulations! You have now learned the basics of Hadoop on EC2. For future reference, steps 0-2 need only be done once. Step 3 starts your cluster in the clouds, so you have to do it before you start working, every time. Step 5 is required every time you want to process a new dataset—you have to copy it into the clouds. Step 6 represents your typical debug cycle: write code, pack it up, scp it over to the cluster, and submit the job.

You might be wondering, how do you see the actual output of the program? Word counts are stored in a directory in HDFS specified in the MapReduce program. You can see those files by issuing the following command:

[root@domU-12-31-38-00-41-42 hadoop-0.17.0]# hadoop dfs -ls sample-counts
[...]
Found 2 items
/user/root/sample-counts/_logs  <dir>           2008-09-16 11:18        rwxr-xr-x       root    supergroup
/user/root/sample-counts/part-00000     <r 3>   447180  2008-09-16 11:19        rw-r--r--       root    supergroup

A file gets created for each reducer, and the final key-value pairs get written there. Since this demo specifies only one reducer, we have only one file. You can fetch this file from HDFS and then do whatever you want with it (examine the output, scp back to your local machine, etc.):

[root@domU-12-31-38-00-41-42 hadoop-0.17.0]# hadoop dfs -get sample-counts/part-00000 .
[...]
[root@domU-12-31-38-00-41-42 hadoop-0.17.0]# head part-00000
&c      70
&c'     1
''all   1
''among 1
''and   1
''but   1
''how   1
''lo    2
''look  1
''my    1
[root@domU-12-31-38-00-41-42 hadoop-0.17.0]# tail part-00000
zorites 1
zorobabel       3
zounds  20
zuar    5
zuph    3
zur     5
zuriel  1
zurishaddai     5
zuzims  1
zwaggered       1

For more on working with HDFS, see this guide to HDFS shell commands.

And that's it! If you're feeling up to it, you might want to continue on directly to my next tutorial, getting started with S3. Otherwise, remember the most important thing of all...

Step 7: Clean up!

You'll want to clean up after you're done. Remember, the meter doesn't stop (i.e., the bill continue to accumulate, dime by dime) until you shut down your Hadoop cluster. To do so, open a shell on your local machine and go to umd-hadoop-dist/hadoop/hadoop-0.17.0/src/contrib/ec2:

$ bin/hadoop-ec2 terminate-cluster test-cluster
Running Hadoop instances:
INSTANCE        i-f6f22d9f      ami-ee53b687    ec2-75-101-238-212.compute-1.amazonaws.com [...]
INSTANCE        i-cff22da6      ami-ee53b687    ec2-75-101-224-162.compute-1.amazonaws.com [...]
INSTANCE        i-cef22da7      ami-ee53b687    ec2-75-101-236-71.compute-1.amazonaws.com [...]
Terminate all instances? [yes or no]: yes
INSTANCE        i-f6f22d9f      running shutting-down
INSTANCE        i-cff22da6      running shutting-down
INSTANCE        i-cef22da7      running shutting-down

Confirm that the instances have indeed gone away:

$ ec2-describe-instances
RESERVATION     r-be3bf3d7      613871172339    test-cluster-master
INSTANCE        i-f6f22d9f      ami-ee53b687                    terminated [...]
RESERVATION     r-b73bf3de      613871172339    test-cluster
INSTANCE        i-cff22da6      ami-ee53b687                    terminated [...]
INSTANCE        i-cef22da7      ami-ee53b687                    terminated [...]
1b      aki-a71cf9ce    ari-a51cf9cc

And this concludes your first adventure in the clouds!

Postscript

Log into your AWS account and check your bill: right side of your screen, "Your Web Services Account" drop-down menu, "AWS Account Activity". How much cash did you burn?

In case you're interested, it cost me $0.82 to initially write this tutorial... I know, that's pretty inefficient—I used a total of eight node hours. In my defense, it took me a couple of tries to start up the cluster correctly (due to path problems), and I was writing at the same time...

Have fun computing in the clouds!

Common Issues

In Windows, avoid paths with ~ and paths with space

Paths with spaces break the EC2 scripts. Avoid using ~ in paths also since it (depending on setup) maps to C:\Document and Settings\..., which has spaces in it.

EC2 Authentication errors

If you're getting an error such as the following:

Client.AuthFailure: AWS was not able to validate the provided access credentials

Check to see if you've actually signed up for your EC2 account. Note that once you've signed up for your AWS account, you still must sign up for EC2.

Back to main page

Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States Valid XHTML 1.0! Valid CSS!