Introduction
This tutorial will get you started with Cloud9 on Amazon's EC2. By the end of this tutorial, you will have successful run a word count demo in Cloud9. Before we begin, a few notes:
- For writing these instructions I used Hadoop 0.17.0 and Sun's Java JDK 1.6.0_06 on Windows XP (with Cygwin). However, these instructions should be applicable to other operating systems.
- For Windows users: If you are using Windows, you must use Cygwin, as the EC2 tools will not work with the Windows command prompt. You will also need ssh, which is not installed as part of Cygwin by default. Once you've installed Cygwin, run the setup program and specifically install it.
- Note that I'm showing commands as they apply to me: you'll have to change paths, name of machines, etc. as appropriate.
- In capturing traces of commands running, I use the convention of [...] to indicate places where the output has been truncated.
- You'll be typing a lot of commands on the command-line. What I've found helpful is to keep a text file open to keep track of the commands I've entered. This is useful for both fixing inevitable typos in command-line arguments and for retracing your steps later.
- It is best to allocate an uninterrupted block of time to this tutorial, because once you start up an EC2 cluster, you're being charged by time.
- There's a section at the end of this tutorial on common issues.
Just to give you an overview, here are the steps:
- Step 0: Download various software packages
- Step 1: Check Out Subversion repository
- Step 2: Setting up Amazon EC2
- Step 3: Fire up a Hadoop cluster in the clouds!
- Step 4: Test drive the cluster
- Step 5: Transfer some data
- Step 6: Transfer some code and run the word count demo
- Step 7: Clean up!
- Postscript
Let's get started!
Step 0: Download various software packages
Download and install the following software packages:
- Java: (if you don't know what Java is you probably shouldn't be here...)
- Eclipse: an IDE for Java
- Subclipse: a Subversion client plug-in for Eclipse
- Apache Ant: a build system for Java
Note that you technically do not need Ant for this tutorial to work, but it's a nice thing to have anyway for later.
Step 1: Check Out Subversion repositories
Start Eclipse. You'll to have tweak Subclipse options. Open Eclipse preferences (under Windows: Window > Preferences; under Mac: Eclipse > Preferences). Select option Team > SVN. Change SVN interface to "SVNKit".
Switch to repository exploring mode. To do so, select menu option: Window > Open Perspective > Other > SVN Repository Exploring.
Add repository by right clicking on left panel > New > Repository Location. The two repositories you want to check out are:
umd-hadoop-dist: https://subversion.umiacs.umd.edu/umd-hadoop/distumd-hadoop-core: https://subversion.umiacs.umd.edu/umd-hadoop/core
For each repository, expand the tree. You should see "branches",
"tags", and "trunk". Right click on trunk > Checkout... Follow
dialog to check out repository. When the checkout process is
complete, open a shell and go to umd-hadoop-dist/hadoop/.
Unpack the latest Hadoop 0.17.0 distribution:
$ tar xvfz hadoop-0.17.0.tar.gz
After Hadoop has been unpacked, go back to Eclipse, and switch back
to the Java perspective. You should have two new projects:
umd-hadoop-core and umd-hadoop-dist. If
you're still getting errors, you might need to recompile the projects.
Select menu option: Project > Clean...
You'll note that Hadoop itself is checked in the
repository umd-hadoop-dist. This is intentional, to
ensure everyone is using the same version of Hadoop (makes debugging
easier). In the future, newer versions can be seamlessly rolled out
by checking in a newer version in Subversion and having everyone
update their sandboxes.
Step 2: Setting up Amazon EC2
Go to the Amazon AWS site to set up an AWS account. Follow the steps at Amazon's EC2 Getting Started Guide, up through the "Generating a Keypair" section of the "Running an Instance" page (page 5).
Helpful tip: when you generate an access key, try to avoid one that has a slash (/) in it—this will save you some effort later, because some tools cannot properly parse the slash and treats it as a path. If you get an access key that has a slash, just regenerate.
Step 3: Fire up a Hadoop cluster in the clouds!
You'll want to
consult Running
Hadoop on Amazon EC2 for reference, but I'll summarize the
instructions below. To begin, make sure you're working with Hadoop
0.17.0 and have the EC2 environment variables properly set (see
previous step). Note that all the Amazon tools begin
with ec2-, which distinguishes them from tools bundle in
the Hadoop distribution.
In case you are curious, here's how you find all available Hadoop images with the Amazon EC2 tool.
$ ec2-describe-images -x all | grep hadoop IMAGE ami-f73adf9e cs345-hadoop-EC2-0.15.3/hadoop-0.15.3.manifest.xml [...] IMAGE ami-c55db8ac fedora8-hypertable-hadoop-kfs/image.manifest.xml [...] IMAGE ami-64f6130d hadoop-ec2-images/hadoop-0.14.1.manifest.xml [...] IMAGE ami-381df851 hadoop-ec2-images/hadoop-0.15.3.manifest.xml [...] IMAGE ami-461df82f hadoop-ec2-images/hadoop-0.16.1.manifest.xml [...] IMAGE ami-ee53b687 hadoop-ec2-images/hadoop-0.17.0-i386.manifest.xml [...] IMAGE ami-f853b691 hadoop-ec2-images/hadoop-0.17.0-x86_64.manifest.xml [...] IMAGE ami-1220c57b hadoop-ec2-images/hadoop-20071220.manifest.xml [...] IMAGE ami-fe7c9997 radlab-hadoop-4-large/image.manifest.xml [...] IMAGE ami-7f7f9a16 radlab-hadoop-4/image.manifest.xml [...]
Open the
file umd-hadoop-dist/hadoop/hadoop-0.17.0/src/contrib/ec2/bin/hadoop-ec2-env.sh.
Fill in the following variables with information from you own account:
AWS_ACCOUNT_ID(no dashes)AWS_ACCESS_KEY_IDAWS_SECRET_ACCESS_KEY
If you're using Windows, you may also want to tweak the following variables:
MASTER_PRIVATE_IP_PATHMASTER_IP_PATHMASTER_ZONE_PATH
Those variables have ~ as default, which in Windows will map to something like "C:\Documents and Settings\...". That's a path containing spaces, which breaks some of the Hadoop EC2 scripts.
Open a shell and go
to umd-hadoop-dist/hadoop/hadoop-0.17.0/src/contrib/ec2.
Launch a EC2 cluster and start Hadoop with the following command. You
must supply a cluster name (test-cluster) and the number of slaves (2
in my case). At this point, it makes no sense to start up a large
cluster (even if you can!)—one or two nodes is sufficient.
After the cluster boots, the public DNS name will be printed to the
console.
$ bin/hadoop-ec2 launch-cluster test-cluster 2 Testing for existing master in group: test-cluster Starting master with AMI ami-ee53b687 Waiting for instance i-f6f22d9f to start ..........Started as ip-10-250-10-132.ec2.internal [...] Master is ec2-75-101-238-212.compute-1.amazonaws.com, ip is , zone is us-east-1b. Adding test-cluster node(s) to cluster group test-cluster with AMI ami-ee53b687 i-cff22da6 i-cef22da7
Note: In Cygwin, the script may complain about not being able to find the dig DNS utility. There doesn't appear to be a standard Cygwin package that contains the utility, but not having it is okay (you'll notice that the actual IP address for the cluster is missing).
The meter has just started running... so you're being billed for usage starting now. After a little bit, you should be able to access the job track on port 50030 of the master, which in my case is:
http://ec2-75-101-238-212.compute-1.amazonaws.com:50030/
Congratulations, you've just started a Hadoop cluster in the clouds! If it doesn't work at first, the machines are probably booting up... wait a bit and then check again. You can use the EC2 tools to see the instances you're running:
$ ec2-describe-instances RESERVATION r-be3bf3d7 613871172339 test-cluster-master INSTANCE i-f6f22d9f ami-ee53b687 ec2-75-101-238-212.compute-1.amazonaws.com [...] RESERVATION r-b73bf3de 613871172339 test-cluster INSTANCE i-cff22da6 ami-ee53b687 ec2-75-101-224-162.compute-1.amazonaws.com [...] INSTANCE i-cef22da7 ami-ee53b687 ec2-75-101-236-71.compute-1.amazonaws.com [...]
Pretty cool, huh?
Step 4: Test drive the cluster
Let's log into the cluster and poke around. Open a shell on your
local machine and go
to umd-hadoop-dist/hadoop/hadoop-0.17.0/src/contrib/ec2.
Log into the master:
$ bin/hadoop-ec2 login test-cluster
Now run the pi demo:
[root@ip-10-250-10-132 hadoop-0.17.0]# cd /usr/local/hadoop-0.17.0/ [root@ip-10-250-10-132 hadoop-0.17.0]# bin/hadoop jar hadoop-0.17.0-examples.jar pi 10 1000 Number of Maps = 10 Samples per Map = 1000 Wrote input for Map #0 [...] Wrote input for Map #9 Starting Job 08/08/01 19:59:01 INFO mapred.FileInputFormat: Total input paths to process : 10 08/08/01 19:59:02 INFO mapred.JobClient: Running job: job_200808011938_0001 [...] Job Finished in 32.679 seconds Estimated value of PI is 3.1244
Okay, so the value of pi is a bit off... but you've successfully submitted your first Hadoop job! Go back to the job tracker and you should see the run.
Note that Hadoop 0.17.0 may complain about deprecated filesystem name. Don't worry about it.
Step 5: Transfer some data
Now we're getting ready to run the word count demo. Before doing that, you have to first transfer some data over to the cloud (those are the words we're counting). Make sure you're still logged into the master. The following command shows you what's in HDFS.
[root@ip-10-250-10-132 hadoop-0.17.0]# bin/hadoop dfs -ls Found 2 items /mnt <dir> 2008-08-01 19:38 rwxr-xr-x root supergroup /user <dir> 2008-08-01 19:58 rwxr-xr-x root supergroup
Right now, not much, but we're going to put some stuff there. Open
a shell on your local machine and go to umd-hadoop-dist/.
Now scp the sample data over to the master.
$ scp -i YOUR_PATH/id_rsa-gsg-keypair sample-input/bible+shakes.* root@ec2-75-101-238-212.compute-1.amazonaws.com:/tmp
Of course, substitute YOUR_PATH and the name of your
master accordingly. As a convention, I like to copy things over
to /tmp on the master. Note that you're being charged
for bandwidth usage in moving data into the clouds. Another note: you
actually only need one of the files to run the word count demo, but it
makes sense just to copy everything over anyway, in case you want to
play with the other demos.
So now the data is on the local drive of the master. Next, we have to put the data into HDFS, i.e., distribute the data across all nodes in the cluster. Create the appropriate directories in HDFS:
[root@ip-10-250-10-132 hadoop-0.17.0]# bin/hadoop dfs -mkdir /shared [root@ip-10-250-10-132 hadoop-0.17.0]# bin/hadoop dfs -mkdir /shared/sample-input
The dump the data into HDFS:
[root@ip-10-250-10-132 hadoop-0.17.0]# bin/hadoop dfs -put /tmp/bible+shakes.* /shared/sample-input
Now you should be able to see the data in HDFS.
[root@ip-10-250-10-132 hadoop-0.17.0]# bin/hadoop dfs -ls /shared/sample-input Found 3 items /shared/sample-input/bible+shakes.nopunc <r 3> 9068074 [...] /shared/sample-input/bible+shakes.nopunc.packed <r 3> 13117764 [...] /shared/sample-input/bible+shakes.nopunc.packed2 <r 3> 24561594 [...]
Data's there... now for the code.
Step 6: Transfer some code and run the word count demo
Back on your machine, open a shell and go
to umd-hadoop-core/build/ (which is where Eclipse
automatically puts compiled class files). Jar up the class files:
$ jar cvf cloud9.jar edu/
If there's nothing in build/, go back to Step 1 and
make sure the code compiled okay. Once you have created the jar, copy
it over:
$ scp -i YOUR_PATH/id_rsa-gsg-keypair cloud9.jar root@ec2-75-101-238-212.compute-1.amazonaws.com:/tmp
And finally, submit the job to the cluster:
[root@ip-10-250-10-132 hadoop-0.17.0]# hadoop jar /tmp/cloud9.jar edu.umd.cloud9.demo.DemoWordCount 08/08/01 20:36:56 INFO mapred.FileInputFormat: Total input paths to process : 1 08/08/01 20:36:57 INFO mapred.JobClient: Running job: job_200808011938_0002 [...] 08/08/01 20:37:45 INFO mapred.JobClient: map 100% reduce 100% 08/08/01 20:37:46 INFO mapred.JobClient: Job complete: job_200808011938_0002 08/08/01 20:37:46 INFO mapred.JobClient: Counters: 16 08/08/01 20:37:46 INFO mapred.JobClient: File Systems 08/08/01 20:37:46 INFO mapred.JobClient: Local bytes read=2703600 08/08/01 20:37:46 INFO mapred.JobClient: Local bytes written=5443371 08/08/01 20:37:46 INFO mapred.JobClient: HDFS bytes read=9091882 08/08/01 20:37:46 INFO mapred.JobClient: HDFS bytes written=149157 08/08/01 20:37:46 INFO mapred.JobClient: Job Counters 08/08/01 20:37:46 INFO mapred.JobClient: Launched map tasks=20 08/08/01 20:37:46 INFO mapred.JobClient: Launched reduce tasks=1 08/08/01 20:37:46 INFO mapred.JobClient: Data-local map tasks=20 08/08/01 20:37:46 INFO mapred.JobClient: Map-Reduce Framework 08/08/01 20:37:46 INFO mapred.JobClient: Map input records=156215 08/08/01 20:37:46 INFO mapred.JobClient: Map output records=1734298 08/08/01 20:37:46 INFO mapred.JobClient: Map input bytes=9068074 08/08/01 20:37:46 INFO mapred.JobClient: Map output bytes=15919397 08/08/01 20:37:46 INFO mapred.JobClient: Combine input records=1734298 08/08/01 20:37:46 INFO mapred.JobClient: Combine output records=135372 08/08/01 20:37:46 INFO mapred.JobClient: Reduce input groups=41788 08/08/01 20:37:46 INFO mapred.JobClient: Reduce input records=135372 08/08/01 20:37:46 INFO mapred.JobClient: Reduce output records=41788
Congratulations! You have now learned the basics of Hadoop on EC2. For future reference, steps 0-2 need only be done once. Step 3 starts your cluster in the clouds, so you have to do it before you start working, every time. Step 5 is required every time you want to process a new dataset—you have to copy it into the clouds. Step 6 represents your typical debug cycle: write code, pack it up, scp it over to the cluster, and submit the job.
You might be wondering, how do you see the actual output of the program? Word counts are stored in a directory in HDFS specified in the MapReduce program. You can see those files by issuing the following command:
[root@domU-12-31-38-00-41-42 hadoop-0.17.0]# hadoop dfs -ls sample-counts [...] Found 2 items /user/root/sample-counts/_logs <dir> 2008-09-16 11:18 rwxr-xr-x root supergroup /user/root/sample-counts/part-00000 <r 3> 447180 2008-09-16 11:19 rw-r--r-- root supergroup
A file gets created for each reducer, and the final key-value pairs get written there. Since this demo specifies only one reducer, we have only one file. You can fetch this file from HDFS and then do whatever you want with it (examine the output, scp back to your local machine, etc.):
[root@domU-12-31-38-00-41-42 hadoop-0.17.0]# hadoop dfs -get sample-counts/part-00000 . [...] [root@domU-12-31-38-00-41-42 hadoop-0.17.0]# head part-00000 &c 70 &c' 1 ''all 1 ''among 1 ''and 1 ''but 1 ''how 1 ''lo 2 ''look 1 ''my 1 [root@domU-12-31-38-00-41-42 hadoop-0.17.0]# tail part-00000 zorites 1 zorobabel 3 zounds 20 zuar 5 zuph 3 zur 5 zuriel 1 zurishaddai 5 zuzims 1 zwaggered 1
For more on working with HDFS, see this guide to HDFS shell commands.
And that's it! If you're feeling up to it, you might want to continue on directly to my next tutorial, getting started with S3. Otherwise, remember the most important thing of all...
Step 7: Clean up!
You'll want to clean up after you're done. Remember, the meter
doesn't stop (i.e., the bill continue to accumulate, dime by dime)
until you shut down your Hadoop cluster. To do so, open a shell on
your local machine and go
to umd-hadoop-dist/hadoop/hadoop-0.17.0/src/contrib/ec2:
$ bin/hadoop-ec2 terminate-cluster test-cluster Running Hadoop instances: INSTANCE i-f6f22d9f ami-ee53b687 ec2-75-101-238-212.compute-1.amazonaws.com [...] INSTANCE i-cff22da6 ami-ee53b687 ec2-75-101-224-162.compute-1.amazonaws.com [...] INSTANCE i-cef22da7 ami-ee53b687 ec2-75-101-236-71.compute-1.amazonaws.com [...] Terminate all instances? [yes or no]: yes INSTANCE i-f6f22d9f running shutting-down INSTANCE i-cff22da6 running shutting-down INSTANCE i-cef22da7 running shutting-down
Confirm that the instances have indeed gone away:
$ ec2-describe-instances RESERVATION r-be3bf3d7 613871172339 test-cluster-master INSTANCE i-f6f22d9f ami-ee53b687 terminated [...] RESERVATION r-b73bf3de 613871172339 test-cluster INSTANCE i-cff22da6 ami-ee53b687 terminated [...] INSTANCE i-cef22da7 ami-ee53b687 terminated [...] 1b aki-a71cf9ce ari-a51cf9cc
And this concludes your first adventure in the clouds!
Postscript
Log into your AWS account and check your bill: right side of your screen, "Your Web Services Account" drop-down menu, "AWS Account Activity". How much cash did you burn?
In case you're interested, it cost me $0.82 to initially write this tutorial... I know, that's pretty inefficient—I used a total of eight node hours. In my defense, it took me a couple of tries to start up the cluster correctly (due to path problems), and I was writing at the same time...
Have fun computing in the clouds!
Common Issues
In Windows, avoid paths with ~ and paths with space
Paths with spaces break the EC2 scripts. Avoid using ~ in paths
also since it (depending on setup) maps to C:\Document and
Settings\..., which has spaces in it.
EC2 Authentication errors
If you're getting an error such as the following:
Client.AuthFailure: AWS was not able to validate the provided access credentials
Check to see if you've actually signed up for your EC2 account. Note that once you've signed up for your AWS account, you still must sign up for EC2.