Introduction
In Hadoop, the two basic data types are:
WritableComparableis the base interface for keys, andWritableis the base class interface for values.
Classes implementing the above two interfaces provide your basic primitives:
IntWriteable(ints),Text(strings),BytesWritable(raw bytes),- etc.
So what do you do when you need to work with more complex data types? Well, here are your options:
- Option 1: hack it — treat everything as text, parse it manually (this gets ugly real quick).
- Option 2: use the Cloud9 bindings for JSON — easy to use, but runs pretty slow.
- Option 3: use the Cloud9 Tuple library — runs faster than using JSON.
- Option 4: roll your own custom data type — runs much faster (up to an order of magnitude), but I wouldn't recommend this if you're just starting out since it requires writing more code.
- Alternatives: other things to look at.
At the very end, I'm going to discuss some benchmarks comparing options 2, 3, and 4.
Option 1: Hack it yourself
The simplest way to represent complex data is to encode it
into Text. For example, you could represent lists as
comma delimited values. Once you read the text (e.g., in a mapper),
split the string on commas and reconstruct the individual list
elements. Before emitting a key or a value, encode the list back into
Text in the same way. The downside of this technique, of course, is
an ugly mess. This is a quick way to get started with Hadoop, but
you'll want to quickly move past this phase.
Option 2: JSON bindings in Cloud9
JSON (JavaScript Object Notation) is a lightweight, text-based data
interchange format for representing simple data structures.
The Wikipedia article
about JSON and json.org are
good places for more information. The basic idea is to use JSON
objects as complex data structure inside Hadoop. Included with
Cloud9
is Java support for JSON
in the org.json package.
As an example, the following JSON object:
{
"firstName": "John",
"lastName": "Smith",
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": 10021
},
"phoneNumbers": [
"212 555-1234",
"646 555-4567"
]
}
Can be built with the following code snippet:
JSONObject obj = new JSONObject();
obj.put("firstName", "John");
obj.put("lastName", "Smith");
JSONObject address = new JSONObject();
address.put("streetAddress", "21 2nd Street");
address.put("city", "New York");
address.put("state", "NY");
address.put("postalCode", 10021);
JSONArray phoneNumbers = new JSONArray();
phoneNumbers.put("212 555-1234");
phoneNumbers.put("646 555-4567");
obj.put("address", address);
obj.put("phoneNumbers", phoneNumbers);
Cloud9 provides support for JSON
via JSONObjectWritable,
which extends JSONObject to provide serialization and
deserialization capabilities.
See DemoWordCondProbJSON.java
for an example of using JSON objects as keys in Hadoop, for computing
conditional probabilities.
One caveat is that JSONObjectWritable
implements Writable,
not WriteableComparable: this is because JSON objects, in
general, do not have a natural sort order. However, keys for
MapReduce jobs in Hadoop need to
implement WritableComparable: the solution is to
extend JSONObjectWritable to
implement WritableComparable and define your own custom
comparator:
public class MyJSONObject extends JSONObjectWritable implements WritableComparable {
public int compareTo(Object obj) {
....
}
}
If your custom JSON Object is defined as an inner class, remember
to declare it static so the Hadoop runtime is able
to instantiate objects independent of an enclosing instance of the
outer class.
Important: You will also want to
override hashCode() so that relevant JSON objects get
sent to the same reducer. Or else write a
custom Partitioner.
Option 3: Tuple library in Cloud9
See DemoWordCountTuple.java
for a demo of the Tuple class used as an intermediate key in a
MapReduce job.
The structure of each tuple is dictated by a schema. Schemas are
defined by the Schema
class. Here a sample code fragment of how a schema is defined:
public static final Schema MYSCHEMA = new Schema();
static {
MYSCHEMA.addField("token", String.class, "");
MYSCHEMA.addField("int", Integer.class, new Integer(1));
}
The addField method allows you to insert a field and
specify default values. The following are valid field types:
- Basic Java primitives: Boolean, Integer, Long, Float, Double, String
- Classes that implement Writable
Once a schema has been defined, tuples can be instantiated in one of two ways:
// method 1: new Tuple with default values
Tuple tuple1 = MYSCHEMA.instantiate();
// method 2: new Tuple with specified values
Tuple tuple2 = MYSCHEMA.instantiate("test", 2);
Calling the instantiate() method without any
parameters creates a new Tuple with
default values. Alternatively, you can directly specify the values of
each field using instantiate(Object...), the overloaded
method that takes a variable number of Objects as parameters.
Once a tuple is created, fields can be modified using the
set method; field values can be retrieved using the
get method. You can refer to a field by its integer
index position, or by its field name: the first is faster, but the
second makes code more readable.
Since a Tuple implements WritableComparable, it can be used directly in Hadoop without any effort. The class automatically takes care of serializing and deserializing the object.
Another feature of the Tuple class is its ability to store special
symbols. Each field in the Tuple can either hold an Object of the
type defined by its Schema, or a special symbol String. The method
containsSymbol can be used to check if a field contains a
special symbol. If the field contains a special symbol,
get will return null. If the field does not
contain a special symbol, getSymbol will return
null.
What's the use of this feature? Say you had tuples that
represented count(a,b), where a and
b are tokens you observe (i.e., these are the joint
counts). There is often a need to compute count(a,*),
which corresponds to the marginal count. In this case, you can use a
special symbol to represent the *, and distinguish it
from the lexical token '*' (where asterisks actually
appeared as tokens). Refer to
edu.umd.cloud9.demo.DemoWordCondProbTuple for a
well-commented basic demo that uses this special symbol feature.
Also of interest
is ArrayListWritable,
which provides a Hadoop data type for storing a list of
homogeneous Writable elements. This class, combined
with Tuple, allows you to define arbitrarily complex data
structures.
Option 4: Custom types
Finally, you can roll out your own custom data type by coding it from scratch. For rapid prototyping, this may be too slow; but once you know what you want, you'll probably want to write custom data types.
For Writable, you'll need to implement these
two methods:
- public void write(DataOutput out) throws IOException
- public void readFields(DataInput in) throws IOException
They tell the object how to serialize and deserialize itself. In
addition, make sure you have defined a zero argument constructor:
that's how the runtime creates objects of the type. Finally, make
sure that readFields "resets" the object every time it is
called: the Hadoop runtime will reuse objects whenever possible, which
means that readFields will be called repeatedly. Make
sure the state of the object is cleared before deserialization.
WritableComparable is a sub-interface
of Writable, so in addition to everything above, you'll
need to implement compareTo, which defines the sort order
of the object.
Benchmarks
Ah, so why so many options? It all boils down to a speed versus convenience tradeoff. JSON (Option 2) and the tuple facilities in Cloud9 (Option 3) are relatively easy to get started with, but run slow. Writing custom types will make your code run much faster, but well... you have to write more code (and tweaking the code can be a pain if you don't quite have the data structure right the first time). So, I would suggest rapid prototyping in JSON or the tuple facilities; write custom types when you finally got your algorithm nailed down.
As a reference, here are some benchmarks comparing options 2, 3, and 4. The "complex" data structure is pretty simple: just a pair of integers. The sort order is also pretty straightforward: sort by first number, then by second number. I compared three different data types:
PairOfInts- a Tuple with two fields
- a JSON object with two fields
I ran three different tasks:
- Creating 2 million new objects. Each is populated with two random numbers between 0 and 1000. All objects are added to an ArrayList.
- Cloning all 2 million objects. All new objects are added to another ArrayList.
- Sorting the second ArrayList
And here are the results (all in seconds):
| PairOfInts | Tuple | JSON | |
| Creating objects | 0.609 | 3.319 | 4.472 |
| Cloning objects | 0.576 | 2.303 | 4.972 |
| Sorting list | 1.681 | 7.591 | 11.644 |
Times were arrived at by taking the average of 10 trials. Experiments were conducted on Aug 6, 2008 on a 2.6GHz MacBook Pro running Windows XP and Cygwin. (Yes, yes, I know its blasphemous to run XP on a MBP: I actually dual-boot... just happened to be in Windows while running these experiments.)
You may wonder, why is JSON so slow? The reason is that in the
current implementation provided by org.json, the
underlying implementation of JSONObject uses HashMaps. Tuple uses
arrays. Array accesses are much much faster than looking up keys in a
hash.
If you're curious, you can run these benchmarks yourself: see code
in the edu.umd.cloud9.io package.
Alternatives
Other things you might want to look at:
- Hadoop's record I/O classes, in
org.apache.hadoop.record. The downside is that it's a relatively heavyweight implementation, since you must first define record types in a data description language (DDL), and then use Hadoop's translator to automatically generate code. - Google's Protocol Buffers.