Cloud9: Working with complex data types

by Jimmy Lin

(Page first created: 30 Oct 2007; last updated:

Introduction

In Hadoop, the two basic data types are:

  • WritableComparable is the base interface for keys, and
  • Writable is the base class interface for values.

Classes implementing the above two interfaces provide your basic primitives:

  • IntWriteable (ints),
  • Text (strings),
  • BytesWritable (raw bytes),
  • etc.

So what do you do when you need to work with more complex data types? Well, here are your options:

  • Option 1: hack it — treat everything as text, parse it manually (this gets ugly real quick).
  • Option 2: use the Cloud9 bindings for JSON — easy to use, but runs pretty slow.
  • Option 3: use the Cloud9 Tuple library — runs faster than using JSON.
  • Option 4: roll your own custom data type — runs much faster (up to an order of magnitude), but I wouldn't recommend this if you're just starting out since it requires writing more code.
  • Alternatives: other things to look at.

At the very end, I'm going to discuss some benchmarks comparing options 2, 3, and 4.

Option 1: Hack it yourself

The simplest way to represent complex data is to encode it into Text. For example, you could represent lists as comma delimited values. Once you read the text (e.g., in a mapper), split the string on commas and reconstruct the individual list elements. Before emitting a key or a value, encode the list back into Text in the same way. The downside of this technique, of course, is an ugly mess. This is a quick way to get started with Hadoop, but you'll want to quickly move past this phase.

Option 2: JSON bindings in Cloud9

JSON (JavaScript Object Notation) is a lightweight, text-based data interchange format for representing simple data structures. The Wikipedia article about JSON and json.org are good places for more information. The basic idea is to use JSON objects as complex data structure inside Hadoop. Included with Cloud9 is Java support for JSON in the org.json package.

As an example, the following JSON object:

 {
     "firstName": "John",
     "lastName": "Smith",
     "address": {
         "streetAddress": "21 2nd Street",
         "city": "New York",
         "state": "NY",
         "postalCode": 10021
     },
     "phoneNumbers": [
         "212 555-1234",
         "646 555-4567"
     ]
 }

Can be built with the following code snippet:

JSONObject obj = new JSONObject();

obj.put("firstName", "John");
obj.put("lastName", "Smith");
		 
JSONObject address = new JSONObject();
address.put("streetAddress", "21 2nd Street");
address.put("city", "New York");
address.put("state", "NY");
address.put("postalCode", 10021);

JSONArray phoneNumbers = new JSONArray();
phoneNumbers.put("212 555-1234");
phoneNumbers.put("646 555-4567");

obj.put("address", address);
obj.put("phoneNumbers", phoneNumbers);

Cloud9 provides support for JSON via JSONObjectWritable, which extends JSONObject to provide serialization and deserialization capabilities. See DemoWordCondProbJSON.java for an example of using JSON objects as keys in Hadoop, for computing conditional probabilities.

One caveat is that JSONObjectWritable implements Writable, not WriteableComparable: this is because JSON objects, in general, do not have a natural sort order. However, keys for MapReduce jobs in Hadoop need to implement WritableComparable: the solution is to extend JSONObjectWritable to implement WritableComparable and define your own custom comparator:

public class MyJSONObject extends JSONObjectWritable implements WritableComparable {
    public int compareTo(Object obj) {
        ....
    }
}

If your custom JSON Object is defined as an inner class, remember to declare it static so the Hadoop runtime is able to instantiate objects independent of an enclosing instance of the outer class.

Important: You will also want to override hashCode() so that relevant JSON objects get sent to the same reducer. Or else write a custom Partitioner.

Option 3: Tuple library in Cloud9

See DemoWordCountTuple.java for a demo of the Tuple class used as an intermediate key in a MapReduce job.

The structure of each tuple is dictated by a schema. Schemas are defined by the Schema class. Here a sample code fragment of how a schema is defined:

public static final Schema MYSCHEMA = new Schema();
static {
    MYSCHEMA.addField("token", String.class, "");
    MYSCHEMA.addField("int", Integer.class, new Integer(1));
}

The addField method allows you to insert a field and specify default values. The following are valid field types:

  • Basic Java primitives: Boolean, Integer, Long, Float, Double, String
  • Classes that implement Writable

Once a schema has been defined, tuples can be instantiated in one of two ways:

// method 1: new Tuple with default values
Tuple tuple1 = MYSCHEMA.instantiate();

// method 2: new Tuple with specified values
Tuple tuple2 = MYSCHEMA.instantiate("test", 2);

Calling the instantiate() method without any parameters creates a new Tuple with default values. Alternatively, you can directly specify the values of each field using instantiate(Object...), the overloaded method that takes a variable number of Objects as parameters.

Once a tuple is created, fields can be modified using the set method; field values can be retrieved using the get method. You can refer to a field by its integer index position, or by its field name: the first is faster, but the second makes code more readable.

Since a Tuple implements WritableComparable, it can be used directly in Hadoop without any effort. The class automatically takes care of serializing and deserializing the object.

Another feature of the Tuple class is its ability to store special symbols. Each field in the Tuple can either hold an Object of the type defined by its Schema, or a special symbol String. The method containsSymbol can be used to check if a field contains a special symbol. If the field contains a special symbol, get will return null. If the field does not contain a special symbol, getSymbol will return null.

What's the use of this feature? Say you had tuples that represented count(a,b), where a and b are tokens you observe (i.e., these are the joint counts). There is often a need to compute count(a,*), which corresponds to the marginal count. In this case, you can use a special symbol to represent the *, and distinguish it from the lexical token '*' (where asterisks actually appeared as tokens). Refer to edu.umd.cloud9.demo.DemoWordCondProbTuple for a well-commented basic demo that uses this special symbol feature.

Also of interest is ArrayListWritable, which provides a Hadoop data type for storing a list of homogeneous Writable elements. This class, combined with Tuple, allows you to define arbitrarily complex data structures.

Option 4: Custom types

Finally, you can roll out your own custom data type by coding it from scratch. For rapid prototyping, this may be too slow; but once you know what you want, you'll probably want to write custom data types.

For Writable, you'll need to implement these two methods:

  • public void write(DataOutput out) throws IOException
  • public void readFields(DataInput in) throws IOException

They tell the object how to serialize and deserialize itself. In addition, make sure you have defined a zero argument constructor: that's how the runtime creates objects of the type. Finally, make sure that readFields "resets" the object every time it is called: the Hadoop runtime will reuse objects whenever possible, which means that readFields will be called repeatedly. Make sure the state of the object is cleared before deserialization.

WritableComparable is a sub-interface of Writable, so in addition to everything above, you'll need to implement compareTo, which defines the sort order of the object.

Benchmarks

Ah, so why so many options? It all boils down to a speed versus convenience tradeoff. JSON (Option 2) and the tuple facilities in Cloud9 (Option 3) are relatively easy to get started with, but run slow. Writing custom types will make your code run much faster, but well... you have to write more code (and tweaking the code can be a pain if you don't quite have the data structure right the first time). So, I would suggest rapid prototyping in JSON or the tuple facilities; write custom types when you finally got your algorithm nailed down.

As a reference, here are some benchmarks comparing options 2, 3, and 4. The "complex" data structure is pretty simple: just a pair of integers. The sort order is also pretty straightforward: sort by first number, then by second number. I compared three different data types:

  • PairOfInts
  • a Tuple with two fields
  • a JSON object with two fields

I ran three different tasks:

  • Creating 2 million new objects. Each is populated with two random numbers between 0 and 1000. All objects are added to an ArrayList.
  • Cloning all 2 million objects. All new objects are added to another ArrayList.
  • Sorting the second ArrayList

And here are the results (all in seconds):

PairOfIntsTupleJSON
Creating objects0.6093.3194.472
Cloning objects0.5762.3034.972
Sorting list1.6817.59111.644

Times were arrived at by taking the average of 10 trials. Experiments were conducted on Aug 6, 2008 on a 2.6GHz MacBook Pro running Windows XP and Cygwin. (Yes, yes, I know its blasphemous to run XP on a MBP: I actually dual-boot... just happened to be in Windows while running these experiments.)

You may wonder, why is JSON so slow? The reason is that in the current implementation provided by org.json, the underlying implementation of JSONObject uses HashMaps. Tuple uses arrays. Array accesses are much much faster than looking up keys in a hash.

If you're curious, you can run these benchmarks yourself: see code in the edu.umd.cloud9.io package.

Alternatives

Other things you might want to look at:

  • Hadoop's record I/O classes, in org.apache.hadoop.record. The downside is that it's a relatively heavyweight implementation, since you must first define record types in a data description language (DDL), and then use Hadoop's translator to automatically generate code.
  • Google's Protocol Buffers.

Back to main page

Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States Valid XHTML 1.0! Valid CSS!