I spent much of today hung up on two very simple and unrelated problems while running an HBase MapReduce job. The mapper collects data stored as JSON in HBase and writes it to the context as Text (serialized JSON). The reducer deserializes the data and does some long-running calculations on it.
The primary reason these calculations must run in the reducer is that the HBase scanner lease times out after 60 seconds. If you do much of anything computationally expensive in the map phase, the scanner will time out and the job will fail. Lame. The other reason is that the calculation done in the reduce phase does not parallelize right now, so the job must collect all the data it needs in the map phase (which allows the job to be batched into different tasks).
Hangup #1: When deserializing Lists, the concrete implementation Gson chooses is LinkedList. Type reflection will not get you any other concrete List, such as ArrayList, and casting the result to one fails. This may just be a product of the specific version of Gson I'm using, but there is evidence of it here:
http://groups.google.com/group/google-gson/browse_thread/thread/903d164d76ca1115/3649e02e4e0dd9d5?fwc=1&pli=1
However, this thread implies that List casting is indeed possible:
http://stackoverflow.com/questions/5813434/trouble-with-gson-serializing-an-arraylist-of-pojos
This worked for me locally, but not on our remote cluster. I am packaging all the libraries as a jar and plopping it on the DistributedCluster, so perhaps another version of Gson exists on the cluster. I don't know yet (I just finished banging my head on this one).
I'll say this though: if you are serializing and deserializing Lists, you might as well declare them as the List interface rather than a concrete class, unless you need something very special from a particular List implementation.
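Here is a rough sketch of that advice, not the actual job code. It uses only the JDK: the deserialize method is a hypothetical stand-in for whatever the JSON library hands back (a LinkedList, as described above), and asArrayList is a made-up helper name. The point is that code written against List works regardless of the concrete class, and when an ArrayList is really required, copying is safe where a cast is not.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ListHandling {
    // Hypothetical stand-in for a JSON deserializer that picks its own
    // concrete List implementation (LinkedList here).
    static List<String> deserialize() {
        List<String> values = new LinkedList<>();
        values.add("a");
        values.add("b");
        return values;
    }

    // Returns the same list if it is already an ArrayList, otherwise a copy.
    static <T> ArrayList<T> asArrayList(List<T> list) {
        if (list instanceof ArrayList) {
            return (ArrayList<T>) list;
        }
        return new ArrayList<>(list);
    }

    public static void main(String[] args) {
        // Safe: depends only on the List interface.
        List<String> values = deserialize();

        // Safe: copies the elements instead of casting the LinkedList.
        ArrayList<String> copy = asArrayList(values);
        System.out.println(copy);

        // Unsafe: this cast would throw ClassCastException at runtime.
        // ArrayList<String> bad = (ArrayList<String>) values;
    }
}
```

The copy costs one pass over the list, which is usually negligible next to the JSON parsing itself.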
OK - on to hangup #2: Passing variables around in MapReduce jobs
This one is probably obvious to many, but if your mappers or reducers need some sort of state variable, like say the environment under which they are executing, you can pass it to them in the Configuration instance by calling its "set" method from the main job class. In particular, I was setting the environment from a command-line flag, which configured the main class properly but never reached the mappers and reducers, since they run as separate tasks on the cluster. Many folks seem to solve this issue by bootstrapping all sorts of environment variables with Bash, but I really don't like this. It spreads out the application.
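A minimal sketch of the Configuration approach follows. It assumes the Hadoop "mapreduce" API with Job.getInstance (older releases use new Job(conf, name) instead). The class names, the job name, the "myapp.environment" key and the "dev" default are all hypothetical. A plain Mapper stands in for the HBase mapper to keep it short.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class EnvAwareJob {
    // Hypothetical property name; any key that does not clash with Hadoop's own works.
    static final String ENV_KEY = "myapp.environment";

    public static class EnvMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String environment;

        @Override
        protected void setup(Context context) {
            // Runs once per task on the cluster, before any map() call.
            // Falls back to "dev" if the driver never set the key.
            environment = context.getConfiguration().get(ENV_KEY, "dev");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Illustration only: tag every record with the environment.
            context.write(new Text(environment), value);
        }
    }

    // Driver side: environment would come from the command-line flag.
    static Job buildJob(String environment) throws IOException {
        Configuration conf = new Configuration();
        // Set the value BEFORE creating the Job, because the Job copies the Configuration.
        conf.set(ENV_KEY, environment);
        Job job = Job.getInstance(conf, "env-aware-job");
        job.setJarByClass(EnvAwareJob.class);
        job.setMapperClass(EnvMapper.class);
        return job;
    }
}
```

A reducer can read the same key in its own setup method. Values set on the Configuration after the Job has been created are not seen by the tasks, which is an easy way to end up with exactly the symptom described above.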