Most Hadoop processing is not composed of a single job, but of a chain of jobs. Building and managing such a chain by hand is quite tricky, which is why people turn to higher-level frameworks like Pig. But then you have to learn a new language and new semantics. Apache Crunch aims to change this: why learn new semantics to do the same work?
For developers this means more focus on solving our actual problems and less on wrestling with MapReduce, Pig, or Hive. Crunch is available in Java and Scala and offers a higher level of flexibility than the other MapReduce tools under the Apache license. I will demonstrate how to build a chain of jobs in Crunch and perform operations such as joins and aggregations. Crunch is also quite extensible, so I will showcase how easy it is to write and build a library of reusable custom functions for our pipelines.
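To make the idea concrete, here is a minimal sketch of a Crunch pipeline that chains several stages: reading text, a custom tokenizing function, and an aggregation. The class name and the input/output paths are placeholders of my own; `MRPipeline`, `DoFn`, `parallelDo`, and `count` are part of the Crunch API.

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCountChain {
  public static void main(String[] args) throws Exception {
    // One Pipeline object represents the whole chain; Crunch plans
    // the underlying MapReduce jobs only when the pipeline is run.
    Pipeline pipeline = new MRPipeline(WordCountChain.class);

    // Stage 1: read raw lines (path is a placeholder).
    PCollection<String> lines = pipeline.readTextFile("/in/docs");

    // Stage 2: a custom function, the kind you would collect
    // into a reusable library — split lines into words.
    PCollection<String> words = lines.parallelDo(
        new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            for (String word : line.split("\\s+")) {
              emitter.emit(word.toLowerCase());
            }
          }
        }, Writables.strings());

    // Stage 3: aggregation — count occurrences of each word.
    PTable<String, Long> counts = words.count();

    // Stage 4: write results and execute the planned job chain.
    pipeline.writeTextFile(counts, "/out/wordcounts");
    pipeline.done();
  }
}
```

Joining two `PTable`s follows the same pattern: a single call such as `org.apache.crunch.lib.Join.join(left, right)` rather than a hand-written reduce-side join, and Crunch folds it into the same planned chain of jobs.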