Building Hadoop Pipelines using Apache Crunch

Jul 29, 2013

Most Hadoop processing consists not of a single job but of a chain of jobs. Building and managing such a chain is tricky, which is why people turn to other MapReduce frameworks such as Pig. But then you have to learn new semantics. Apache Crunch aims to change this: why learn new semantics to do the same thing?

For developers, this means more focus on solving actual problems rather than wrestling with MapReduce, Pig, or Hive. Crunch is available in Java and Scala and offers a higher level of flexibility than any of the current set of MapReduce tools under the Apache license. I will demonstrate how to build a chain of jobs in Crunch and perform operations such as joins and aggregations. Crunch is quite extensible, so I will also show how easy it is to write and build a library of reusable custom functions for our pipelines.


Rahul Sharma is a Senior Developer. He has 8 years of experience in the software industry and has worked on several projects using Java/J2EE as the primary technology. He has an inclination toward open source technologies and likes to explore new frameworks. He is one of the Apache Crunch developers. He has spoken at the IndicThreads conference (Pune 12).