SQL-on-Hadoop: Is SQL the next big step for Hadoop?

Aug 01, 2013

Since its early days, the Hadoop community has made several attempts to stretch Hadoop beyond its role as a distributed programming framework. The key strength that Hadoop brings to the table is its ability to scale linearly. Can we combine this advantage of Hadoop with the efficiency of databases? What does it take to run SQL over Hadoop?

Running SQL-on-Hadoop implies accessing data from “within” Hadoop using SQL as the interface. Accomplishing this demands a significant re-architecture of the storage and compute infrastructures.

SQL-on-Hadoop also shifts Hadoop’s role from a technology viewed so far as complementary to databases into one that could compete with them. It’s perhaps the single most significant feature that will help Hadoop find its way into more enterprises.

This session will highlight conceptual approaches to implementing SQL processors atop Hadoop, drawing examples from open-source and research products.


Srihari currently heads the technology organization for ThoughtWorks India. He has been a developer and architect on several enterprise applications, with a focus on building large-scale systems based on service-oriented architectures, domain-specific languages, etc. He is passionate about distributed systems and databases.

Building Hadoop Pipelines using Apache Crunch

Jul 29, 2013

Most Hadoop processing is composed not of a single job but of a chain of jobs. Building and managing such a chain is quite tricky, which is why people turn to other MapReduce frameworks like Pig. But then again, you have to learn new semantics. Apache Crunch aims to change this: why learn new semantics to do the same thing?

For developers this means more focus on solving our actual problems rather than wrestling with MapReduce, Pig, or Hive. Crunch is available in Java and Scala and offers more flexibility than the current set of MapReduce tools under the Apache license. I will demonstrate how we can build a chain of jobs in Crunch and perform operations such as joins and aggregations. Crunch is quite extensible, so I will also showcase how easy it is to write and build a library of reusable custom functions for our pipelines.
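To make the "chain of jobs" idea concrete: a real Crunch pipeline would use `MRPipeline`, `PCollection.parallelDo` with a `DoFn`, and `groupByKey` with an aggregator, all of which require a Hadoop runtime. As a self-contained stand-in, here is a minimal word-count sketch of the same map → group → aggregate chaining using plain Java streams, with comments mapping each stage to its Crunch counterpart.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PipelineSketch {

    // One logical pipeline built as a chain of stages, the way a Crunch
    // pipeline chains transformations over a PCollection.
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                // Stage 1: tokenize each line (Crunch: parallelDo with a DoFn<String, String>)
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())
                // Stage 2: group by word and count occurrences
                // (Crunch: count(), i.e. groupByKey() plus a summing aggregator)
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "crunch builds pipelines",
                "pipelines chain jobs");
        // "pipelines" appears once per line, so its count is 2
        System.out.println(wordCount(lines).get("pipelines")); // 2
    }
}
```

The point of the analogy is that each stage composes into the next without manually wiring intermediate jobs together, which is exactly the bookkeeping Crunch removes from hand-written MapReduce chains.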


Rahul Sharma is a Senior Developer at Mettl.com. He has 8 years of experience in the software industry and has worked on several projects using Java/J2EE as the primary technology. He is inclined toward open-source technologies and likes to explore new frameworks. He is one of the Apache Crunch developers. He has spoken at the IndicThreads conference (Pune, 2012).