Since its early days, the Hadoop community has made several attempts to stretch Hadoop beyond its role as a distributed programming framework. The key strength that Hadoop brings to the table is its ability to scale linearly. Can we combine this advantage with the efficiency of databases? What does it take to run SQL over Hadoop?
Running SQL-on-Hadoop implies accessing data from “within” Hadoop using SQL as the interface. Accomplishing this demands a significant re-architecture of both the storage and the compute infrastructure.
SQL-on-Hadoop also shifts Hadoop’s role from a technology that has so far been viewed as complementary to databases into one that could compete with them. It’s perhaps the single most significant feature that will help Hadoop find its way into more enterprises.
Here I’ll highlight some of the conceptual ideas behind the different ways SQL processors can be implemented atop Hadoop, drawing on examples from OSS and research-ware products.
