Harpreet Singh is an accomplished engineer and entrepreneur in the informatics and engineering domain, with a record of achievement in scientific and business leadership roles. He has over 14 years of industry and research experience, including fundamental biotechnology research in the areas of big data, enterprise platforms, cloud computing, and preclinical (toxicology) drug discovery. As an entrepreneur, he has set up and led multidisciplinary teams of 5–40 members building software products for the US and German markets.
He has been involved in research and in all aspects of development and product management, and has collaborated with the University of Wisconsin–Madison on a number of academic service contracts, alongside industry collaborations with EGI and Philips.
SQL-on-Hadoop: Is SQL the next big step for Hadoop?
Since its early days, the Hadoop community has made several attempts to stretch Hadoop beyond its role as a distributed programming framework. The key strength that Hadoop brings to the table is its ability to scale linearly. Can we combine this advantage of Hadoop with the efficiency of databases? What does it take to run SQL over Hadoop?
Running SQL-on-Hadoop implies accessing data from “within” Hadoop using SQL as the interface. Accomplishing this demands a significant re-architecture of the storage and compute infrastructures.
SQL-on-Hadoop also shifts Hadoop’s role from a technology viewed so far as complementary to databases into one that could compete with them. It is perhaps the single most significant feature that will help Hadoop find its way into more enterprises.
This session will highlight some conceptual ideas about the different ways SQL processors can be implemented atop Hadoop, drawing on examples from open-source and research-ware products.
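One common implementation strategy is compiling relational operators down to map and reduce phases. That idea can be illustrated in miniature with plain Java streams (a conceptual sketch only: the `Row` table and the query are made up for illustration, and a real SQL-on-Hadoop engine plans and distributes these stages across a cluster rather than running them in one process):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class SqlAsMapReduce {
    // Hypothetical rows of an "emp" table: (dept, salary).
    static final class Row {
        final String dept;
        final int salary;
        Row(String dept, int salary) { this.dept = dept; this.salary = salary; }
    }

    // SELECT dept, COUNT(*) FROM emp WHERE salary > 100 GROUP BY dept
    static Map<String, Long> query(List<Row> rows) {
        return rows.stream()
                .filter(r -> r.salary > 100)        // WHERE    -> map-side filter
                .collect(Collectors.groupingBy(     // GROUP BY -> shuffle on the key
                        r -> r.dept,
                        TreeMap::new,
                        Collectors.counting()));    // COUNT(*) -> reduce-side aggregate
    }

    public static void main(String[] args) {
        List<Row> emp = List.of(new Row("eng", 120), new Row("eng", 90), new Row("ops", 150));
        System.out.println(query(emp)); // prints {eng=1, ops=1}
    }
}
```

The filter maps cleanly onto the map phase and the grouped count onto the shuffle and reduce phases, which is why aggregate-heavy SQL is a natural fit for MapReduce-style execution.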
Building Hadoop Pipelines using Apache Crunch
Most Hadoop processing is composed not of a single job but of a chain of jobs. Building and managing such a chain is quite tricky, which is why people start to look at other MapReduce frameworks such as Pig. But then you have to learn new semantics. Apache Crunch aims to change this: why learn new semantics to do the same thing?
For developers this means more focus on solving our actual problems rather than wrestling with MapReduce, Pig, or Hive. Crunch is available in Java and Scala, is released under the Apache license, and offers a higher level of flexibility than the current set of MapReduce tools. I will demonstrate how we can build a chain of jobs in Crunch and perform operations such as joins and aggregations. Crunch is quite extensible, so I will also showcase how easy it is to write and build a library of reusable custom functions for our pipelines.
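As a taste of what such a chain looks like, here is a minimal word-count sketch against the Crunch Java API (assumptions: the input lines are invented, and `MemPipeline` is used so the chain runs in memory; a job on a real cluster would use `MRPipeline` with the same `PCollection` code):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
    // Chain two stages: tokenize lines, then aggregate counts per word.
    static Map<String, Long> wordCounts(String... lines) {
        PCollection<String> input = MemPipeline.typedCollectionOf(Writables.strings(), lines);
        PTable<String, Long> counts = input
                .parallelDo(new DoFn<String, String>() {   // stage 1: split lines into words
                    @Override
                    public void process(String line, Emitter<String> emitter) {
                        for (String word : line.split("\\s+")) {
                            emitter.emit(word);
                        }
                    }
                }, Writables.strings())
                .count();                                  // stage 2: aggregate per word
        Map<String, Long> result = new HashMap<>();
        for (Pair<String, Long> pair : counts.materialize()) {
            result.put(pair.first(), pair.second());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCounts("hello crunch", "hello pipelines"));
    }
}
```

Joins follow the same pattern: two `PTable`s keyed on the same type can be combined with `org.apache.crunch.lib.Join`, and Crunch's planner fuses the whole chain into as few MapReduce jobs as possible.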
Big Data Search Simplified With Elasticsearch
Most modern applications generate large amounts of data in order to understand the needs and preferences of their customers. However, finding meaningful information within this data is like finding a needle in a haystack. In this session we will look at some of the solutions currently used for big data search and then take a closer look at one of the frontrunners, Elasticsearch. GitHub, Foursquare, StumbleUpon, and SoundCloud all use Elasticsearch to analyze and search through terabytes of data and millions of search requests.
For Elasticsearch, we will be discussing:
What Elasticsearch is and how it works.
How Elasticsearch analyzes data by splitting a document into meaningful portions and indexing each portion separately, so that whenever a new search request comes in, it knows where to look.
Features and advantages of Elasticsearch, such as sensible built-in sharding defaults, fail-safe node clusters, and adding a new node automatically without having to reboot.
Out-of-the-box features for today’s applications, such as faceted search, reverse search using percolators, and pre-built analyzers.
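The analyze-then-index mechanism described above is essentially an inverted index. A toy version in plain Java (a deliberate simplification: the lowercase-and-split "analyzer" is a crude stand-in for Elasticsearch's configurable analysis chain, and a real index also stores term positions and frequencies for relevance scoring):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndex {
    // token -> ids of the documents that contain it
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Crude stand-in for an analyzer: lowercase, then split on non-letters.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    // Index each analyzed portion of the document separately.
    public void indexDoc(int docId, String text) {
        for (String token : analyze(text)) {
            index.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // A search is now a single map lookup, not a scan over every document.
    public Set<Integer> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.indexDoc(1, "Big Data search made simple");
        idx.indexDoc(2, "Searching big logs with Elasticsearch");
        System.out.println(idx.search("big")); // prints [1, 2]
    }
}
```

This is why the analysis step matters: the work of splitting documents happens once at index time, so each query only has to look up its terms.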
Essential Toolkit for an Aspiring Big Data Scientist
Big Data and data science are among the hottest terms in the technology world today. Every company seems to be doing some really complex work with Big Data. In Silicon Valley, many organisations are racing against time to build products around Big Data, which continues to grow bigger, more complex, and broader in scope. According to the Gartner Hype Cycle for Emerging Technologies, 2012, Big Data sits at the “Peak of Inflated Expectations”.
With the rise of Big Data comes a new breed of highly paid and scarce professionals: data scientists. Based on my experience of working in this space over the past 1.5 years, I would like to share my understanding of the current state of the art in terms of technologies. This talk is intended for software professionals interested in gaining an overview of the technology stack in Big Data projects, the skill set required, and how to work towards building these competencies. The broad agenda:
What’s a typical Big Data Project
Main Products and Technologies
How do they fit in at different stages, Competitive Landscape
Competencies of a Data Scientist
Statistics, Distributed Programming, Machine Learning, Text Mining, Data Visualization, Data Ingestion