Harpreet Singh

Speakers
Aug 26, 2013

Harpreet Singh is an accomplished engineer and entrepreneur in the informatics and engineering domains, with a record of achievements in scientific and business leadership roles. He has over 14 years of industry and research experience, including fundamental biotechnology research in the areas of big data, enterprise platforms, cloud computing and preclinical (toxicology) drug discovery. As an entrepreneur, he has set up and led multidisciplinary teams of 5–40 members building software products for the US and German markets.

He has been involved in research and in all aspects of development and product management. He has collaborated with the University of Wisconsin–Madison on a number of academic service contracts, along with industry collaborations with EGI and Philips.


Building Enterprise Big Data Platform For 100TB Dataset

SQL-on-Hadoop: Is SQL the next big step for Hadoop?

Sessions
Aug 1, 2013

Since its early days, the Hadoop community has made several attempts to stretch Hadoop beyond its role as a distributed programming framework. The key strength that Hadoop brings to the table is its ability to scale linearly. Can we combine this advantage of Hadoop with the efficiency of databases? What does it take to run SQL over Hadoop?

Running SQL-on-Hadoop implies accessing data from “within” Hadoop using SQL as the interface. Accomplishing this demands a significant re-architecture of the storage and compute infrastructures.

SQL-on-Hadoop also shifts Hadoop's role from a technology viewed so far as complementary to databases into something that could compete with them. It's perhaps the single most significant feature that will help Hadoop find its way into more enterprises.

This talk will highlight conceptual ideas behind the different ways that SQL processors can be implemented atop Hadoop, drawing on examples from OSS and research-ware products.
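To make the idea concrete, here is a toy sketch (not tied to any particular SQL-on-Hadoop engine, and with invented sample data) of how a SQL processor might translate a simple GROUP BY query into the map, shuffle and reduce phases that Hadoop executes:

```python
from collections import defaultdict

# Invented rows standing in for a table scanned from HDFS.
rows = [
    {"dept": "sales", "amount": 100},
    {"dept": "eng", "amount": 250},
    {"dept": "sales", "amount": 50},
]

# The query being "compiled":
#   SELECT dept, SUM(amount) FROM rows GROUP BY dept

# Map phase: emit (grouping key, value) pairs.
mapped = [(r["dept"], r["amount"]) for r in rows]

# Shuffle phase: bring all values for the same key together.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: apply the aggregate to each group.
result = {dept: sum(vals) for dept, vals in groups.items()}
print(result)  # {'sales': 150, 'eng': 250}
```

A real engine adds a query planner, columnar storage formats and distributed execution on top of this basic decomposition, but the mapping from relational operators to map/shuffle/reduce stages is the core of the re-architecture the abstract describes.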


Srihari currently heads the technology organization for ThoughtWorks India. He has been a developer and architect for several enterprise applications, with a focus on building large-scale systems based on service-oriented architectures, domain-specific languages, etc. He is passionate about distributed systems and databases.

Building Hadoop Pipelines using Apache Crunch

Sessions
Jul 29, 2013

Most Hadoop processing is composed not of a single job, but of a chain of jobs. Building and managing such a chain is quite tricky, which is why people start to look at other MR frameworks like Pig. But then again, you have to learn new semantics. Apache Crunch aims to change this: why learn new semantics to do the same thing?

For developers this means more focus on solving our actual problems rather than wrestling with MapReduce/Pig/Hive. Crunch is available in Java and Scala and offers a higher level of flexibility than any of the current MapReduce tools under the Apache license. I will demonstrate how to build a chain of jobs in Crunch and perform operations like joins and aggregations. Crunch is quite extensible, so I will also showcase how easy it is to write and build a library of reusable custom functions for our pipelines.
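Crunch's actual API is Java/Scala; purely to illustrate the idea of composing a pipeline as chained transformations instead of hand-wiring separate jobs, here is a toy Python sketch (the class and method names are invented for this sketch, not Crunch's API):

```python
from collections import Counter


class Pipeline:
    """Toy stand-in for a Crunch-style collection: each call
    chains a transformation, so a multi-step job reads as one
    fluent expression rather than several wired-together jobs."""

    def __init__(self, data):
        self.data = list(data)

    def parallel_do(self, fn):
        # Roughly analogous to applying a DoFn to every element.
        return Pipeline(fn(x) for x in self.data)

    def filter(self, pred):
        return Pipeline(x for x in self.data if pred(x))

    def count_by(self, key_fn):
        # A reusable "library function" built on the primitives.
        return Counter(key_fn(x) for x in self.data)


lines = ["error disk full", "info started", "error timeout"]
counts = (Pipeline(lines)
          .parallel_do(str.split)
          .filter(lambda words: words[0] == "error")
          .count_by(lambda words: words[0]))
print(counts)  # Counter({'error': 2})
```

In Crunch proper, the planner then compiles such a chain down to the minimal set of MapReduce jobs, which is the part a hand-rolled job chain makes you manage yourself.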


Rahul Sharma is a Senior Developer at Mettl.com. He has 8 years of experience in the software industry and has worked on several projects using Java/J2EE as the primary technology. He has an inclination toward open-source technologies and likes to explore and delve into new frameworks. He is one of the Apache Crunch developers. He has spoken at the IndicThreads conference (Pune, 2012).

Big Data Search Simplified With Elasticsearch

Sessions
Jul 29, 2013

Most modern applications generate large amounts of data in order to understand the needs and likes of their customers. However, finding meaningful information within this data is like finding a needle in a haystack. In this session we will look at some solutions currently being used for Big Data search, and then take a closer look at one of the frontrunners, Elasticsearch. GitHub, Foursquare, StumbleUpon and SoundCloud all use Elasticsearch to analyze and search through terabytes of data and millions of search requests.

On Elasticsearch, we will be discussing:

  • What Elasticsearch is and how it works.
  • How Elasticsearch analyzes data by splitting a document into meaningful portions and indexing each of those portions separately, so that whenever a new search request comes in, it knows where to find matches.
  • Features and advantages of Elasticsearch, such as sensible built-in sharding defaults, fail-safe node clusters, and adding a new node without having to reboot.
  • Out-of-the-box features for today’s applications, like faceted search, reverse search using percolators, and pre-built analyzers.
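The indexing idea in the second bullet can be sketched as a toy inverted index: split each document into tokens and map each token back to the documents containing it (a drastic simplification of what Elasticsearch and Lucene actually do; the sample documents are invented):

```python
from collections import defaultdict

# Invented documents standing in for indexed application data.
docs = {
    1: "big data search made simple",
    2: "search requests at big scale",
}

# "Analysis": split each document into tokens and index each
# token separately, pointing back at the documents it came from.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)


def search(query):
    """A query only looks up its tokens in the index instead of
    scanning every document, which is what makes search fast."""
    sets = [index[t] for t in query.lower().split()]
    return set.intersection(*sets) if sets else set()


print(search("big search"))  # {1, 2}
print(search("scale"))       # {2}
```

Real analyzers add stemming, stop-word removal and tokenization rules per language, and the index is sharded across nodes, but the lookup structure is the same.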

Manoj Mohan is a software developer at IntelliGrape Software, based in Noida, UP. He has worked on various technologies for applications, ranging from building custom solutions in Grails to PhoneGap and GXT. He is always fastidious about the available frameworks, and loves tinkering with tools to get the most productivity with the least hassle.

Essential Toolkit for an Aspiring Big Data Scientist

Uncategorized
Jul 28, 2013

Big Data and Data Science are among the hottest terms in the technology world right now. Every company seems to be doing some really complex work with Big Data, and in Silicon Valley a lot of organisations are racing against time to build products around it. Big Data continues to get bigger, more complex and broader in scope. According to the Gartner Emerging Technology Hype Cycle 2012, Big Data sits at the “Peak of Inflated Expectations”.

With the rise of Big Data comes a new breed of highly paid and scarcely available professionals: Data Scientists. Based on my experience of working in this space for the past 1.5 years, I would like to share my understanding of the current state of the art in terms of technologies. This talk is intended for software professionals interested in an overview of the technology stack used in Big Data projects, the skill set required, and how to work towards building these competencies. The broad agenda is:

  • What’s a typical Big Data Project
  • Main Products and Technologies
    • How they fit in at different stages; competitive landscape
  • Competencies of a Data Scientist
    • Statistics, Distributed Programming, Machine Learning, Text Mining, Data Visualization, Data Ingestion
  • Toolkit (Demo of some of the tools and libraries)
    • R / Python / Java,  Map-Reduce / Storm, R Libraries / Mahout / WEKA, Hive / Pig, Pentaho / Tableau / Excel, Sqoop / Flume / Oozie
  • My Experiences in Big Data Project at European Bank
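As a small taste of the “Data Ingestion” and “Statistics” items above, here is a self-contained sketch (with invented sample data) of ingesting CSV records and computing summary statistics using only Python’s standard library:

```python
import csv
import io
import statistics

# Invented sample data standing in for an ingested file.
raw = io.StringIO("customer,balance\na,100\nb,250\nc,130\n")

# Ingestion: parse CSV rows into typed values.
balances = [float(row["balance"]) for row in csv.DictReader(raw)]

# Statistics: basic descriptive summaries.
print(statistics.mean(balances))   # 160.0
print(statistics.stdev(balances))  # sample standard deviation
```

In a real project the ingestion side would be Sqoop or Flume feeding HDFS, and the statistics side R or a Python library, but the shape of the work (parse, type, summarize) is the same.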

Narinder Kumar is a practicing technologist, learning entrepreneur and passionate product developer. He has been part of the IT industry since 1996, during which time he has worked across diverse industries and different countries and performed different roles. He is currently working on a Big Data project at ING Bank, Netherlands, where his work includes different components of Hadoop, machine learning algorithms, NoSQL data stores and cloud frameworks. He is also a certified trainer for Apache Hadoop trainings delivered by Cloudera…