Real-Time Query


As useful as Hive is, the latency of Hadoop means that each query in an interactive Hive session will take many seconds, which makes it difficult to explore and evolve ideas quickly. Impala is a new query engine that bypasses MapReduce for very fast queries over data sets in HDFS. It uses HiveQL as the query language. Impala is very new; the first production release is forthcoming. It currently doesn’t support all HiveQL features, but in many scenarios, speedups of 100x over Hive performance are already possible.

Spark and Shark

While very flexible, the MapReduce has a number of constraints that affect performance. Spark is a newer distributed computing framework that exploits sophisticated in-memory data caching to significantly improve many common data operations, sometimes by multiples of 10x. Shark is a port of Hive to Spark, bringing performance improvements to Hive queries that are comparable to the improvements Impala provides.

Google BigQuery

This is a web service that lets you do interactive analysis of massive datasets—up to billions of rows. It is the first publicly available access to Google's internal Big Data technology stack.