At Think Big Analytics we’re focused on helping customers create measurable value from Big Data. Fast access to that data is becoming mandatory as enterprises move beyond the first phase of adoption of Big Data. Over the last few months a number of projects and products have been announced to address this need, both within Hadoop and in related architectures: Impala, Hadapt, Shark, Splice Machine, Drill, Redshift, Platfora, Instaview, and now Stinger to improve performance in core Hive.
The race is on. With these technologies all driving faster queries for Hadoop clusters, the winner is our customers. I view this as an example of healthy competition in the open source ecosystem, where the community is investing in a number of different technologies. It’s also worth noting that while there’s competition in the approaches, the open source contributions to projects like Hive, MapReduce, and YARN are likely to actually benefit all the open source Hadoop ecosystem-based technologies like Hive and Impala. I believe that open source efforts will gain traction and participation and ultimately become the dominant approach.
Getting Hive to run queries in under 30 seconds will be welcome progress. We’ve heard that future enhancements to Hive that leverage YARN could further reduce minimum query times to 10 seconds. Even given such 10-30 second queries, we believe Impala will continue to lead the way in faster performance, allowing for subsecond queries. The net effect is that there will be dramatic improvements in query technologies for Hadoop and Big Data. I think core Hive will get a lot better thanks to these initiatives. Stinger will spur investment in new approaches and lead tools like Impala to become more full featured more quickly, with scalable processing times and User Defined Functions. Customers will have more choices and better options. I predict Hive will remain a workhorse for production jobs and large scale processing but that Impala will become the go-to tool for fast analytics. I think it will allow most of the important analytic queries to run 10x faster. But if you ever need more horsepower, Hive will be right there, ready to run on the same cluster.
Why is this becoming so important? In the first phase of adoption, Scalability and Cost Containment, archival and ETL are dominant use cases and just being able to run queries at all is a killer app. But as organizations mature and tackle the more advanced stages of agile analytics and business innovation, it’s critical to have quick access to data of all sizes. Data analysts want to be able to run queries in a second when exploring information. Data scientists want to do that and to investigate complex data sets, finding outliers in a few seconds.
- Big Data Adoption Stages