Moving Hadoop Beyond Batch with Apache YARN ⇢
A post by Arun Murthy on Apache Hadoop 2.0 and YARN. He makes a compelling case for enabling SQL IN Hadoop rather than ON TOP of Hadoop. It is also notable to see Hadoop 2.0 characterized as a ‘multi-application operating system.’
Fast forward to today, and we see that Hadoop’s momentum has continued and many more enterprises (not just web-scale companies) want to store ALL incoming data in Hadoop, and then enable their users to interact with it in a host of different ways: batch processing, interactive queries, analysis of data streams as they arrive, and more. Most importantly, they need to be able to do all of this simultaneously, without any single application or query consuming all of the resources of the cluster.
Nothing illustrates this dynamic more clearly than the current industry noise around SQL on Hadoop. All kinds of vendors are clamoring to provide better SQL access to data stored in Hadoop, and so they should, since SQL is understood by many users. Since Apache Hive has been the de facto SQL interface to Hadoop data for many years, we’ve found most users would like to continue to leverage the power of Hive in support of these additional interactive SQL use cases.
But building SQL access on top of Hadoop just highlights the challenge of Hadoop being a single-application system. When I run a SQL query on that data, it could consume all the resources of the cluster and cause performance issues for the other applications and jobs running in the cluster, which is not a good outcome, to say the least.
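To make the contention problem concrete, here is a toy sketch in Python. It is not YARN's scheduler or any Hadoop API; the function names, the application names, and the 50% cap are all assumptions chosen for illustration. It compares a cluster that hands out resources first-come, first-served with one that caps the share any single application may hold.

```python
def allocate_greedy(capacity, requests):
    """First-come allocation: each app takes whatever it asks for, in order."""
    remaining = capacity
    granted = {}
    for app, wanted in requests:
        granted[app] = min(wanted, remaining)
        remaining -= granted[app]
    return granted


def allocate_capped(capacity, requests, max_share):
    """Same ordering, but no app may hold more than max_share of the cluster."""
    cap = int(capacity * max_share)  # hypothetical per-application limit
    remaining = capacity
    granted = {}
    for app, wanted in requests:
        granted[app] = min(wanted, cap, remaining)
        remaining -= granted[app]
    return granted


# Hypothetical workload: resource units requested by each application.
requests = [("sql_query", 100), ("batch_job", 40), ("stream_job", 20)]

greedy = allocate_greedy(100, requests)
capped = allocate_capped(100, requests, 0.5)

print(greedy)  # the SQL query takes the whole cluster
print(capped)  # the other jobs still receive resources
```

In the uncapped case the SQL query, which arrives first, takes all 100 units and the other two jobs get nothing. With the cap, it is held to 50 units and the remaining capacity is left for the batch and streaming jobs.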