In the first part of this article I looked into the development that brought us Hadoop 2.
Let's now try to analyze whether Hadoop 2 is ready for general
consumption, or whether it's all just business hype at this point. Are you
better off sticking with the old, not-that-energetic grandpa who
nonetheless delivers every time, or going for a ride with the younger fella
who might be a bit "unstable"?
New features
Hadoop 2 introduces a few very important features, such as:
- HDFS High Availability (HA) with the Quorum Journal Manager (QJM). This is what it does:
...In
order for the Standby node to keep its state synchronized with the
Active node in this implementation, both nodes communicate with a group
of separate daemons called JournalNodes…In the event of a fail-over, the
Standby will ensure that it has read all of the edits from the
JournalNodes before promoting itself to the Active state. This ensures
that the namespace state is fully synchronized before a fail-over
occurs.
There's an alternative approach to HDFS HA that requires an external filer (a
NAS or NFS server) to store a copy of the HDFS edit logs. If the primary
NameNode fails, a new one can be brought up and the
network-stored copy of the logs can be used to serve the clients. This
is essentially a less optimal approach than QJM, as it involves more
moving parts and requires more complex DevOps work.
- HDFS federation, which essentially allows multiple namespaces/NameNodes
to be combined into a single logical filesystem. This allows for better
utilization of higher-density storage.
- YARN essentially implements the concept of Infrastructure-As-A-Service. You can deploy your non-MR applications to cluster nodes using YARN resource management and scheduling.
Another advantage is the split of the old JobTracker into two
independent services: resource management and job scheduling. This gives a
certain advantage in the case of a fail-over and in general is a much
cleaner approach to the MapReduce framework implementation. YARN is
API-compatible with MRv1, so you don't need to change your
MR applications: at most, recompile the code and run them on
YARN.
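To make the QJM fail-over rule quoted earlier in this list more concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the `JournalNode` and `NameNode` classes, their methods, and the sample operations are hypothetical and are not Hadoop APIs. It models just two ideas, that an edit counts as written once a majority of JournalNodes has it, and that the Standby reads all edits from a majority before promoting itself. It leaves out everything else the real implementation has to handle (fencing of the old Active, recovery of partially written edits, and so on).

```python
class JournalNode:
    """Toy stand-in for a JournalNode: a map of txid -> operation."""

    def __init__(self):
        self.edits = {}
        self.up = True

    def append(self, txid, op):
        # A node that is down silently misses the edit.
        if self.up:
            self.edits[txid] = op
        return self.up


class NameNode:
    """Toy NameNode that is either Active (writes) or Standby (reads)."""

    def __init__(self, journals):
        self.journals = journals
        self.namespace = []   # operations applied so far, in txid order
        self.last_txid = 0
        self.active = False

    def _is_quorum(self, count):
        return count > len(self.journals) // 2

    def write(self, op):
        if not self.active:
            raise RuntimeError("only the Active node may write edits")
        txid = self.last_txid + 1
        acks = sum(jn.append(txid, op) for jn in self.journals)
        if not self._is_quorum(acks):
            raise RuntimeError("edit not acknowledged by a quorum")
        self.namespace.append(op)
        self.last_txid = txid

    def catch_up(self):
        reachable = [jn for jn in self.journals if jn.up]
        if not self._is_quorum(len(reachable)):
            raise RuntimeError("cannot reach a quorum of JournalNodes")
        merged = {}
        for jn in reachable:
            merged.update(jn.edits)
        # Apply only the edits this node has not seen yet, in order.
        for txid in sorted(t for t in merged if t > self.last_txid):
            self.namespace.append(merged[txid])
            self.last_txid = txid

    def promote(self):
        # Read all outstanding edits first, and only then become Active.
        self.catch_up()
        self.active = True


journals = [JournalNode() for _ in range(3)]
active, standby = NameNode(journals), NameNode(journals)
active.active = True

active.write("mkdir /data")
standby.catch_up()

journals[0].up = False                # one JournalNode misses the next edit
active.write("create /data/part-0")   # still acknowledged by 2 of 3

active.active = False                 # the Active node fails
journals[0].up = True
journals[1].up = False                # a different JournalNode is down now
standby.promote()

print(standby.namespace == active.namespace)
```

Any two majorities of the three JournalNodes overlap in at least one node, which is why the Standby in this sketch still finds the second edit even though a different node is down at fail-over time.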
Improvements
The majority of the optimizations were made on the HDFS side. Just a few examples:
- overall file system read/write improvements: I've seen reports of a
>30% performance increase from 1.x to 2.x with the same workload
- read improvements for collocated DataNodes and clients, HDFS-347 (yet to be added to the 2.0.5 release)
A good overview of the HDFS road map can be
found here
Vendors
Here's how the bets are spread among commercial vendors, with respect to supported production-ready versions:
| Vendor | Hadoop 1.x | Hadoop 2.x |
| Cloudera | x[1] | x |
| Hortonworks | x | - |
| Intel | - | x |
| MapR | x[1] | x |
| Pivotal | - | x |
| Yahoo! | - | x[2] |
| WANdisco | - | x |
The worldview of software stacks
In any platform ecosystem there are always a few layers: they are like onions; onions have layers ;)
- in the center there's a core, e.g. OS kernel
- there are a few inner layers: the system software, drivers, etc.
- and the external layers of the onion... err, the platform -- the user space
applications: your web browser and email client and such
The Hadoop ecosystem isn't that much different from Linux. There's
- the core: Hadoop
- system software: HBase, ZooKeeper, Spring Batch
- user space applications: Pig, Hive, users' analytics applications, ETL, BI tools, etc.
The responsibility for bringing all the pieces of the Linux onion together
lies with Linux distribution vendors: Canonical, Red Hat, SUSE, etc. They
pull certain versions of the kernel, libraries, system and user-space
software into place and release these collections to the users. But
first they make sure everything fits nicely and add some of their secret
sauce on top (think Ubuntu Unity, for example). Kernel maintenance is
not a part of the distribution vendors' daily business, yet they do
submit patches and new features. A set of kernel maintainers is then
responsible for bringing changes into the kernel mainline. Kernel
advancements happen under very strict guidelines. Breaking
compatibility with user space is rewarded by placing the guilty person
straight into the 8th circle of
Inferno.
Hadoop practices a somewhat different philosophy than Linux, though. Hadoop
1.x is considered stable, and only critical bug fixes are getting
incorporated into it (Table2), whereas Hadoop 2.x is moving forward at a
higher pace and most improvements are going there. That comes at a cost
to user-space applications. The situation is supposedly addressed by
labeling Hadoop 2 as 'alpha' for about a year now. On the other hand, such tagging
arguably prevents user feedback from flowing into the development
community. Why? Because users and application developers alike are
generally scared away by the "alpha" label: they'd rather sit and wait
until the magic of stabilization happens. In the meantime, they might
use Hadoop 1.x.
And, unlike the Canonical or Fedora project, there's no open-source integration place for the Hadoop ecosystem. Or is there?
Integration
There are 12+ different components in the Hadoop stack (as represented by the
BigTop project). All of them move at their own pace and, more often
than not, support both versions of Hadoop. This complicates
development and testing and creates a large number of issues for the
integration of these projects. Just think about the variety of library
dependencies and such that might all of a sudden be in conflict or have
bugs (HADOOP-9407 comes to mind). Every component also comes with its own configuration, adding insult to injury on top of all the tweaks in Hadoop itself.
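The kind of dependency clash mentioned above can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the component names, library names, and version numbers are made up and do not describe any real Hadoop release, and `find_conflicts` is a hypothetical helper, not part of BigTop or any other tool. It simply reports libraries that different components of a stack pin to different versions.

```python
from collections import defaultdict


def find_conflicts(stack):
    """Return {library: {version: [components]}} for every library that
    is pinned to more than one version across the stack."""
    seen = defaultdict(lambda: defaultdict(list))
    for component, deps in stack.items():
        for library, version in deps.items():
            seen[library][version].append(component)
    return {
        library: {v: sorted(users) for v, users in versions.items()}
        for library, versions in seen.items()
        if len(versions) > 1
    }


# Hypothetical stack description: all names and versions are made up.
stack = {
    "component-a": {"lib-json": "1.8", "lib-rpc": "2.4"},
    "component-b": {"lib-json": "1.9", "lib-rpc": "2.4"},
    "component-c": {"lib-json": "1.8"},
}

conflicts = find_conflicts(stack)
for library, versions in conflicts.items():
    print(library, versions)
```

With a dozen or more components, each released on its own schedule, a check like this has to be repeated for every combination of versions someone wants to ship, which is a large part of why integration is real work.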
All this brings a lot of issues to the DevOps engineers who need to install,
maintain, and upgrade your average Hadoop cluster. In many cases, they
simply don't have the capacity or knowledge to build and test a new
component of the stack (or a newer version of it) before bringing it to
the production environment. Most smaller companies and
application developers don't have the expertise to build and install
multiple versions from the release source tarballs, or to configure and
performance-tune the installation.
That's where software integration projects like BigTop
come into the spotlight. BigTop was started by Roman Shaposhnik (ASF
Bigtop, Chair PMC) and Konstantin Boudnik (ASF Bigtop, PMC) at the
Yahoo! Hadoop team back in 2009-2010. It was a continuation of earlier
work based on expertise in software integration and OS distributions.
BigTop provides a versatile tool for creating software stacks with
predefined properties, validates the compatibility of integral parts,
and creates native Linux packaging to ease the installation experience.
BigTop includes a set of recipes for Puppet, an industry-standard configuration
management system, that allow you to spin up a Hadoop cluster in about 10
minutes. The cluster can be configured for Kerberized or non-secure
environments. A typical release of BigTop looks like a stack's
bill-of-materials and source code. It lets anyone quickly build and test
a packaged Hadoop cluster with a number of typical system and
user-space components in it. Most of the modern Hadoop distributions are
using BigTop openly or under the hood, making BigTop a de facto
integration spot for all upstream projects.
Conclusions
Here's Milind Bhandarkar (Chief Architect at Pivotal):
As part of HAWQ stress and longevity testing, we tested HDFS 2.0
extensively, and subjected it to the loads it had never seen before. It
passed with flying colors. Of course, we have been testing the new
features in HDFS since 0.22! eBay was the first to test new features in
HDFS 2.0, and I had joined Konstantin Shvachko to declare Hadoop 0.22
stable, when the rest of the community called it crazy. Now they are
realizing that we were right.
YARN is known for very high stability. Arun Murthy, release manager (RM) of all of the 2.0.x-alpha releases and one of the YARN authors,
wrote in the 2.0.3-alpha release email:
# Significant stability at scale for YARN (over 30,000 nodes and 14
million applications so far, at time of release - see here)
And there's this view, which I guess is shared by a number of application developers and users sitting on the sidelines:
I would expect to have a non-alpha semi-stable release of 2.0 by late
June or early July. I am not an expert on this and there are lots of
things that could show up and cause those dates to slip.
In the meantime, six out of seven vendors are using and selling Hadoop
2.x-based versions of storage and data analytics solutions, system
software, and services. Who is right? Why is the "alpha" tag kept on for
so long? Hopefully, now you can make your own informed decision.
References:
[1]: EOLed or effectively getting phased out
[2]: Yahoo! is using Hadoop 0.23.x in production, which essentially is very close to the Hadoop 2.x source base