Thursday, April 25, 2013

Hadoop 2: "alpha" elephant or not? Part 2: features and opinions

In the first part of this article I looked into the development that brought us Hadoop 2. Let's now try to analyze whether Hadoop 2 is ready for general consumption, or if it’s all just a business hype at this point. Are you better off sticking to the old, not-that-energetic grandpa who, nonetheless, delivers every time or going riding with the younger fella who might be a bit "unstable"?

New features

Hadoop 2 introduces a few very important features such as
  • HDFS High Availability (HA) with the Quorum Journal Manager (QJM). This is what it does:
    ...In order for the Standby node to keep its state synchronized with the Active node in this implementation, both nodes communicate with a group of separate daemons called JournalNodes…In the event of a fail-over, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a fail-over occurs.
    There's an alternative approach to HDFS HA that requires an external filer (a NAS or NFS server) to store a copy of the HDFS edit logs. If the primary NameNode fails, a new one can be brought up and the network-stored copy of the logs can be used to serve the clients. This is essentially a less optimal approach than QJM, as it involves more moving parts and requires more complex DevOps.

  • HDFS federation, which essentially allows combining multiple namespaces/NameNodes into a single logical filesystem. This allows for better utilization of higher-density storage.

  • YARN essentially implements the concept of Infrastructure-As-A-Service. You can deploy your non-MR applications to cluster nodes using YARN resource management and scheduling.

    Another advantage is the split of the old JobTracker into two independent services: resource management and job scheduling. It gives a certain advantage in the case of a fail-over and is, in general, a much cleaner approach to implementing the MapReduce framework. YARN is API-compatible with MRv1, so you don't need to change your MR applications, apart from perhaps recompiling the code. Just run them on YARN.

Improvements

The majority of the optimizations were made on the HDFS side. Just a few examples:
  • overall file system read/write improvements: I've seen reports of a >30% performance increase from 1.x to 2.x with the same workload
  • read improvements when the DataNode (DN) and the client are collocated: HDFS-347 (yet to be added to the 2.0.5 release)
A good overview of the HDFS road map can be found here

Vendors

Here's how the bets are spread among commercial vendors, with respect to supported production-ready versions:
Vendor        Hadoop 1.x   Hadoop 2.x
Cloudera      x[1]         x
Hortonworks   x            -
Intel         -            x
MapR          x[1]         x
Pivotal       -            x
Yahoo!        -            x[2]
WANdisco      -            x

The worldview of software stacks

In any platform ecosystem there are always a few layers: they are like onions; onions have layers ;)
  • in the center there's a core, e.g. OS kernel
  • there are a few inner layers: the system software, drivers, etc.
  • and the external layers of the onion... err, the platform -- the user space applications: your web browser and email client and such
The Hadoop ecosystem isn't that much different from Linux. There's
  • the core: Hadoop
  • system software: HBase, ZooKeeper, Spring Batch
  • user space applications: Pig, Hive, users' analytics applications, ETL, BI tools, etc.
The responsibility for bringing all the pieces of the Linux onion together lies with Linux distribution vendors: Canonical, Red Hat, SUSE, etc. They pull certain versions of the kernel, libraries, system and user-space software into place and release these collections to the users. But first they make sure everything fits nicely and add some of their secret sauce on top (think Ubuntu Unity, for example). Kernel maintenance is not a part of distribution vendors’ daily business. Yet they do submit patches and new features. A set of kernel maintainers is then responsible for bringing changes into the kernel mainline. Kernel advancements happen under very strict guidelines. Breaking compatibility with user-space is rewarded by placing the guilty person straight into the 8th circle of Inferno.

Hadoop practices a somewhat different philosophy than Linux, though. Hadoop 1.x is considered stable, and only critical bug fixes are getting incorporated into it (see Table2 in Part 1). Hadoop 2.x, on the other hand, is moving forward at a higher pace, and most improvements are going there. That comes at a cost to user-space applications. The situation is supposedly addressed by labeling Hadoop 2 as 'alpha' for about a year now. However, such tagging arguably prevents user feedback from flowing into the development community. Why? Because users and application developers alike are generally scared away by the "alpha" label: they'd rather sit and wait until the magic of stabilization happens. In the meantime, they might use Hadoop 1.x.

And, unlike the Canonical or Fedora project, there's no open-source integration place for the Hadoop ecosystem. Or is there?

Integration

There are 12+ different components in the Hadoop stack (as represented by the BigTop project). All of them are moving at their own pace and, more often than not, support both versions of Hadoop. This complicates development and testing. It creates a large number of issues for the integration of these projects. Just think about the variety of library dependencies and such that might all of a sudden be in conflict or have bugs (HADOOP-9407 comes to mind). Every component also comes with its own configuration, adding insult to injury for all the tweaks in Hadoop.

All this brings a lot of issues to the DevOps who need to install, maintain, and upgrade your average Hadoop cluster. In many cases, DevOps simply don't have the capacity or knowledge to build and test a new component of the stack (or a newer version of it) before bringing it to the production environment. Most of the smaller companies and application developers don't have the expertise to build and install multiple versions from the release source tarballs, and then configure and performance-tune the installation.

That's where software integration projects like BigTop come into the spotlight. BigTop was started by Roman Shaposhnik (ASF Bigtop, Chair PMC) and Konstantin Boudnik (ASF Bigtop, PMC) at the Yahoo! Hadoop team back in 2009-2010. It was a continuation of earlier work based on expertise in software integration and OS distributions. BigTop provides a versatile tool for creating software stacks with predefined properties, validates the compatibility of integral parts, and creates native Linux packaging to ease the installation experience.

BigTop includes a set of recipes for Puppet -- an industry-standard configuration management system -- that make it possible to spin up a Hadoop cluster in about 10 minutes. The cluster can be configured for Kerberized or non-secure environments. A typical release of BigTop looks like a stack's bill-of-materials and source code. It lets anyone quickly build and test a packaged Hadoop cluster with a number of typical system and user-space components in it. Most of the modern Hadoop distributions are using BigTop openly or under the hood, making BigTop a de facto integration spot for all upstream projects.

Conclusions

Here's Milind Bhandarkar (Chief Architect at Pivotal):
As part of HAWQ stress and longevity testing, we tested HDFS 2.0 extensively, and subjected it to the loads it had never seen before. It passed with flying colors. Of course, we have been testing the new features in HDFS since 0.22! EBay was the first to test new features in HDFS 2.0, and I had joined Konstantin Schvachko to declare Hadoop 0.22 stable, when the rest of the community called it crazy. Now they are realizing that we were right.
YARN is known for very high stability. Arun Murthy - RM of all of 2.0.x-alpha releases and one of the YARN authors - in the 2.0.3-alpha release email:
# Significant stability at scale for YARN (over 30,000 nodes and 14 million applications so far, at time of release - see here)
And there's this view that I guess is shared by a number of application developers and users sitting on the sidelines:
I would expect to have a non-alpha semi-stable release of 2.0 by late June or early July.  I am not an expert on this and there are lots of things that could show up and cause those dates to slip.
In the meantime, six out of seven vendors are using and selling Hadoop 2.x-based versions of storage and data analytics solutions, system software, and services. Who is right? Why is the "alpha" tag kept on for so long? Hopefully, now you can make your own informed decision.

References:

[1]: EOLed or effectively getting phased out
[2]: Yahoo! is using Hadoop 0.23.x in production, which essentially is very close to the Hadoop 2.x source base

Monday, April 22, 2013

Hadoop 2: "alpha" elephant or not?

Today I will look into the state of Hadoop 2.x and try to understand what has kept it in the alpha state to date. Is it really an "alpha" elephant? This question keeps popping up on the Internet, in conversations with customers and business partners. Let's start with some facts first.

The first anniversary of the Hadoop 2.0.0-alpha release is around the corner. SHA 989861cd24cf94ca4335ab0f97fd2d699ca18102 was made on May 8th, 2012, marking the first-ever release branch of the Hadoop 2 line (in the interest of full disclosure: the actual release didn't happen until a few days later, May 23rd).[1]

It was a long-awaited event. And sure enough, the market accepted it enthusiastically. The commercial vendor Cloudera announced its first Hadoop 2.x-based CDH4.0 at the end of June 2012, according to this statement from Cloudera's VPoP -- just a month after 2.0.0-alpha went live! So, was it solid, fact-based trust in the quality of the code base, or something else? An interesting nuance: MapReduce v1 (MRv1) was brought back despite the presence of YARN (a new resource scheduler and a replacement for the old MapReduce). One of those things that make you go, "Huh...?"

We've just seen the 2.0.4-alpha RC vote getting closed: the fifth release in a row in just under one year. Many great features went in: YARN; HDFS HA; HDFS performance optimizations, to name a few. An incredible amount of stabilization has been done lately, especially in 2.0.4-alpha. Let's consider some numbers:

Table1: JIRAs committed to Hadoop between 2.0.0-alpha and 2.0.4-alpha releases
HADOOP      383
HDFS        801
MAPREDUCE   219
YARN        138
That's about 1,500 fixes and features since the beginning. Which was to be expected, considering the scope of implemented changes and the need for smoothing things out.

Let's for a moment look into Hadoop 1.x -- essentially the same old Hadoop 0.20.2xx, per the latest genealogy of elephants -- a well-respected and stable patriarch. Hadoop 1.x had 8 releases altogether in 14 months:
  • 1.0.0 released on Dec 12, 2011
  • 1.1.2 released on Feb 15, 2013
Table2: JIRAs committed to Hadoop between 1.0.0 and 1.1.2 releases
HADOOP      110
HDFS        111
MAPREDUCE   84
That's about five times fewer fixes and improvements than what went into Hadoop 2.x over roughly the same time. If frequency of change is any indication of stability, then perhaps we are onto something.
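
The "about 1,500" and "about five times" figures are easy to double-check. The snippet below is only a sanity check of the arithmetic: the dictionaries restate Table1 and Table2, and the variable names are made up for illustration.

```python
# JIRA counts restated from Table1 (2.0.0-alpha .. 2.0.4-alpha)
# and Table2 (1.0.0 .. 1.1.2). Names are illustrative only.
hadoop2_jiras = {"HADOOP": 383, "HDFS": 801, "MAPREDUCE": 219, "YARN": 138}
hadoop1_jiras = {"HADOOP": 110, "HDFS": 111, "MAPREDUCE": 84}

total_2x = sum(hadoop2_jiras.values())
total_1x = sum(hadoop1_jiras.values())

# How many times more changes landed in the 2.x line than in the 1.x line.
ratio = total_2x / total_1x

print(f"Hadoop 2.x: {total_2x} JIRAs")
print(f"Hadoop 1.x: {total_1x} JIRAs")
print(f"2.x / 1.x ratio: {ratio:.1f}")
```

The totals come out to 1541 and 305, which is where the factor of roughly five comes from.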

For the sake of full disclosure, here are the similar statistics for Hadoop 0.23.x. There were 8 "dot"-releases between 01 Nov 2011 (0.23.0) and 16 Apr 2013 (0.23.7).
Table3: JIRAs committed to Hadoop between 0.23.0 and 0.23.7 releases
HADOOP      514
HDFS        687
MAPREDUCE   1240
YARN        92 [2]

"Wow," one might say, "no wonder the 'alpha' tag has been so sticky!" Users definitely want to know if the core platform is turbulent and unstable. But wait... wasn't there that commercial release that happened a month after the first OSS alpha? If it was more stable than the official public alpha, then why did it take the latter another five releases and 1,500 commits to get where it is today? Why wasn't the stabilization simply contributed back to the community? Or, if both were of the same high quality to begin with, then why is the public Hadoop 2.x still wearing the "alpha" tag one year later?

Before moving any further: all 13 releases -- for 1.x and 2.x --  were managed by engineers from Hortonworks. Tipping my hat to those guys and all contributors to the code!

So, is Hadoop 2 that unstable after all? In the second part of this article I will dig into the technical merits of the new development line so we can decide for ourselves. To be continued...

References:
[1] All release info is available from official ASF Hadoop release page
[2] First appeared in release 0.23.3 

Friday, April 19, 2013

On coming fragmentation of Hadoop platform


I just read this interview with the CEO of Hortonworks in which he expresses a fear about Hadoop fragmentation. He calls attention to a valid issue in the Hadoop ecosystem: forking is getting to the point where the product space is likely to get fragmented.


So why should the BigTop community bother? Well, for one, Hadoop is the core upstream component of the BigTop stack. By filling this unique position, it has a profound effect on downstream consumers such as HBase, Oozie, etc. Although projects like Hive and Pig can partially avoid potential harm by statically linking with Hadoop binaries, this isn't a solution for any sane integration approach. As a side note: I am especially thrilled by Hive's way of working around multiple incompatibilities in the MR job submission protocol. The protocol has been naturally evolving for quite some time, and no one could even have guaranteed compatibility in versions like 0.19 or 0.20. Anyway, Hive solved the problem by simply generating a job jar, constructing a launch string and then - you got it already, right? - System.exec()'ing the whole thing. On a separate JVM, that is! Don't believe me? Go check the source code yourself.
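
To make that trick concrete, here is a minimal sketch of the same pattern: assemble a launch command and run it in a separate process instead of calling the job-submission API in-process. It is written in Python rather than Java and is not Hive's code; `build_launch_command`, `submit_in_separate_process`, and the toy "job" are hypothetical names invented purely for illustration.

```python
import subprocess
import sys


def build_launch_command(job_name):
    """Build the command line for a child process.

    Stand-in for constructing something like a 'hadoop jar job.jar ...' string.
    """
    # Hypothetical toy job: the child just reports which job it was given.
    child_code = "import sys; print('submitted ' + sys.argv[1])"
    return [sys.executable, "-c", child_code, job_name]


def submit_in_separate_process(job_name):
    """Run the job in its own process and return what it printed.

    The parent only depends on the command line and the child's output,
    not on any in-process submission API.
    """
    result = subprocess.run(
        build_launch_command(job_name),
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with a non-zero status
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(submit_in_separate_process("wordcount"))
```

The parent and the child only have to agree on a command line, which is why this sidesteps protocol incompatibilities; the price is a process launch per job and a very coarse interface between the two sides.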


Anecdotal evidence aside, there's a real threat of fracturing the platform. And there's no good reason for doing so even if you're incredibly selfish, or stupid, or want to monopolize the market. Which, by the way, doesn't work for objective reasons even with so-called "IP protection" laws in place. But that's a topic for another day.


So, what's HortonWorks’ answer to the problem? Here it comes:

Amid current Hadoop developments---is there any company NOT launching a distribution with some value added software?---Hortonworks stands out. Why? Hortonworks turns over its entire distribution to the Apache open source project.
While it is absolutely necessary for any human endeavor to be collaborative in order to succeed, the open source niche might be a tricky one. There are literally no incentives for all players to play by the book, and there's always that one very bold guy who might say, "Screw you guys, I’m going home," because he is just... you know...


Where could these incentives come from? How can we be sure that every new release is satisfactory for everyone's consumption? How do we guarantee that HBase’s St.Ack and friends won't be spending their next weekend trying to fix HBase when it loses its marbles because of that tricky change in Hadoop’s behavior?


And here comes a hint of an answer:

We're building directly in the core trunk, productizing the package, doing QA and releasing.

I have a couple of issues with this statement. But first, a spoiler alert: I am not going to attack either Hortonworks or their CEO. I don't have a chip on my shoulder -- not even an ARM one. I am trying to demonstrate the fallacy in the logic and show what doesn't work and why. And now here's the laundry list:

  • "building directly in the core trunk": Hadoop isn't released from the trunk. This is a headache. And this is one of the issues that the BigTop community faced during the most recent stabilization exercise for the Hadoop 2.0.4-alpha release. Why's that a problem? Well, for one, there's a policy that "everything should go through the trunk". It means -- in the context of Hadoop’s current state -- that you have to first commit to the trunk, then back-port to branch-2, which is supposed to be the landing ground for all Hadoop 2.x releases, just like branch-1 is the landing ground for all Hadoop 1.x releases. If one or more releases happen to be active at the moment, one would also need to back-port the commit to the corresponding release branches, such as 2.0.4-alpha in this particular example. Mutatis mutandis, some of the changes make it only about 2/3 of the way down. Best-case scenario. This approach also gives fertile ground to all "proponents" of open-source Hadoop because once their patches are committed to the trunk, they are as open-source as the next guy. They might get released in a couple of years, but hey -- what's a few months between friends, right?
  • "productizing the package": is Mr. Bearden aware of when development artifacts for an ongoing Hadoop release were last published in the open? ‘Cause I don't know of a publication of any such thing to date. Neither does Google, by the way. Even the official source tarballs weren't available until, like, 3 weeks ago. Why does that constitute a problem? How do you expect to perform any reasonable integration validation if you don't have an official snapshot of the platform? Once your platform package is "productized", it is a day late to pull your hair out. If you happen to find some issues -- come back later. At the next release, perhaps?
  • "doing QA and releasing": we are trying to build an open-source community here, right? Meaning that the code, the tests and their results, the bug reports, and the discussions should be in the open. The only place where the Hadoop ecosystem is being tested at any reasonable length and depth is BigTop. Read here for yourself. And feel free to check the regular builds and test runs for _all_ the components that BigTop releases for both secured and non-secured configurations. What are you testing with and how, Mr. Bearden?

So, what was the solution? Did I miss it in the article? I don't think so. Because a single player -- even one as respected as HortonWorks -- can't solve the issue in question without ensuring that anything produced by the Hadoop project's developers is always in line with the expectations of downstream players.


That's how you prevent fracturing: by putting in the open a solid and well-integrated reference implementation of the stack - one that can be installed by anyone using open-standard packaging and loaded with third-party applications without tweaking them every time you go from Cloudera's cluster to MapR's. Or another pair of vendors’. Does it sound like I am against making money in open-source software? Not at all: most people in the OSS community do this on the dime of their employers or as part of their own business.


You can consider BigTop's role in the Hadoop-centric environment to be similar to that of Debian in the Linux kernel/distribution ecosystem. By helping to close the gap between the applications and the fast-moving core of the stack, BigTop essentially brings reassurance of the Hadoop 2.x line's stability into the user space and community. BigTop helps to make sure that vendor products are compatible with each other and with the rest of the world, to avoid vendor lock-in, and to guarantee that recent Microsoft stories will not be replayed all over again.


Are there means to achieve the goal of keeping the core contained? Certainly! BigTop does just that. Recent announcements from Intel, Pivotal, and WANdisco are living proof of it: they are all using BigTop as the integration framework and consolidation point. Can these vendors deviate even under such a top-level integration system? Sure. But this will be immensely harder to do.

Acceptance of open Hadoop stack: role of BigTop

I have just posted this article on the ASF blog roller, elaborating on why BigTop is becoming a centerpiece of integration for the Hadoop-based data analytics stack. Enjoy.

Thursday, April 18, 2013

Dealing with noisy fan of Lenovo ThinkPad T430

I recently got my shiny new ThinkPad T430 beefed up with 16GB of RAM, a 180GB SSD, a 1TB extra disk and many other good things. I am really enjoying it and can run a multinode virtual Hadoop cluster while developing something in my favorite IntelliJ IDEA. Did I mention already that I feel sorry for Apple PowerBook users?

Anyway, I was unlucky enough to get this particular machine with a faulty fan. It has been widely reported that, due to some QC issues at Lenovo or whatnot, a number of the laptops (T420/T430 alike) come with a fan that has an especially maddening pitch at around 3200-3400 rpm. It is one of those sounds that drill into your skull and drive you nuts within about 17 seconds.

As if that weren't bad enough by itself, the BIOS sets the fan to that speed whenever the temperature is anywhere above 36 degrees C or so. If you have the laptop on your lap - which is where you would expect to have it, judging by the name - you're doomed.

There are lots of complaints from people about this, and attempts to solve it with clumsy scripts written in Python, etc. However, the working solution was under my nose pretty much all that time, waiting for me on the Gentoo Wiki.

It also works on Ubuntu 12.04 like a charm. The only modification you need to be aware of on Ubuntu is how to add thinkfan to your machine's startup sequence. You'd need to run
sudo update-rc.d thinkfan defaults
instead of what the page above suggests. It is so quiet here again!

Update: in all fairness, I eventually had to replace the fan, because it was still too noisy at the above speeds, although it didn't engage too often. Now it is completely silent ;)
