Monday, April 22, 2013

Hadoop 2: "alpha" elephant or not?

Today I will look into the state of Hadoop 2.x and try to understand what has kept it in the alpha state to date. Is it really an "alpha" elephant? This question keeps popping up on the Internet, in conversations with customers and business partners. Let's start with some facts first.

The first anniversary of the Hadoop 2.0.0-alpha release is around the corner. SHA 989861cd24cf94ca4335ab0f97fd2d699ca18102 was made on May 8th, 2012, marking the first-ever release branch of the Hadoop 2 line (in the interest of full disclosure: the actual release didn't happen until a few days later, May 23rd).[1]

It was a long-awaited event. And sure enough, the market accepted it enthusiastically. The commercial vendor Cloudera announced CDH4.0, its first Hadoop 2.x-based release, at the end of June 2012, according to this statement from Cloudera's VPoP -- just a month after 2.0.0-alpha went live! So, was it solid, fact-based trust in the quality of the code base, or something else? An interesting nuance: MapReduce v1 (MRv1) was brought back despite the presence of YARN (a new resource scheduler and a replacement for the old MapReduce). One of those things that make you go, "Huh...?"

We've just seen the 2.0.4-alpha RC vote close: the fifth release in a row in just under one year. Many great features went in: YARN, HDFS HA, and HDFS performance optimizations, to name a few. An incredible amount of stabilization has been done lately, especially in 2.0.4-alpha. Let's consider some numbers:

Table 1: JIRAs committed to Hadoop between 2.0.0-alpha and 2.0.4-alpha releases
HADOOP     383
HDFS       801
MAPREDUCE  219
YARN       138
That's about 1,500 fixes and features since the beginning. Which was to be expected, considering the scope of implemented changes and the need for smoothing things out.

Let's look for a moment at Hadoop 1.x -- essentially the same old Hadoop 0.20.2xx, per the latest genealogy of elephants -- a well-respected and stable patriarch. Hadoop 1.x had 8 releases altogether in 14 months:
  • 1.0.0 released on Dec 12, 2011
  • 1.1.2 released on Feb 15, 2013
Table 2: JIRAs committed to Hadoop between 1.0.0 and 1.1.2 releases
HADOOP     110
HDFS       111
MAPREDUCE   84
That's about 300 fixes and improvements, or roughly five times fewer than what went into Hadoop 2.x over roughly the same time. If frequency of change is any indication of stability, then perhaps we are onto something.

For the sake of full disclosure, here are the similar statistics for Hadoop 0.23.x. There were 8 "dot" releases between 01 Nov 2011 (0.23.0) and 16 Apr 2013 (0.23.7).
Table 3: JIRAs committed to Hadoop between 0.23.0 and 0.23.7 releases
HADOOP      514
HDFS        687
MAPREDUCE  1240
YARN         92 [2]
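
For readers who want to check the arithmetic, here is a minimal sketch that sums the per-project JIRA counts from the three tables and compares the 2.x and 1.x lines. The dictionary layout and variable names are mine, chosen only for illustration; the numbers are copied from the tables as-is.

```python
# JIRA counts copied from Tables 1-3 above; the structure is illustrative only.
jiras = {
    "2.0.0-alpha..2.0.4-alpha": {"HADOOP": 383, "HDFS": 801, "MAPREDUCE": 219, "YARN": 138},
    "1.0.0..1.1.2": {"HADOOP": 110, "HDFS": 111, "MAPREDUCE": 84},
    "0.23.0..0.23.7": {"HADOOP": 514, "HDFS": 687, "MAPREDUCE": 1240, "YARN": 92},
}

# Total committed JIRAs per release line.
totals = {line: sum(counts.values()) for line, counts in jiras.items()}
for line, total in totals.items():
    print(f"{line}: {total} JIRAs")

# How many times more changes went into 2.x than into 1.x.
ratio = totals["2.0.0-alpha..2.0.4-alpha"] / totals["1.0.0..1.1.2"]
print(f"2.x vs 1.x: {ratio:.1f}x")
```

The totals come out to 1,541 for 2.x, 305 for 1.x, and 2,533 for 0.23.x, which is where the "about 1,500" and "about five times" figures come from.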

"Wow," one might say, "no wonder the 'alpha' tag has been so sticky!" Users definitely want to know if the core platform is turbulent and unstable. But wait... wasn't there that commercial release that happened a month after the first OSS alpha? If it was more stable than the official public alpha, then why did it take the latter another five releases and 1,500 commits to get where it is today? Why wasn't the stabilization simply contributed back to the community? Or, if both were of the same high quality to begin with, then why is the public Hadoop 2.x still wearing the "alpha" tag one year later?

Before moving any further: all 13 releases -- 8 for 1.x and 5 for 2.x -- were managed by engineers from Hortonworks. Tipping my hat to those guys and all contributors to the code!

So, is Hadoop 2 that unstable after all? In the second part of this article I will dig into the technical merits of the new development line so we can decide for ourselves. To be continued...

References:
[1] All release info is available from official ASF Hadoop release page
[2] First appeared in release 0.23.3 
