Thursday, February 9, 2012

Updated version of the elephants' genealogy

The release manager of Hadoop 0.22 and my namesake Konstantin pointed out that my diagram has alignment problems. So I have posted an updated version in the same post. Enjoy.

Wednesday, February 8, 2012

I wish I could draw like Scott Adams

Because this is a complete Dilbert strip. Please, Mr. Adams - make the next one like this? :)

I swear - this was the weirdest chat in my life (grammar and all that are original):

Me: Good morning ;)
A Person: no need to wink
AP: I don't like it
AP: not professional
Me: Am I winking?
Me: Interesting...
Me: sorry for the non-professional offense
AP: You emoitcons
Me: ah… I don't have any icons actually. It just looks like semicolon
followed by a parenthesis. Didn't know it is unprofessional. It
usually is in IT industry. In fact, it came from IT 
Me: Sorry, if you've found it weird or something. 
AP: just unprofessional.  Leave it at that.                          

Duh ;(

Saturday, January 14, 2012

What you wanted to know about Hadoop, but were too afraid to ask: genealogy of elephants.

Hadoop is taking center stage in discussions about processing large amounts of unstructured data.

With the rising popularity of the system, I have found that people are really puzzled by the multiplicity of Hadoop versions; the small yet annoying differences introduced by different vendors; the frustration when vendors try to lock in their customers using readily available open source data analytics components on top of Hadoop; and on and on.

So, after explaining who was born from whom for the 3rd time - and I tell you, drawing neat pictures on a napkin in a coffee shop isn't my favorite activity - I put together the little diagram below. Click on it to inspect it in greater detail. A warning: the diagram only includes the more or less significant releases of Hadoop and Hadoop-derived systems available today. I don't want to waste any time on obscure releases or branches which have never been accepted at any significant level. The only exception is 0.21, which was a natural continuation of 0.20 and the predecessor of the recently released 0.22.




Some explanations for the diagram:
  • Green rectangles designate official Apache Hadoop releases openly available for anyone in the world for free
  • Black ovals show Hadoop branches that are not yet officially released by Apache Hadoop (or might never be released). However, they are usually available in the form of source code or tarball artifacts
  • Red ovals are for commercial Hadoop derivatives which might be based on Hadoop or use Hadoop as a part of custom systems (as in the case of MapR). These derivatives may or may not be compatible with Hadoop and the Hadoop data processing stack.
Once you're presented with a view like this, it becomes clear that there are two centers of gravity in today's universe of elephants: 0.20.2-based releases and derivatives; and 0.22-based branches, future releases, and derivatives. It also becomes quite clear which ones are likely to be sucked into a black hole.

The transition from 0.20+ to 0.2[1,2] was really critical because it introduced true HDFS append, fault injection, and code injection for system testing. Yet 0.21 wasn't released for a long time, which left a vacuum in a high-demand environment. Even after it did come out, it didn't get any traction in the community. Meanwhile, HDFS append was critical for HBase to move forward, so 0.20.2-append was created to support that effort. A quite similar story happened with 0.22: two different release managers tried to get it out; the first gave up, but the second actually succeeded in pulling part of the community's effort towards it.

As you can see, HDFS append wasn't available in an official Apache Hadoop release for some time (except for 0.21, with the earlier disclaimer). Eventually it was merged into 0.20.205 (recently dubbed Hadoop 1.0), which allows HBase to be nicely integrated with official Apache Hadoop without any custom patching.

The release of 0.20.203 was quite significant because it provided heavily tested Hadoop security, developed by the Yahoo! Hadoop development team (known as Hortonworks nowadays). Bits and pieces of 0.20.203 - even before the official release - were absorbed by at least one commercial vendor to add corporate-grade Kerberos security to their derivative of Hadoop (as in the case of Cloudera CDH3).

The diagram above clearly shows a few important gaps in the rest of the commercial offerings:
  1. none of them supports Kerberos security (EMC, IBM, and MapR)
  2. unavailability of HBase due to the lack of HDFS append in their systems (EMC, IBM). In the case of MapR, you end up using a custom HBase distributed by MapR. I don't want to speculate about the latter in this article.
Apparently, the vacuum of significant releases between 0.20 and 0.22 became a major push for the Hadoop PMC, and now - just days after the release of 1.0 - 0.22 is out, with 0.23 already going through the release process, championed by the Hortonworks team. That release brings in some interesting innovations like Federation and MapReduce 2.0.

Once the current alpha of 0.23 (which might become Hadoop 2.0 or even Hadoop 3.0) is ready for the final release, I would expect new versions of commercial distributions to spring to life, as was the case before. At that point I will update the diagram :)

If you can imagine the variety of other animals, such as Pig and Hive, piling on top of Hadoop, you will be astonished by the complexity of inter-component relations and, more importantly, by the intricacies of building a stable data processing stack. This is why another Apache project, BigTop, has been so important and popular ever since it sprang to life last year. You can read about Bigtop here or here.

Tuesday, January 3, 2012

Can SEI really teach you how to be a Hadoop contributor?

Or to anything else, for that matter?

I am kidding you not... I just got this email from SEI. In the interest of full disclosure - here it is:
To the attention of: <me>

The Software Engineering Institute (SEI) has been asked to conduct a sample survey of committers to the Hadoop Distributed File System. The results will be used to supplement existing documentation that can be used in providing guidance to HDFS contributors as well as support committers in preparing their own HDFS contributions.

You are part of a carefully chosen sample of HDFS committers for the survey. So your participation is necessary for the results to be accurate and useful. Answering all of the questions should take about 15 or 20 minutes. Any information that could identify you or your organization will be held in strict confidence by the SEI under promise of non disclosure.

You will find your personalized form on the World Wide Web at https://feedback.sei.cmu.edu/Hadoop_HDFS_2.asp?id=C8288. Please be sure to complete it at your earliest convenience -- right now if you can make the time. You may save your work at any time, and you may return to complete your form over more than one session if necessary for any reason. Everything will be encrypted for secure transfer and storage.
<....>
Now, let's follow the link and dig out some pearls which, I am sure, have to be in the work of such a venerable organization. What are they covering exactly?
  • Reducing unnecessary dependencies and propagation, e.g., identifying cyclic dependencies between classes in the source code 
  • Difficulty in managing data
  • Difficulty in managing namespaces
  • Identifying location of bugs
  • difficulty finding test suites
  • Communication between application
  • Reducing unnecessary dependencies and propagation
  • yada-yada-yada
Ah, I think I got the picture.... boring... the 1534th study in a row on how to write effective code. A few things I like in particular:
  • "You are part of a carefully chosen sample of HDFS committers" - no shit, there are plenty to select from, of course.
  • "Are you familiar with the (HDFS) Architectural Documentation at http://kazman.shidler.hawaii.edu/ArchDoc.html" - what? hawaii.edu? Are you kidding me? How did the architectural docs for an ASF project end up there? Has the design come from Hawaii? Or could you not find it where the project belongs - on the Apache site?
Here's the news, my dear doctors from SEI: just try to sit and write the code, learn from others; grok the best gems written by bright practitioners. That's pretty much what it takes - one doesn't need anything like CMMI in order to create great software. I will allow myself an even stronger assertion: one needs processes in place to make a bunch of ineffective and inexperienced folks produce something useless that can later be sold to an idiot customer with a lifetime of support fees attached.

Meanwhile, the reality is that today you see a ratio of three software "managers" graduated by US universities for every decent developer who doesn't need help on day one to find his own butt with both hands, a GPS navigator, and a flashlight.

The main reason open source software is thriving today and constantly kicking the ass of companies with established processes is that people aren't afraid to fail or to experiment on their own dime and time. In other words, they don't give a shit about CMU teaching them how to write great code - they just learn it in the field and then do what it takes by learning from others. You don't need formal training for that, clearly. Perhaps Khan Academy is what is really needed.

You know that old saying "If you can't do a job - go to management; if you can't manage then teach". I would amend it with "...; if you can't teach - go to research of software processes".

Although, I won't be totally surprised to see some fat-ass book on how to contribute to Hadoop coming out of CMU very soon. And it might even become a best seller on Amazon or something. But I know for sure that by that time the OSS community will be far along on making the next great thing!

And some other day I shall tell the story of that grad student from Berkeley who was all set to write the greatest benchmarking "solution" for Hadoop - that deserves a separate post, because the guy was learning from CMMI, I guess.

Am I too acidic today? Must be this damn sunny California weather or something.

Tuesday, December 27, 2011

New blog for Apache BigTop!

We've just kicked off a new blog for the BigTop project - Apache Hadoop data stack creation and validation.

Surprisingly, it got started with a post on BigTop history

The blog is available from the ASF blog roller at https://blogs.apache.org/bigtop/ - bookmark it!

Sunday, December 11, 2011

Conception and validation of Hadoop BigData stack: setting the record straight.

With more and more people jumping on the big data bandwagon, it is very reassuring to see that Hadoop is gaining momentum by the day.

Even more fascinating is to see how the idea of putting together a bunch of service components on top of Hadoop proper is getting more and more momentum. IT and software development professionals are getting a better understanding of the benefits that a flexible set of loosely coupled yet compatible components provides when one needs to customize a data processing solution at scale.

The biggest problem for most businesses trying to add Hadoop infrastructure to their existing IT is a lack of knowledge, professional support, and/or a clear understanding of what's out there on the market to help you. Essentially, Hadoop exists in one incarnation - the open-source project under the umbrella of the Apache Software Foundation (ASF). This is where all the innovations in Hadoop are coming from. And essentially this is the source of profit for a few commercial offerings today.

What's wrong with the picture, you might ask? Well, the issue with most of these "commercial offerings" is twofold. They are either immature and based on sometimes unfinished or unreleased Hadoop code, or they provide no significant added value compared to Hadoop proper, available in source form from hadoop.apache.org. And no matter whether either of the above (or both of them together) applies to a commercial solution based on Hadoop, you can be sure of one thing: these solutions will cost you tons of money - as much as $1k/node/year in some cases - for what is essentially available for free.

"What about neat packages I can get from a commercial provider and perhaps some training too?" one might ask. Well, yeah, if you are willing to pay top dollar per node to get, say, something like this fixed, or to learn how to install packages on a virtual machine - go ahead by all means.

However, keep in mind that you can always get a set of packages for Hadoop produced by another open source project called Bigtop, hosted by Apache. What you essentially get are packages for your Linux distro, which can be easily installed on your cluster's nodes. A great benefit is that you can easily trim your Hadoop stack to include only what you need: Hadoop + Hive, or perhaps Hadoop + HBase (which will automatically pick up ZooKeeper for you).

At any rate, the best part of the story isn't a set of packages that can be installed: after all, this is what packages are usually created for, right? The problem with packages or other forms of component distribution is that you don't know in advance whether A-package will work nicely with B-package v.1.2 unless someone has tested this assumption before. Even then, the testing environment might be significantly different from your production environment, and then all bets are off. Unless - again - you're willing to pay through the nose to someone who's willing to do it for you. And that's where the true miracle of something like BigTop comes to the rescue.
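To make the "pick only what you need" point concrete, here is a minimal sketch of how a requested component pulls in its dependencies. This is not how Bigtop or your distro's package manager is implemented: the `DEPENDS` map is a hypothetical, heavily simplified stand-in for real package metadata, and `resolve_stack` is a name made up for illustration.

```python
# Hypothetical, simplified dependency map (an assumption for illustration);
# real package metadata carries versions, conflicts, and much more.
DEPENDS = {
    "hadoop": [],
    "zookeeper": [],
    "hive": ["hadoop"],
    "hbase": ["hadoop", "zookeeper"],
    "pig": ["hadoop"],
}


def resolve_stack(wanted):
    """Return the requested components plus everything they pull in,
    ordered so that dependencies come before their dependents."""
    ordered = []
    seen = set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in DEPENDS[name]:
            visit(dep)
        ordered.append(name)  # appended only after all of its dependencies

    for name in wanted:
        visit(name)
    return ordered


if __name__ == "__main__":
    # Asking for HBase alone brings Hadoop and ZooKeeper along.
    print(resolve_stack(["hbase"]))
    # Asking for Hive leaves ZooKeeper out of the stack.
    print(resolve_stack(["hive"]))
```

Asking for HBase alone yields Hadoop and ZooKeeper as well, while a Hadoop + Hive stack never installs ZooKeeper at all.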

Before I explain more, I want to step back a bit and take a look at some recent history. A couple of years ago Yahoo's Hadoop development team had to address the issue of putting together a working and well-validated Hadoop stack that included a number of components developed by different engineering organizations, each with its own development schedule and integration criteria. The main integration point for all of the pieces was the operations team, which was in charge of a large number of cluster deployments, provisioning, and support. Without their own QA staff, they were oftentimes at the mercy of the assumed code or configuration quality coming from all corners of the company. Worse yet, even if all these components were of high quality, there were no guarantees that they would work together as expected once put together on a cluster. And indeed, integration problems were many.

That's where a small team of engineers, including yours truly, put together a prototype of a system called FIT (Final Integration Testing). The system essentially allowed you to pick a packaged component you wanted to validate against your cluster environment and perform the deployment, configuration, and testing with integration scenarios provided by either the component's owner or your own team.

The approach was so effective that the project was continued and funded further in the form of HIT (Hadoop Integration Testing). At which point two of us left for what seemed like a greener pasture back then :(

We thought the idea was really promising, so we continued on the path of developing a less custom and more adoptable technology based on open standards such as Maven and Groovy. Here you can find slides from the talk we gave at eBay about a year ago. The presentation put the concept of a Hadoop data stack in open writing for the first time, along with stack customization and validation technology. When this presentation was given, we already had a well-working mechanism for creating, deploying, and validating both packaged and non-packaged Hadoop components.

BigTop - open-sourced for the second time just a few months ago and based on our project above - has added a package creation layer on top of the stack validation product. This, of course, makes your life even easier. And even more so with a number of Puppet recipes that allow you to deploy and configure your cluster in a highly efficient and automated manner. I encourage you to check it out.

BigTop has been successfully used to validate the release of Apache Hadoop 0.20.205, which has become the foundation of the coming Hadoop 1.0.0. Another release of Hadoop - 0.22 - used BigTop for release candidate validation, and so on.
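The deploy, configure, test flow described above can be sketched roughly as follows. This is in no way the actual FIT code: the `Component` class, the `validate` function, the dictionary standing in for a cluster, and the placeholder version strings are all assumptions invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Component:
    """A packaged component plus the scenarios shipped by its owner."""
    name: str
    version: str  # placeholder strings here, not real release numbers
    scenarios: List[Callable[[Dict], bool]] = field(default_factory=list)


def validate(component, cluster, extra_scenarios=()):
    """Deploy and configure the component on the (simulated) cluster, then
    run the owner's scenarios followed by your own team's.
    Returns the names of the scenarios that failed."""
    # "Deployment": record the component as installed on the cluster.
    cluster.setdefault("installed", {})[component.name] = component.version
    # "Configuration": mark the component as configured.
    cluster.setdefault("configured", set()).add(component.name)
    # "Testing": every scenario inspects the cluster and returns True or False.
    failures = []
    for scenario in list(component.scenarios) + list(extra_scenarios):
        if not scenario(cluster):
            failures.append(scenario.__name__)
    return failures


def hbase_sees_hadoop(cluster):
    # Scenario provided by the (hypothetical) component owner.
    return "hadoop" in cluster["installed"]


def hbase_sees_zookeeper(cluster):
    # Scenario added by your own team.
    return "zookeeper" in cluster["installed"]


if __name__ == "__main__":
    cluster = {"installed": {"hadoop": "x.y"}}
    hbase = Component("hbase", "a.b", [hbase_sees_hadoop])
    # ZooKeeper was never deployed, so the team's scenario is reported.
    print(validate(hbase, cluster, [hbase_sees_zookeeper]))
```

The point of the design is that the component owner and the cluster operator can each contribute scenarios, and a failure only shows up when everything is put together on the same cluster.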

Sunday, October 23, 2011

Pointy-haired boss from Sun Microsystems...

No kidding - there was a manager back at my last job at Sun Microsystems to whom I had to explain four times a week why you can't "just add a link to that page" of the enterprise application and have it available in 15 minutes.

I am kidding you not

