Search This Blog

Tuesday, September 25, 2012

Moot promises or just damage control?

I came across this post from Platfora which, among other trivialities, says:
Hadoop is irresistible for this reason, but the big question that remains is how to use the data there once you’ve stored it. The challenge is that Hadoop is a very different architecture to traditional data warehouses. It is a batch engine — a lumbering freight train that can process immense amounts of data, but takes a while to get up to speed, so even the simplest question requires minutes of processing.
How lyrical! And then we get a glimpse of The Promised Land lying ahead:
Here at Platfora we are laser focused on this next phase of Hadoop. The result won’t just match the status quo, but exceed it in flexibility and the ability to scale and adapt to changing requirements. Exciting times are ahead – stay tuned. 
No, wait: not exactly a promised land, just a promise of one. I wonder if this is an attempt at damage control after yesterday's announcement of a vendor's support for the Spark platform, which I was discussing in my last post? :)

Monday, September 24, 2012

BigData platform space is getting hotter

Skimming through my emails today I came across this interesting post on the general@hadoop list:

From: MTG dev <>
Subject: Lightning fast in-memory analytics on HDFS
Date: Mon, 24 Sep 2012 16:31:56 GMT
Because a lot of people here are using HDFS day in and day out the
following might be quite interesting for some.

Magna Tempus Group has just rolled out a readily available Spark 0.5
( packaged for Ubuntu distribution.  Spark delivers up
to 20x faster experience (sic!) using in-memory analytics and a computational
model that is different from MapReduce.

You can read the rest here. If you don't know about Spark, you definitely should check the Spark project website and see how cool it is. If you are too lazy to dig through the information, here's a brief summary for you (taken from the original poster's Magna Tempus Group website):
  • consists of a completely separate codebase optimized for low latency, although it can load data from any Hadoop input source, S3, etc.
  • doesn’t have to use Hadoop, actually
  • provides a new, highly efficient computational model, with programming interfaces in Scala and Java. We might soon start working on adding a Groovy API to the set
  • offers lazy evaluation that allows “postponed” execution of operations
  • can do in-memory caching of data for later high-performance analytics. Yeah, go shopping for more RAM, gents!
  • can be run locally on a multicore system or on a Mesos cluster
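To get a feel for what the lazy evaluation bullet actually buys you, here is a toy sketch in plain Python. This is emphatically not Spark's real API (the class and method names are my own invention): it only illustrates the idea that transformations are recorded, and nothing runs until an action forces the whole pipeline in one pass.

```python
# Toy illustration of Spark-style lazy evaluation (NOT the real Spark API):
# transformations (map/filter) only record work; an action (collect)
# triggers the actual computation over the data in a single pass.

class LazyDataset:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # recorded operations, not yet executed

    def map(self, fn):
        # Returns a new dataset with the operation appended; no work done yet
        return LazyDataset(self.data, self.ops + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self.data, self.ops + [("filter", pred)])

    def collect(self):                # the "action": run the pipeline now
        out = []
        for item in self.data:
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    item = fn(item)
                elif kind == "filter" and not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out

nums = LazyDataset(range(10))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has been computed yet; collect() forces execution:
print(evens_squared.collect())  # [0, 4, 16, 36, 64]
```

The in-memory caching bullet is the same idea taken one step further: once a pipeline's result sits in RAM, the next pipeline can start from it instead of re-reading from disk.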
Yawn, some might say. There are Apache Drill and other things that seem highly promising and all. Well, not so fast.

To begin with, I am not aware of any productized version of Drill (merged with Open Dremel or vice versa). Perhaps there are some other technologies around that are 20x faster than Hadoop - I just haven't heard of them, so please feel free to correct me on this.

Also, Spark and some of its components (the Mesos resource planner and such) have been happily adopted by interesting companies such as Twitter.

What is not said outright is that the adoption of new in-memory high-performance analytics for big data by commercial vendors like Magna Tempus Group opens a completely new page in the BigData storybook.

I would "dare" to go as far as to assert that this new development means that Hadoop isn't the smartest kid on the block anymore - there are other, faster and perhaps cleverer fellas moving in.

And I can't help but wonder: has Spark lit a fire under the yellow elephant yet?

Wednesday, August 15, 2012

My talk on Hadoop distro diversity at BayArea HUG (Aug'12)

I have been giving this talk about the Apache BigTop project and how it changes the landscape and competition for Hadoop distribution vendors, ISPs and ASVs.

The slides are available from here, and I will update this post once the video is published by the good folks from Yahoo!

Friday, July 27, 2012

Even elephants can't stand Forbes' articles

I just came across this article in Forbes, full of trivialities about the Hadoop platform. And then I came across this picture of a baby elephant who had perhaps read the same article over its morning feast.

Poor baby - it is allergic to the bullshit, apparently.

Sunday, July 8, 2012

BigTop is coming to town: big time.

A very insightful article has just been posted by my good friend Rvs on Apache BigTop's official blog.

And I think you should just go and read it if you care to understand why stacks are so important and how BigTop eases life for people who are trying to write something more complex than a Tic-Tac-Toe game for a smartphone.

Sunday, June 17, 2012

HortonWorks is using BigTop: no more secrets!

As my former colleague John Kreisa nicely put it in the HortonWorks 1.0 release announcement here (my warmest regards and best wishes to you guys!):
Those who have followed Hortonworks since our initial launch already know that we are absolutely committed to open source and the Apache Software Foundation. You will be glad to know that our commitment remains the same today. We don’t hold anything back. No proprietary code is being developed at Hortonworks.
And indeed. I asked this question about HortonWorks using BigTop to power up their platform offering some time ago, and later pretty much repeated it in the form of a comment on Shaun Connolly's blog. To his credit, my question was answered directly:
As far as BigTop goes, we at Hortonworks are using parts of BigTop for the HDP platform builds, so thanks for the efforts there!
I met the gentleman in person at the recent Hadoop Summit and we had a short yet nice chat about enterprise stacks and the role open-source technology plays there.

So, it is time to put my initial question to rest as fully answered.

P.S. On a separate note: I left a slightly different comment on Cloudera's blog. Somehow, the comment doesn't appear to be visible (at least I don't see anything but the "2 comments" line), nor has it been answered publicly (again, perhaps it has, but I don't see it on the page). In Cloudera's defense I have to say that I got an email reply from one of their execs, which I can't publish since it was a private message.

Baby Hadoop on the balls. Well, just one ball, technically...

I think this is awesome, really ;) Warms my heart and all that!

Saturday WTF: I hope "the worst manager ever" has seen this strip

While it isn't technically Saturday anymore ('cause it is already 2 minutes into Sunday), this totally belongs to my "Saturday WTF"

because there's no way to explain W(hy)TF the worst manager can't do better than 20 minutes. I am sure he can easily do 28 or even 32!

Sunday, June 10, 2012

Mutt can work with non-standard IMAP servers

Thanks to sid-cypher I can now use my favorite mutt with screwed-up IMAP implementations such as the one from . The trick to avoid the endless "Logging in..." is to set
  set imap_pipeline_depth=0
for the particular account. And there it is - works like a charm!
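For completeness, the "for the particular account" part can be done with mutt's account-hook, so that pipelining stays enabled everywhere else. A minimal sketch (the hostname below is a placeholder, and 15 is - as far as I remember - mutt's default pipeline depth):

```muttrc
# Catch-all first: keep the default pipelining for every other account
account-hook . 'set imap_pipeline_depth=15'
# Then disable pipelining only for the problematic server
# (hostname is a placeholder - substitute your own)
account-hook imap://user@broken-imap.example.com/ 'set imap_pipeline_depth=0'
```

The catch-all hook has to come first, since all matching account-hooks fire in the order they appear in the muttrc.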

Saturday, June 2, 2012

Saturday's WTF: Hadoop certification for coffee grinders

This one just jumped out at me from nowhere
Now, W(here)TF is Cloudera? Don't they have any idea how to fix coffee grinders?

Monday, May 28, 2012

LaTeX is cool and easy

Well, last week I was working on a document that needed to be neatly formatted, and I didn't want to go through the usual chores of a WYSIWYG environment. Yes, I am a big fan of OpenOffice, but it has some drawbacks when it comes to accurately representing your work once printed.

Anyway, I decided to give LaTeX a try. I've been avoiding this thing since my time in grad school: I thought it was crazy, a mess, and hard to learn. Well, either I got much smarter in the last 20 years, or Mr. Knuth is an incredibly clever person, or likely both ;)

It took me about twenty minutes to get everything installed and running (thank you, Ubuntu, for packaging up TexMaker!) and the draft version of my paper neatly formatted. Well, I then spent another two hours polishing it to where I finally liked it, but that is beside the point - I have high standards :)

Did I fall in love with TeX? No, I don't think so, but it turns out that programming your papers is fun and lets you produce professional-looking documents with ease.
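In case you're curious how little it takes to get going, here is roughly what a first paper skeleton looks like. A minimal sketch only: the class options and packages are just what I'd reach for, not a prescription.

```latex
\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}   % handy for any math down the road

\title{My Neatly Formatted Paper}
\author{A. Happy Convert}
\date{May 2012}

\begin{document}
\maketitle

\section{Introduction}
LaTeX takes care of fonts, spacing and numbering,
so you can concentrate on the text itself.

\section{A Little Math}
Typesetting like
\[
  e^{i\pi} + 1 = 0
\]
is where LaTeX really shines.

\end{document}
```

Run it through pdflatex (or let TexMaker do it for you) and you get a clean, numbered, professionally typeset PDF out of the box.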

Thursday, May 3, 2012

HortonWorks distribution: secretly powered by iTest and BigTop?

As I've mentioned here and here, BigTop is a really neat concept that a couple of friends and I put together a couple of years ago. Interestingly enough, the second incarnation of the concept (known as iTest) led my then manager to accuse me of stealing software in favor of the ASF, followed by a forceful departure from Cloudera (a kick in the ass, with an "I am sorry" kinda smile on their face). But all this is in the past. The present, however, is much more interesting...

Some evidence has been found that the BigTop project (see the first two links above) is the power behind the commercial offerings of some of the leading Hadoop vendors. Here's how interesting it gets:

  • you can get HortonWorks' stack AMI from this link to play with and learn about Hadoop and stuff.
  • now let's see what they are using to power up their distro
$ grep -i bigtop
    --extra-dir=DIR    path to Bigtop distribution files
    --build-dir=DIR    path to Bigtop distribution files
if [ -e /usr/libexec/bigtop-detect-javahome ]; then
  . /usr/libexec/bigtop-detect-javahome
elif [ -e /usr/lib/bigtop-utils/bigtop-detect-javahome ]; then
  . /usr/lib/bigtop-utils/bigtop-detect-javahome
oozie.init:if [ -e /usr/libexec/bigtop-detect-javahome ]; then
oozie.init:  . /usr/libexec/bigtop-detect-javahome
oozie.init:elif [ -e /usr/lib/bigtop-utils/bigtop-detect-javahome ]; then
oozie.init:  . /usr/lib/bigtop-utils/bigtop-detect-javahome
oozie.spec:#Requires: bigtop-utils
Now if you look a bit further, there are some more hints that BigTop has been used big time but hasn't been hidden away very well...
oozie.init:if [ -e /usr/libexec/bigtop-detect-javahome ]; then
oozie.spec:#Requires: bigtop-utils
Wow, I feel proud, guys, I really do. Now I can officially claim that my ideas are giving a push to some of the biggest vendors of what is considered the hottest technology on the market ;) Although, judging by the above example, they are unlikely to admit this publicly.

I will keep you posted on the development.

Monday, April 2, 2012

First ever Hadoop 1.0 open-source data analysis stack is out!

Basically, that says it all: the BigTop 0.3 release is out, as I've written in the blog. This is a major step in BigTop's evolution and a very significant step in advancing the open-source data analytics stack on top of Hadoop. Now people and enterprises alike can have a Hadoop big data stack that is reliable and fully validated for compatibility.

Why bother, you might say? Well, the biggest thing, of course, is that it is a 100% genuine open-source product. If you decide to go with it you'll protect your business from vendor lock-in, which is always possible when you go with a commercial provider

[boilerplate to put the name of your favorite vendor in here]  

So, go ahead: configure the BigTop packages repository for your Linux distribution of choice and enjoy the mighty Hadoop stack!

Friday, March 9, 2012

Don't ask questions or an ebay seller will block you

I am sure someone will find this funny. It is a bit of a longish read - 2 minutes or so - but it might save someone a few hundred dollars, so it may well be worth it.

I've been looking for a while to get myself a new X220, and I found an interesting seller called newthinkpads with a great deal on exactly what I was looking for. The seller looks great too: 99.7% positive feedback and all that.

Before clicking the "Buy now" button I re-read the description once again and found some discrepancies. Naturally, when you are about to shell out $1500 you want to make sure you get what you're paying for, so I sent them a polite message:
This tablet isn't convertible, is it? The picture shows convertible model but the number is X220 which isn't. Kinda misleading information ;(
The response was a bit abrupt
newthinkpads> The item is exactly as stated in the listing.    
So, I've pressed a bit more
So, it is convertible then? Please confirm.
Also, in the description it says "Intel Centrino Advanced-N 6205N WiFi", but the Additional information section states: "Network Card 10/100/1000 + Wireless (802.11a/b/g/n)". So which one is it after all?
And I got a response which left me wondering
newthinkpads> Sorry I am familiar with your terminology and no customer has ever asked us this before as our listing matching Lenovo website exactly.   
I wanted to explain as simple as possible
Sorry, I will make it simpler for you if Lenovo's term "convertible" is too hard to grasp: can I swivel the monitor of this model once opened? Or can it only be opened and closed?
As for the network card: you display two _different_ network cards in your ad. Lenovo allows one or the other - not both.

Are you deliberately not answering my questions? How do you expect me to buy a computer when I don't even know what I am buying because of the confusing information from your store?
And got the link to a completely different model from Lenovo website
newthinkpads> This is the item we are selling, but with slightly different specs so you can just compare:  
So, I said "wait a second"
Ok, so what I see is that it is convertible and has
  4 GB DDR3 - 1333MHz (1 DIMM)
however, your ad says it's sold with 8GB. So, how much memory does it have after all? 4 or 8 gigabytes?
And they called me an idiot
newthinkpads> It sounds like you are confused so perhaps ask a friend to walk you through the eBay listing.
So, I called a spade a spade
This is an insult! I have been doing computers for 25 years: building them,
repairing them and programming for them. And you call me confused over
an apparent difference of 4 vs 8 GB of memory in Lenovo's config vs yours? WTF?
I think you need to fix your ads, fellas - they are highly inaccurate.
And I will try one last time. Could you please send me the Lenovo model number
for the laptop you're selling? It looks something like "42962YU" or similar.
Hopefully, it isn't very hard for you wise kids.
And they offered to me go and f*ck myself
newthinkpads> Now you are just added to our block list for being a problem
buyer so you cannot purchase or contact us again.  This is through the ebay
messaging system so if you reply the email will not be delivered to us.
WTF? I am a problem buyer? Since when does a buyer have to be a silent victim without the right to ask questions? This is an interesting motto - it sounds like highway robbery to me: "Gimme your money and shut up!".

Hmm... maybe, after all, I should be thankful that they locked me out, because now I am not going to be defrauded? And I wonder how hard it is to massage yourself 6000+ positive feedback ratings while being a fraudster and an impolite jerk like that one? And I sure am not gonna send them another message - I will post this on the internet, where everyone can read it.

Friday, March 2, 2012

Well-formed stack...

in my garage. Now I am ready for the coming summer

Thursday, February 9, 2012

Updated version of elephants genealogy

The release manager of Hadoop 0.22 and my namesake Konstantin pointed out that my diagram has alignment problems. So, I have posted the latest version to the same post. Enjoy.

Wednesday, February 8, 2012

I wish I could draw like Scott Adams

Because this is a complete Dilbert strip. Please, Mr. Adams - make the next one like this? :)

I swear - this was the weirdest chat in my life (grammar and all that are original):

Me: Good morning ;)
A Person: no need to wink
AP: I don't like it
AP: not professional
Me: Am I winking?
Me: Interesting...
Me: sorry for the non-professional offense
AP: You emoitcons
Me: ah… I don't have any icons actually. It just looks like semicolon
followed by a parenthesis. Didn't know it is unprofessional. It
usually is in IT industry. In fact, it came from IT 
Me: Sorry, if you've found it weird or something. 
AP: just unprofessional.  Leave it at that.                          

Duh ;(

Saturday, January 14, 2012

What you wanted to know about Hadoop, but were too afraid to ask: genealogy of elephants.

Hadoop is taking center stage in discussions about processing large amounts of unstructured data.

With the rising popularity of the system, I found that people are really puzzled by the multiplicity of Hadoop versions; the small yet annoying differences introduced by different vendors; the frustration when vendors try to lock in their customers while using readily available open-source data analytics components on top of Hadoop; and on and on.

So, after explaining who was born from whom for the 3rd time - and I tell you, drawing neat pictures on a napkin in a coffee shop isn't my favorite activity - I put together the little diagram below. Click on it to inspect it in greater detail. A warning: the diagram only includes more or less significant releases of Hadoop and Hadoop-derived systems available today. I don't want to waste any time on obscure releases or branches which have never been accepted at any significant level. The only exception is 0.21, which was a natural continuation of 0.20 and the predecessor of the recently released 0.22.

Some explanations for the diagram:
  • Green rectangles designate official Apache Hadoop releases openly available for anyone in the world for free
  • Black ovals show Hadoop branches that are not yet officially released by Apache Hadoop (or might not be released ever). However, they are usually available in the form of source code or tar-ball artifacts
  • Red ovals are for commercial Hadoop derivatives which might be based on Hadoop or use Hadoop as a part of a custom system (as in the case of MapR). These derivatives may or may not be compatible with Hadoop and the Hadoop data processing stack.
Once you're presented with a view like this, it becomes clear that there are two centers of gravity in today's universe of elephants: 0.20.2-based releases and derivatives, and 0.22-based branches, future releases, and derivatives. It also becomes quite clear which ones are likely to be sucked into a black hole.

The transition from 0.20+ to 0.2[1,2] was really critical because it introduced true HDFS append, fault injection, and code injection for system testing. Add to that the fact that 0.21 wasn't released for a long time, creating an empty space in a high-demand environment. Even after it did come out, it didn't get any traction in the community. Meanwhile, HDFS append was critical for HBase to move forward, so 0.20.2-append was created to support that effort. A quite similar story happened with 0.22: two different release managers tried to get it out; the first gave up, but the second actually succeeded in pulling part of the community's effort towards it.

As you can see, HDFS append wasn't available in an official Apache Hadoop release for some time (except for 0.21, with the earlier disclaimer). Eventually it was merged into 0.20.205 (recently dubbed Hadoop 1.0), which allows HBase to be nicely integrated with official Apache Hadoop without any custom patching process.

The release of 0.20.203 was quite significant because it provided heavily tested Hadoop security, developed by the Yahoo! Hadoop development team (known as HortonWorks nowadays). Bits and pieces of 0.20.203 - even before the official release - were absorbed by at least one commercial vendor to add corporate-grade Kerberos security to their Hadoop derivatives (as in the case of Cloudera CDH3).

The diagram above clearly shows a few important gaps in the rest of the commercial offerings:
  1. none of them supports Kerberos security (EMC, IBM, and MapR)
  2. unavailability of HBase due to the lack of HDFS append in their systems (EMC, IBM). In the case of MapR you end up using a custom HBase distributed by MapR. I don't want to make any speculations about the latter in this article.
Apparently, the vacuum of significant releases between 0.20 and 0.22 turned out to be a major urge for the Hadoop PMC, and now - just days after the release of 1.0 - 0.22 got out, with 0.23 already going through the release process, championed by the HortonWorks team. That release brings in some interesting innovations like HDFS Federation and MapReduce 2.0.

Once the current alpha of 0.23 (which might become Hadoop 2.0 or even Hadoop 3.0) is ready for the final release, I would expect new versions of commercial distributions to spring to life, as was the case before. At that point I will update the diagram :)

If you imagine the variety of other animals, such as Pig and Hive, piling on top of Hadoop, you'll be astonished by the complexity of the inter-component relations and, more importantly, by the intricacies of building a stable data processing stack. This is why another Apache project, BigTop, has been so important and popular ever since it sprang to life last year. You can read about Bigtop here or here.

Tuesday, January 3, 2012

Can SEI really teach you how to be Hadoop contributor?

Or of anything else for that matter?

I am kidding you not... I just got this email from the SEI. In the interest of full disclosure, here it is:
To the attention of: <me>

The Software Engineering Institute (SEI) has been asked to conduct a sample survey of committers to the Hadoop Distributed File System. The results will be used to supplement existing documentation that can be used in providing guidance to HDFS contributors as well as support committers in preparing their own HDFS contributions.

You are part of a carefully chosen sample of HDFS committers for the survey. So your participation is necessary for the results to be accurate and useful. Answering all of the questions should take about 15 or 20 minutes. Any information that could identify you or your organization will be held in strict confidence by the SEI under promise of non disclosure.

You will find your personalized form on the World Wide Web at Please be sure to complete it at your earliest convenience -- right now if you can make the time. You may save your work at any time, and you may return to complete your form over more than one session if necessary for any reason. Everything will be encrypted for secure transfer and storage.
Now, let's follow the link and dig out some pearls which, I am sure, have to be in the work of such a venerable organization. What are they covering exactly?
  • Reducing unnecessary dependencies and propagation, e.g., identifying cyclic dependencies between classes in the source code 
  • Difficulty in managing data
  • Difficulty in managing namespaces
  • Identifying location of bugs
  • difficulty finding test suites
  • Communication between application
  • Reducing unnecessary dependencies and propagation
  • yada-yada-yada
Ah, I think I got the picture... boring... the 1534th study in a row on how to write effective code. Some things I like in particular:
  • "You are part of a carefully chosen sample of HDFS committers" - no shit, there's a plenty to select from, of course.
  • "Are you familiar with the (HDFS) Architectural Documentation at" - what? Are you kidding me? How did the architectural docs for an ASF project end up there? Did the design come from Hawaii? Or could you not find it where the project belongs - on the Apache site?
Here's the news, my dear doctors from SEI: just try to sit down and write the code, and learn from others; grok the best gems written by bright practitioners. That's pretty much what it takes - one doesn't need anything like CMMI in order to create great software. I will let myself make an even stronger assertion: one needs processes in place to make a bunch of ineffective and inexperienced folks produce something useless that can later be sold to an idiot customer with a lifetime of support fees attached.

Meanwhile, the reality is that today you see a ratio of three software "managers" graduated by US universities for every decent developer who doesn't need help on day one to find his own butt with both hands, a GPS navigator, and a flashlight.

The main reason open-source software is thriving today and constantly kicking the asses of companies with established processes is that people aren't afraid to fail, nor to experiment on their own dime and time. In other words, they don't give a shit about CMU teaching them how to write great code - they just learn it in the field and then do what it takes by learning from others. You clearly don't need formal training for that. Perhaps Khan Academy is what's really needed.

You know that old saying: "If you can't do a job, go into management; if you can't manage, teach." I would amend it with "...; if you can't teach, go into software process research."

Although, I won't be totally surprised to see some fat-ass book on how to contribute to Hadoop coming out of CMU very soon. And it might even become a best seller on Amazon or something. But I know for sure that by that time the OSS community will be far along into making the next great thing!

And some other day I shall tell the story of that grad student from Berkeley who was all set to write the greatest benchmarking "solution" for Hadoop - that deserves a separate post, because the guy was learning from CMMI, I guess.

Am I too acidic today? Must be this damn sunny California weather or something.