Search This Blog

Sunday, December 29, 2013

The best thing to do if you have ASUS 700t

In short - just get rid of the stock ASUS software as fast as possible. I've been foolish enough to put up with it for far too long. Despite frequent unwarranted reboots and sluggish performance, I didn't try to get rid of it for about a year. It finally pushed me over the edge with 12 reboots over the last two days.

Here's literally all I did:
  • Get Titanium Backup from the Play Store. I can't recommend the Pro version highly enough, as it makes backups and especially restores a breeze.
  • Back up your applications and user data. While not required, I used an external SD card for that - not the embedded device storage.
  • Follow the instructions from the CyanogenMod Wiki. I happened to have the TWRP recovery image installed, and it works just fine. However, make sure the bootloader version is 10.6.1.14.4 or higher. You might want to check my earlier post on rooting the ASUS 700t.
  • I chose to install release version 10.2 (along with the recommended Google Apps zip file) from their download area.
  • Once in the recovery, I wiped the system, caches, and data, and did a Factory Reset on top of it. Then I installed the zip files downloaded earlier and rebooted the device for the first time.
  • The first boot took a couple of minutes - I guess some initialization is done on the first run - so be patient. A very easy, self-explanatory setup procedure starts as soon as the boot process is over. Once the system is configured you should have access to the Internet, your email, calendar, and Market.
  • Now it is time to restore your stuff to its original glory :) I recommend installing Titanium Backup from the Market first. Then run it and change the preferences to point to your earlier backup location. From there, I recommend restoring Titanium Backup PRO first - that will make the restoration process so much easier.
  • During the restore you can safely skip any of the annoying ASUS apps and services you don't need. I actually recommend not restoring the Device Unlock app - for whatever reason the restore process hung on it in my case.
  • Everything is restored. Now do one more reboot just in case, and you have your system back - flying high and fast.
  • One thing you might want to pay attention to is the new Privacy Guard, which allows you to restrict what apps can learn and share about you. In other words, you now have fine-grained control over your personal data and can prevent apps from imposing totally insane and unrealistic permission settings.
What I noticed immediately is that I no longer have the blank message issue in my K-9 Mail that had been haunting me for about 6 months. It is gone for good! The keyboard works perfectly - I am typing this blog post on my Transformer. So by all means, folks - get yourself CyanogenMod and experience a brand-new, fast tablet!

Friday, December 27, 2013

Annual review of Bigdata software; what's in store for 2014

In the couple of days left before the year's end, I wanted to look back and reflect on what has happened so far in the IT bubble 2.0 commonly referred to as "BigData". Here are some of my musings.

Let's start with this simple statement: BigData is a misnomer. Most likely it was put forward by some PR or MBA schmuck with no imagination whatsoever, who thought that a terabyte consists of 1000 megabytes ;) The word has been picked up by pointy-haired bosses all around the world, as they need buzzwords to justify their existence to the people around them. But I digress...

So what has happened in the last 12 months in this segment of software development? Well, surprisingly, the really interesting events can be counted on one hand. To name a few:
  • Fault tolerance in distributed systems got to a new level with NonStop Hadoop, introduced by WANdisco earlier this year. The idea of avoiding complex screw-ups by agreeing on operations up-front is leaving things like Linux HA, Hadoop QJM, and NFS-based solutions rolling in the dust in the rear-view mirror.
  • Hadoop HDFS is clearly here to stay: you can see customers shifting from platforms like Teradata towards cheaper and widely supported HDFS network storage; with EMC (VMWare, Greenplum, etc.) offering it as the storage layer under Greenplum's proprietary PostgreSQL cluster, and many others.
  • While enjoying a huge head start, HDFS has a strong, if not very obvious, competitor - Ceph. As some know, there's a patch that provides a Ceph drop-in replacement for HDFS. But where it gets really interesting is how systems like Spark (see the next paragraph) can work directly on top of the Ceph file-system with relatively small changes in the code. Just picture it:
    distributed Linux file-system <-> high-speed data analytics
    Drawing conclusions is left as an exercise to the readers.
  • With the recent advent and fast rise of the new in-memory analytics platform - Apache Spark (incubating) - the traditional, two-bit MapReduce paradigm is losing its grip very quickly. The gap is getting wider with a new generation of task and resource schedulers gaining momentum by the day: Mesos, the Spark standalone scheduler, Sparrow. The latter is especially interesting with its 5ms scheduling guarantees. That leaves the latest reincarnation of MR in a predicament.
  • Shark - SQL layer on top of Spark - is winning the day in the BI world, as you can see it gaining more popularity. It seems to have nowhere to go but up, as things like Impala, Tez, ASF Drill are still very far away from being accepted in the data-centers.
  • With all above it is very exciting to see my good friends from AMPlab spinning up a new company that will be focusing on the core platform of Spark, Shark and all things related. All best wishes to Databricks in the coming year!
  • Speaking of BI, it is interesting to see that Bigdata BI and BA companies are still trying to prove their business model and make it self-sustainable. Cases in point: Datameer with its recent $19M D-round; Platfora's last year $20M B-round, etc. I reckon we'll see more fund-raisers in the 10^7 or perhaps 10^8 dollar range in the coming year, among application and platform companies alike. Also, new letters will be added to the mix: F-rounds, G-rounds, etc., as cheap currency keeps finding its way from the Fed through the financial sector to the pockets of VCs and further down to high-risk sectors like IT and software development. This will lead to an over-heated job market in Silicon Valley and elsewhere, followed by a blow-up similar to, but bigger than, 2000-2001. It will be particularly fascinating to watch big companies scavenging the pieces after the explosion. So duck to avoid shrapnel.
  • Stack integration and validation has become a pain-point for many. I see the effects of it in the sharp uptake of interest in, and growth of, the Apache Bigtop community. Which is no surprise, considering that all commercial distributions of Hadoop today are based on, or directly use, Bigtop as the stack-producing framework.
While I don't have a crystal ball (would be handy sometimes) I think a couple of very strong trends are emerging in this segment of the technology:

  • HDFS availability - and software stack availability in general - is a big deal: with more and more companies adding an HDFS layer into their storage stack, stricter SLAs will emerge. And I am not talking about 5 nines - an equivalent of 5 minutes of downtime per year - but rather about 6 and 7 nines. I think ZooKeeper-based solutions are in for a rough ride.
  • Machine Learning has huge momentum. Spark Summit was one big piece of evidence for it. With this comes the need for incredibly fast scheduling and hardware utilization. Hence, things like Mesos, Spark standalone, and Sparrow are going to keep gaining momentum.
  • Seasonal lemming-like migration to the cloud will continue, I am afraid. Security will become a red-hot issue and an investment opportunity. However, anyone who values their data is unlikely to move to the public cloud; hence, private platforms like OpenStack might be on the rise (if the providers can deal with their "design by committee" issues, of course).
  • Storage and analytics stack deployment and orchestration will be more pressing than ever (and no, I am talking about real orchestration, not cluster management software). That's why I am looking very closely at what companies like Reactor8 are doing in this space.

So, last year brought a lot of excitement and interesting challenges. 2014, I am sure, will be even more fun. However, "living in interesting times" might be both a curse and a blessing. Stay safe, my friends!

Monday, September 16, 2013

MS Windows is so incompatible, that a host Linux system needs to be incompatible...

Check this out, guys - you simply can't make up stuff like this.

This is what you'll see if you try to install the Wine compatibility layer on Linux (Ubuntu 12.04 in my case):

So, basically, in order to install the "piece of s&^t MS Windows" compatibility layer, one needs to render the Linux host incompatible with the LSB. Good job, Microsoft! As always!

Oh, and you see this finger, Bill "small-&-soft" Gates, right?

Saturday, September 7, 2013

Hadoop Genealogy: continued

Keeping up my promises and updating the famous Hadoop genealogy tree again. Now with Hadoop 2.0.5 as well as recently released 2.0.6 and 2.1.0. Enjoy!

Sunday, June 30, 2013

High Availability is the past; Continuous Availability is the future

Do you know what SiliconAngle and the Wikibon project are? If not - check them out soon. These guys have a vision for next-generation media coverage. I would call it '#1 no-BS Silicon Valley media channel'. They run professional video journalism with a very smart technical setup. And they aren't your typical loudmouths from TV: they use and grok the technologies they cover. Say, they run Apache Solr in-house for real-time trend processing and searches. Amazing. And they don't have teleprompters. Nor screenplay writers. How cool is that?

At any rate, I was invited onto their show, theCube, last week on the last day of Hadoop Summit. I was talking about High Availability issues in Hadoop. Yup, High Availability has issues, you heard me right. The issue is less-than-100% uptime. Basically, even if someone claims to provide 5 nines (that is, 99.999% uptime), you are still looking at around 5 minutes a year of downtime for mission-critical infrastructure.
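The arithmetic behind "the nines" is simple enough to sketch (a trivial illustration, nothing more):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an uptime of the given number of nines."""
    unavailability = 10.0 ** -nines   # e.g. 5 nines -> 0.00001
    return unavailability * MINUTES_PER_YEAR

for n in (3, 5, 6, 7):
    print(f"{n} nines: {downtime_minutes_per_year(n):.3f} minutes/year")
```

Five nines works out to roughly 5.26 minutes of downtime per year; each extra nine cuts that by a factor of ten.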

If you need 100% uptime for your Hadoop, then you should be looking for Continuous Availability. Curiously enough, the solution is found in the past (isn't that always the case?) in the so-called Paxos algorithm, published by Leslie Lamport back in 1989. However, the original Paxos algorithm has some performance issues and was never fully embraced by the industry; it is rarely used outside of a few tech-savvy companies. One of them - WANdisco - applied it first to Subversion replication and now to the Hadoop HDFS SPOF problem, and made it generally available as a commercial product.
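For the curious, the core of single-decree Paxos is compact enough to sketch. This is a toy, in-process model - the Acceptor/propose names and the direct method calls are mine, purely for illustration; a real implementation (like the ones mentioned above) deals with networks, retries, and failures:

```python
class Acceptor:
    """A single Paxos acceptor: remembers its promises and accepted value."""
    def __init__(self):
        self.promised = -1      # highest proposal number promised so far
        self.accepted = None    # (proposal_number, value) accepted, if any

    def prepare(self, n):
        # Phase 1b: promise to ignore proposals numbered below n
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered promise was made
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "nack"


def propose(acceptors, n, value):
    """Run one proposal round; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    # Phase 1a: prepare
    replies = [a.prepare(n) for a in acceptors]
    promises = [accepted for kind, accepted in replies if kind == "promise"]
    if len(promises) < majority:
        return None
    # If any acceptor already accepted a value, we must re-propose the one
    # with the highest proposal number - this is what makes Paxos safe.
    prior = [p for p in promises if p is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2a: ask acceptors to accept
    acks = sum(1 for a in acceptors if a.accept(n, value) == "accepted")
    return value if acks >= majority else None
```

Once a majority has accepted a value, any later proposer discovers it in phase 1 and is forced to re-propose it - which is exactly the property that lets a replicated system agree on operations up-front without a single point of failure.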

And just think what can be done if the same technology is applied to mission critical analytical platforms such as AMPlab Spark? Anyway, watch the recording of my interview on theCube and learn more.

Sunday, May 19, 2013

YDN has posted the video from my Aug'12 talk about Hadoop distros

As a follow-up to my last year's post: I just found that the video of the talk has been posted on the YDN website. I apologize for the audio quality - echo and all - but you should still be able to make it out at a higher volume.

And in a bit you should be able to see another talk from May'13 about Hadoop stabilization.

Thursday, April 25, 2013

Hadoop 2: "alpha" elephant or not? Part 2: features and opinions

In the first part of this article I looked into the development that brought us Hadoop 2. Let's now try to analyze whether Hadoop 2 is ready for general consumption, or if it's all just business hype at this point. Are you better off sticking with the old, not-that-energetic grandpa who nonetheless delivers every time, or riding with the younger fella who might be a bit "unstable"?

New features

Hadoop 2 introduces a few very important features such as
  • HDFS High Availability (HA) with the Quorum Journal Manager (QJM). This is what it does:
    ...In order for the Standby node to keep its state synchronized with the Active node in this implementation, both nodes communicate with a group of separate daemons called JournalNodes…In the event of a fail-over, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a fail-over occurs.
    There's an alternative approach to HDFS HA that requires an external filer (a NAS or NFS server) to store a copy of the HDFS edit logs. In the case of a failure of the primary NameNode, a new one can be brought up, and the network-stored copy of the logs can be used to serve the clients. This is essentially a less optimal approach than QJM, as it involves more moving parts and requires more complex DevOps.

  • An HDFS federation that essentially allows combining multiple namespaces/namenodes into a single logical filesystem. This allows for better utilization of higher-density storage.

  • YARN essentially implements the concept of Infrastructure-As-A-Service. You can deploy your non-MR applications to cluster nodes using YARN resource management and scheduling.

    Another advantage is the split of the old JobTracker into two independent services: resource management and job scheduling. This gives a certain advantage in the case of a fail-over and, in general, is a much cleaner approach to the MapReduce framework implementation. YARN is API-compatible with MRv1, so you don't need to do anything to your MR applications beyond perhaps recompiling the code. Just run them on YARN.
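For reference, the QJM-based HA setup described above boils down to a handful of hdfs-site.xml properties (a minimal sketch only - the "mycluster" nameservice and the nn1/nn2 and jn1-jn3 host names are made-up placeholders):

```xml
<!-- hdfs-site.xml: minimal QJM HA sketch; all names are placeholders -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
```

The Standby NameNode tails the edits from the JournalNode quorum, which is what lets it promote itself with a fully synchronized namespace.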

Improvements

The majority of the optimizations were made on the HDFS side. Just a few examples:
  • overall file system read/write improvements: I've seen reports of >30% performance increase from 1.x to 2.x with the same workload
  • read improvements for DN and client collocation HDFS-347 (yet to be added to the 2.0.5 release)
A good overall overview of the HDFS road map can be found here.

Vendors

Here's how the bets are spread among commercial vendors, with respect to supported production-ready versions:
              Hadoop 1.x    Hadoop 2.x
Cloudera      x[1]          x
Hortonworks   x             -
Intel         -             x
MapR          x[1]          x
Pivotal       -             x
Yahoo!        -             x[2]
WANdisco      -             x

The worldview of software stacks

In any platform ecosystem there are always a few layers: they are like onions; onions have layers ;)
  • in the center there's a core, e.g. OS kernel
  • there are a few inner layers: the system software, drivers, etc.
  • and the external layers of the onion... err, the platform -- the user space applications: your web browser and email client and such
The Hadoop ecosystem isn't that much different from Linux. There's
  • the core: Hadoop
  • system software: HBase, ZooKeeper, Spring Batch
  • user space applications: Pig, Hive, users' analytics applications, ETL, BI tools, etc.
The responsibility of bringing all the pieces of the Linux onion together lies with the Linux distribution vendors: Canonical, Red Hat, SUSE, etc. They pull certain versions of the kernel, libraries, and system and user-space software into place and release these collections to the users. But first they make sure everything fits together nicely and add some of their secret sauce on top (think Ubuntu Unity, for example). Kernel maintenance is not a part of the distribution vendors' daily business, yet they do submit patches and new features. A set of kernel maintainers is then responsible for bringing changes into the kernel mainline. Kernel advancements happen under very strict guidelines: breaking compatibility with user-space is rewarded by placing the guilty person straight into the 8th circle of Inferno.

Hadoop practices a somewhat different philosophy than Linux, though. Hadoop 1.x is considered stable, and only critical bug fixes get incorporated into it (Table 2), whereas Hadoop 2.x is moving forward at a higher pace and most improvements go there. That comes at a cost to user-space applications. The situation has supposedly been addressed by labeling Hadoop 2 as 'alpha' - for about a year now. On the other hand, such tagging arguably prevents user feedback from flowing into the development community. Why? Because users and application developers alike are generally scared away by the "alpha" label: they'd rather sit and wait until the magic of stabilization happens. In the meanwhile, they might use Hadoop 1.x.

And, unlike the Canonical or Fedora project, there's no open-source integration place for the Hadoop ecosystem. Or is there?

Integration

There are 12+ different components in the Hadoop stack (as represented by the BigTop project). All of them move at their own pace and, more often than not, support both versions of Hadoop. This complicates development and testing, and creates a large number of integration issues. Just think about the variety of library dependencies that might all of a sudden conflict or have bugs (HADOOP-9407 comes to mind). Every component also comes with its own configuration, adding insult to injury on top of all the tweaks in Hadoop itself.

All this creates a lot of trouble for the DevOps folks who need to install, maintain, and upgrade your average Hadoop cluster. In many cases, they simply don't have the capacity or knowledge to build and test a new component of the stack (or a newer version of it) before bringing it to the production environment. Most smaller companies and application developers don't have the expertise to build and install multiple versions from the release source tarballs, configure, and performance-tune the installation.

That's where software integration projects like BigTop come into the spotlight. BigTop was started by Roman Shaposhnik (ASF Bigtop, PMC Chair) and Konstantin Boudnik (ASF Bigtop, PMC) on the Yahoo! Hadoop team back in 2009-2010. It was a continuation of earlier work based on expertise in software integration and OS distributions. BigTop provides a versatile tool for creating software stacks with predefined properties, validates the compatibility of their integral parts, and creates native Linux packaging to ease the installation experience.

BigTop includes a set of Puppet recipes -- Puppet being an industry-standard configuration management system -- that allow spinning up a Hadoop cluster in about 10 minutes. The cluster can be configured for Kerberized or non-secure environments. A typical BigTop release looks like a stack's bill of materials plus source code. It lets anyone quickly build and test a packaged Hadoop cluster with a number of typical system and user-space components in it. Most of the modern Hadoop distributions use BigTop openly or under the hood, making BigTop a de facto integration spot for all upstream projects.

Conclusions

Here's Milind Bhandarkar (Chief Architect at Pivotal):
As part of HAWQ stress and longevity testing, we tested HDFS 2.0 extensively, and subjected it to the loads it had never seen before. It passed with flying colors. Of course, we have been testing the new features in HDFS since 0.22! EBay was the first to test new features in HDFS 2.0, and I had joined Konstantin Schvachko to declare Hadoop 0.22 stable, when the rest of the community called it crazy. Now they are realizing that we were right.
YARN is known for very high stability. Arun Murthy - RM of all of 2.0.x-alpha releases and one of the YARN authors - in the 2.0.3-alpha release email:
# Significant stability at scale for YARN (over 30,000 nodes and 14 million applications so far, at time of release - see here)
And there's this view that I guess is shared by a number of application developers and users sitting on the sidewalks:
I would expect to have a non-alpha semi-stable release of 2.0 by late June or early July.  I am not an expert on this and there are lots of things that could show up and cause those dates to slip.
In the meanwhile, six out of seven vendors are using and selling Hadoop 2.x-based storage and data analytics solutions, system software, and services. Who is right? Why has the "alpha" tag been kept on for so long? Hopefully, now you can make your own informed decision.

References:

[1]: EOLed or effectively getting phased out
[2]: Yahoo! is using Hadoop 0.23.x in production, which essentially is very close to the Hadoop 2.x source base

Monday, April 22, 2013

Hadoop 2: "alpha" elephant or not?

Today I will look into the state of Hadoop 2.x and try to understand what has kept it in the alpha state to date. Is it really an "alpha" elephant? This question keeps popping up on the Internet and in conversations with customers and business partners. Let's start with some facts first.

The first anniversary of the Hadoop 2.0.0-alpha release is around the corner. SHA 989861cd24cf94ca4335ab0f97fd2d699ca18102 was made on May 8th, 2012, marking the first-ever release branch of the Hadoop 2 line (in the interest of full disclosure: the actual release didn't happen until a few days later, May 23rd).[1]

It was a long-awaited event. And sure enough, the market accepted it enthusiastically. The commercial vendor Cloudera announced its first Hadoop 2.x-based CDH4.0 at the end of June 2012, according to this statement from Cloudera's VPoP -- just a month after 2.0.0-alpha went live! So, was it solid, fact-based trust in the quality of the code base, or something else? An interesting nuance: MapReduce v1 (MRv1) was brought back despite the presence of YARN (a new resource scheduler and a replacement for the old MapReduce). One of those things that make you go, "Huh...?"

We've just seen the 2.0.4-alpha RC vote getting closed: the fifth release in a row in just under one year. Many great features went in: YARN; HDFS HA; HDFS performance optimizations, to name a few. An incredible amount of stabilization has been done lately, especially in 2.0.4-alpha. Let's consider some numbers:

Table1: JIRAs committed to Hadoop between 2.0.0-alpha and 2.0.4-alpha releases
HADOOP       383
HDFS         801
MAPREDUCE    219
YARN         138
That's about 1,500 fixes and features since the beginning. Which was to be expected, considering the scope of implemented changes and the need for smoothing things out.

Let's for a moment look into Hadoop 1.x -- essentially the same old Hadoop 0.20.2xx -- per latest genealogy of elephants -- a well-respected and stable patriarchy. Hadoop 1.x had 8 releases altogether in 14 months:
  • 1.0.0 released on Dec 12, 2011
  • 1.1.2 released on Feb 15, 2013
Table2: JIRAs committed to Hadoop between 1.0.0 and 1.1.2 releases
HADOOP       110
HDFS         111
MAPREDUCE     84
That's about five times fewer fixes and improvements than what went into Hadoop 2.x over roughly the same period. If frequency of change is any indication of stability, then perhaps we are onto something.
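For what it's worth, the totals behind the "about 1,500" and "five times" figures check out (numbers copied straight from Tables 1 and 2):

```python
# JIRA counts from Table 1 (2.0.0-alpha .. 2.0.4-alpha)
hadoop2 = {"HADOOP": 383, "HDFS": 801, "MAPREDUCE": 219, "YARN": 138}
# JIRA counts from Table 2 (1.0.0 .. 1.1.2)
hadoop1 = {"HADOOP": 110, "HDFS": 111, "MAPREDUCE": 84}

total2, total1 = sum(hadoop2.values()), sum(hadoop1.values())
print(total2, total1, total2 / total1)  # -> 1541 305 (ratio ~5.05)
```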

For the sake of full disclosure, here are similar statistics for Hadoop 0.23.x. There were 8 "dot" releases between Nov 1, 2011 (0.23.0) and Apr 16, 2013 (0.23.7).
Table3: JIRAs committed to Hadoop between 0.23.0 and 0.23.7 releases
HADOOP       514
HDFS         687
MAPREDUCE   1240
YARN          92[2]

"Wow," one might say, "no wonder the 'alpha' tag has been so sticky!" Users definitely want to know if the core platform is turbulent and unstable. But wait... wasn't there that commercial release that happened a month after the first OSS alpha? If it was more stable than the official public alpha, then why did it take the latter another five releases and 1,500 commits to get where it is today? Why wasn't the stabilization simply contributed back to the community? Or, if both were of the same high quality to begin with, then why is the public Hadoop 2.x still wearing the "alpha" tag one year later?

Before moving any further: all 13 releases -- for 1.x and 2.x --  were managed by engineers from Hortonworks. Tipping my hat to those guys and all contributors to the code!

So, is Hadoop 2 that unstable after all? In the second part of this article I will dig into the technical merits of the new development line so we can decide for ourselves. To be continued...

References:
[1] All release info is available from official ASF Hadoop release page
[2] First appeared in release 0.23.3 

Friday, April 19, 2013

On coming fragmentation of Hadoop platform


I just read this interview with the CEO of HortonWorks, in which he expresses a fear of Hadoop fragmentation. He calls attention to a valid issue in the Hadoop ecosystem: forking is getting to the point where the product space is likely to get fragmented.


So why should the BigTop community bother? Well, for one, Hadoop is the core upstream component of the BigTop stack. By filling this unique position, it has a profound effect on downstream consumers such as HBase, Oozie, etc. Although projects like Hive and Pig can partially avoid potential harm by statically linking with Hadoop binaries, this isn't a solution for any sane integration approach. As a side note: I am especially thrilled by Hive's way of working around multiple incompatibilities in the MR job submission protocol. The protocol has been naturally evolving for quite some time, and no one could even have guaranteed compatibility in versions like 0.19 or 0.20. Anyway, Hive solved the problem by simply generating a job jar, constructing a launch string and then - you got it already, right? - System.exec()'ing the whole thing. On a separate JVM, that is! Don't believe me? Go check the source code yourself.


Anecdotal evidence aside, there's a real threat of fracturing the platform. And there's no good reason for doing so even if you're incredibly selfish, or stupid, or want to monopolize the market. Which, by the way, doesn't work for objective reasons even with so-called "IP protection" laws in place. But that's a topic for another day.


So, what's HortonWorks’ answer to the problem? Here it comes:

Amid current Hadoop developments---is there any company NOT launching a distribution with some value added software?---Hortonworks stands out. Why? Hortonworks turns over its entire distribution to the Apache open source project.
While it is absolutely necessary for any human endeavor to be collaborative in order to succeed, the open source niche might be a tricky one. There are literally no incentives for all players to play by the book, and there's always that one very bold guy who might say, "Screw you guys, I’m going home," because he is just... you know...


Where could these incentives come from? How can we be sure that every new release is satisfactory for everyone's consumption? How do we guarantee that HBase’s St.Ack and friends won't be spending their next weekend trying to fix HBase when it loses its marbles because of that tricky change in Hadoop’s behavior?


And here comes a hint of an answer:

We're building directly in the core trunk, productizing the package, doing QA and releasing.

I have a couple of issues with this statement. But first, a spoiler alert: I am not going to attack either HortonWorks or their CEO. I don't have a chip on my shoulder -- not even an ARM one. I am trying to demonstrate the fallacy in the logic and show what doesn't work and why. And now, here's the laundry list:

  • "building directly in the core trunk": Hadoop isn't released from the trunk. This is a headache. And this is one of the issues that the BigTop community faced during the most recent stabilization exercise for the Hadoop 2.0.4-alpha release. Why's that a problem? Well, for one, there's a policy that "everything should go through the trunk". It means -- in context of Hadoop’s current state -- that you have to first commit to the trunk, then back-port to branch-2, which is supposed to be the landing ground for all Hadoop 2.x releases, just like branch-1 is the landing ground for all Hadoop 1.x releases. If it so happens that there's an active release(s) happening at the moment, one would need to back-port the commit to another release branch(es), such as 2.0.4-alpha in this particular example. Mutatis mutandis, some of the changes are reaching only about 2/3 of the way down. Best-case scenario. This approach also gives fertile ground to all "proponents" of open-source Hadoop because once their patches are committed to the trunk, they are as open-source as the next guy. They might get released in a couple of years, but hey -- what's a few months between friends, right?
  • "productizing the package": is Mr. Bearden aware of when development artifacts for an ongoing Hadoop release were last published in the open? ‘Cause I don't know of a publication of any such thing to date. Neither does Google, by the way. Even the official source tarballs weren't available until, like, 3 weeks ago. Why does that constitute a problem? How do you expect to perform any reasonable integration validation if you don't have an official snapshot of the platform? Once your platform package is "productized", it is a day late to pull your hair out. If you happen to find some issues -- come back later. At the next release, perhaps?
  • "doing QA and releasing": we are trying to build an open-source community here, right? Meaning that the code, the tests and their results, the bug reports, and the discussions should be in the open. The only place where the Hadoop ecosystem is being tested at any reasonable length and depth is BigTop. Read here for yourself. And feel free to check the regular builds and test runs for _all_ the components that BigTop releases, for both secured and non-secured configurations. What are you testing with, and how, Mr. Bearden?

So, what was the solution? Did I miss it in the article? I don't think so. Because a single player -- even one as respected as HortonWorks -- can't solve the issue in question without ensuring that anything produced by the Hadoop project's developers is always in line with the expectations of downstream players.


That's how you prevent fracturing: by putting in the open a solid and well-integrated reference implementation of the stack - one that can be installed by anyone using open-standard packaging and loaded with third-party applications without tweaking them every time you go from Cloudera's cluster to MapR's. Or another pair of vendors’. Does it sound like I am against making money in open-source software? Not at all: most people in the OSS community do this on the dime of their employers or as part of their own business.


You can consider BigTop's role in the Hadoop centric environment to be similar to that of Debian in the Linux kernel/distribution ecosystem. By helping to close the gap between the applications and the fast-moving core of the stack, BigTop essentially brings reassurance of the Hadoop 2.x line's stability into the user space and community. BigTop helps to make sure that vendor products are compatible with each other and with the rest of the world; to avoid vendor lock-in and to guarantee that recent Microsoft stories will not be replayed all over again.


Are there means to achieve the goal of keeping the core contained? Certainly! BigTop does just that. Recent announcements from Intel, Pivotal, and WANdisco are living proof of it: they are all using BigTop as the integration framework and consolidation point. Can these vendors deviate even under such a top-level integration system? Sure. But it will be immensely harder to do.

Acceptance of open Hadoop stack: role of BigTop

I have just posted this article on the ASF blog roller, elaborating on why BigTop is becoming a centerpiece of integration for the Hadoop-based data analytics stack. Enjoy.

Thursday, April 18, 2013

Dealing with noisy fan of Lenovo ThinkPad T430

I recently got my shiny new ThinkPad T430, beefed up with 16GB of RAM, a 180GB SSD, a 1TB extra disk, and many other good things. I am really enjoying it and can run a multinode virtual Hadoop cluster while developing something in my favorite IntelliJ IDEA. Did I mention already that I feel sorry for Apple PowerBook users?

Anyway, I was unlucky enough to get a machine with a faulty fan. It has been widely reported that, due to some QC issues at Lenovo or whatnot, a number of these laptops (T420 and T430 alike) come with a fan that has an especially maddening pitch at around 3200-3400 rpm. It is one of those sounds that drills into your skull and drives you nuts within about 17 seconds.

As if that weren't bad enough by itself, the BIOS sets the fan to that speed whenever the temperature goes anywhere above 36 degrees C. If you keep the laptop on your lap - which is where you would expect to have it, judging by the name - you're doomed.

There are lots of complaints from people about this, and attempts to solve it with clumsy scripts written in Python, etc. However, the working solution was under my nose pretty much all that time, waiting for me on the Gentoo Wiki.

It also works on Ubuntu 12.04 like a charm. The only modification you need to be aware of on Ubuntu is how to add thinkfan to your machine's startup sequence. You need to run
sudo update-rc.d thinkfan defaults
instead of what the page above suggests. It is so quiet here again!
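For reference, the heart of the fix is the fan-speed map in /etc/thinkfan.conf. The values below are purely illustrative - tune the thresholds to your own unit, and the sensor path is an assumption that varies between kernels - but the shape of the file is this (each tuple is fan level, low temperature, high temperature in degrees C):

```
# /etc/thinkfan.conf - illustrative values only; tune for your machine.
# Fan control must be enabled first, e.g.:
#   echo "options thinkpad_acpi fan_control=1" > /etc/modprobe.d/thinkfan.conf
sensor /sys/devices/platform/coretemp.0/temp1_input

(0,  0, 55)    # fan off while cool
(1, 48, 60)
(2, 50, 62)
(4, 54, 66)
(7, 60, 32767) # full blast when hot
```

The point of the level map is to keep the fan out of the maddening 3200-3400 rpm band for as long as the temperatures allow.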

Update: in all fairness, I eventually had to replace the fan, because it was still too noisy at the above speeds, although it didn't get engaged too often. Now it is completely silent ;)

Thursday, March 28, 2013

Another Hadoop... err elephant family...

Oh, they are so adorable. I wish every elephant family were like that!


Friday, March 1, 2013

Shameless open source's assassin at Strata 2013: Microsoft

The first thing you see entering the Strata 2013 exhibition is a hugely tasteless Microsoft booth sitting in the premium spot of the floor, right next to the front doors. Later update: don't believe me? Check this
This year the BigData conference was largely focused on Hadoop and downstream technologies: HBase, Spark, and so on. As most of you know, this technological niche stands on two open source whales: Java/JVM and Linux. Now, unless you spent the last 15 years on Mars, you know that there's no love lost between Microsoft and the open source community. And don't hurry to blame open source for this endless assault. Check just a few results from a Google search, in rough chronological order from 1998 to 2012:
And there is plenty of other evidence that you can easily dig up by just scrolling through the roughly 56,000,000 articles found by Google.

Some apologists might say, "Microsoft has changed lately". Really? Banning Linux from booting on ARM-based Windows 8 devices is just an honest mistake then, I presume.

Perhaps "It is their hardware and they can do whatever they are pleased" other pin-heads might say. Do I need to answer this? How about me paying for the hardware? Do I own it now? Or am I just leasing it from dudes up in Redmond, WA? Do I need to go down to basic economical explanations about natural law?

And after all these heinous things, they come into the midst of an open source celebration as if nothing happened. They even sponsored the lunch on the last day of the conference. I think people who ate that poisoned stuff might find themselves enslaved to Microsoft via some kind of EULA or something.

But joking aside, do these guys have a nerve or what? Maybe MS thinks it is now king of the hill just by virtue of hiring, for the legwork, a certain startup whose founders did a lot of the initial Hadoop work? Judge for yourself.

Thursday, February 28, 2013

We just invented a new game: "Whack a Hadoop namenode"





I just came back from the Strata 2013 BigData conference. A pretty interesting event, considering that the Hadoop wars are apparently over. That doesn't mean the battlefield is calm. On the contrary!


But this year's war banner is different. Now it seems to be about Hadoop stack distributions. If only I had some artistic talent, the famous
would be saying something like "Check out how big my Hadoop distro is!"

But judge for yourself: WANdisco announced their WDD about 4 weeks ago, followed yesterday by the Intel and Greenplum press releases. WDD has some uniquely cool stuff in it, like the non-stop namenode - the only 'active-active' technology for Namenode metadata replication on the market, based on a full implementation of the Paxos algorithm.

And I was having fun during the conference too: we were playing a game of 'whack-a-namenode'. The setup includes a rack of Supermicro blade servers running a WDD cluster with three active namenodes.
While a stock TeraSort load is running, one of the namenodes is killed dead with SIGKILL. Amazingly, TeraSort couldn't care less and just keeps going without a wince. We played about 100 rounds of this "game" over the course of two days using the live product, with people dropping by all the time to watch.
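The "whack" itself is nothing fancy - just a SIGKILL aimed at one of the NameNode JVMs. Here is a minimal sketch of the pattern, demonstrated against a stand-in process rather than a live cluster (on a real node you would pgrep for the NameNode class instead):

```shell
# Stand-in for a NameNode JVM; on a cluster this would be something like
#   VICTIM=$(pgrep -f 'hdfs.server.namenode.NameNode' | head -1)
sleep 300 &
VICTIM=$!

# Kill it dead - SIGKILL gives the process no chance to clean up.
kill -9 "$VICTIM"
wait "$VICTIM" 2>/dev/null

# The shell reports 128 + signal number for a killed child, i.e. 137 for SIGKILL.
echo "exit status: $?"
```

In the demo, of course, the interesting part is not the kill but the fact that the TeraSort job keeps running against the surviving namenodes.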

Looks like it isn't easy to whack an HDFS cluster anymore.

And nice folks from SiliconAngle and WikiBon stopped by our booth to do an interview with me and my colleagues. Enjoy ;)

Sunday, February 24, 2013

Rooting JB Asus Transformer TF700

I didn't expect this to be so hard, really. Rooting is usually a pretty straightforward process that can be done quickly. However, considering the amount of issues around upgrading an ICS tablet to JB, with or without rooting it first, and the number of posts on the xda-developers.com forum that refer to the same partially outdated recovery images, it wasn't easy. Anyway, here's the easiest way to root your stock JB Transformer device:

Prerequisites:
  # I am doing everything on Linux
  # You need to either install the Android SDK or get the adb and fastboot tools for your distribution

1. Push CWM-SuperSU to the device
    % adb push SuperSU.zip /sdcard
2. Download and unpack ClockWorkMod Recovery v6.0.1.4
    rename recovery_jb.img to recovery.img
3. Boot your device to fastboot mode
   % adb reboot bootloader
Use Vol- to scroll to the USB icon (fastboot mode); use Vol+ to select
4. Unlock the bootloader using UnLock_Device_App_V7.apk (google for the file). Alternatively, you should be able to use
  % fastboot -i <VendorID> oem unlock
(please note that you might be better off running fastboot as the root user). The ASUS VendorID is 0x0B05. To find out the id for your device, use lsusb.
5. Flash recovery.img to your device and reboot
  % fastboot -i 0x0B05 flash recovery recovery.img
  % fastboot -i 0x0B05 reboot
6. Boot into fastboot mode again (as in step 3 above) and enter Recovery mode
7. Install SuperSU from the zip file on sdcard
8. Reboot once again
9. Install RootChecker from Google Play and make sure your device is rooted.

Enjoy!

Friday, February 22, 2013

Multi-node Hadoop cluster on a single host

If you are running Hadoop for experimental or other purposes, you might face the need to quickly spawn a 'poor man's Hadoop': a cluster with multiple nodes within the same physical or virtual box. A typical use case would be working on your laptop without access to the company's data center; another is running low on the credit card, so you can't pay for some EC2 instances.

Stop right here if you are well-versed in the Hadoop development environment, tarballs, maven, and all those shenanigans. Otherwise, keep on reading...

I will be describing a Hadoop cluster installation using standard Unix packages like .deb or .rpm, produced by the great Hadoop stack platform called Bigtop. If you aren't familiar with Bigtop yet - read about its history and conceptual ideas.

Let's assume you have installed the Bigtop 0.5.0 release (or a part of it). Or you might go ahead - shameless plug warning - and use a free offspring of Bigtop just introduced by WANdisco. Either way, you'll end up having the following structure:
/etc/hadoop/conf
/etc/init.d/hadoop*
/usr/lib/hadoop
/usr/lib/hadoop-hdfs
/usr/lib/hadoop-yarn
Your mileage might vary if you install more components besides Hadoop. A normal bootstrap process will start a Namenode, a Datanode, perhaps a SecondaryNamenode, and some YARN jazz like the resource manager, node manager, etc. My example will cover only the HDFS specifics, because the YARN side would be a copy-cat, and I leave it as an exercise for the readers.

Now, the trick is to add more Datanodes. With a dev setup using tarballs and such, you would just clone the config, change some configuration parameters, and then run a bunch of Java processes like:
  hadoop-daemon.sh --config <cloned config dir> start datanode

This won't work in the case of a packaged installation, because of the higher level of complexity involved. This is what needs to be done:
  1. Clone the config directory: cp -r /etc/hadoop/conf /etc/hadoop/conf.dn2
  2. In the cloned copy of hdfs-site.xml, change or add new values for:
       dfs.datanode.data.dir
       dfs.datanode.address
       dfs.datanode.http.address
       dfs.datanode.ipc.address

     (An easy way to mod the port numbers is to add 1000*<node number> to the default value. So, port 50020 will become 52020, etc.)
  3. Go to /etc/init.d and clone hadoop-hdfs-datanode to hadoop-hdfs-datanode.dn2
  4. In the cloned init script add the following
       export HADOOP_PID_DIR="/var/run/hadoop-hdfs.dn2"

     and modify
       CONF_DIR="/etc/hadoop/conf.dn2"
       PIDFILE="/var/run/hadoop-hdfs.dn2/hadoop-hdfs-datanode.pid"
       LOCKFILE="$LOCKDIR/hadoop-datanode.dn2"

  5. Create the dfs.datanode.data.dir directory and make hdfs:hdfs the owner of it
  6. Run /etc/init.d/hadoop-hdfs-datanode.dn2 start to fire up the second datanode
  7. Repeat steps 1 through 6 if you need more nodes running.
  8. If you need to do this on a regular basis - spare yourself the carpal tunnel and learn Puppet.
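The config-cloning part of the steps above can be sketched as a shell script. The version below runs against a scratch directory instead of the real /etc/hadoop, so it is safe to try anywhere; the stripped-down hdfs-site.xml and the sed-based port bump are illustrative stand-ins for editing the real file by hand:

```shell
set -e

# Work in a scratch area instead of /etc/hadoop so this is safe to try.
BASE=$(mktemp -d)
mkdir -p "$BASE/conf"

# A stripped-down hdfs-site.xml with the default datanode ports.
cat > "$BASE/conf/hdfs-site.xml" <<'EOF'
<configuration>
  <property><name>dfs.datanode.address</name><value>0.0.0.0:50010</value></property>
  <property><name>dfs.datanode.http.address</name><value>0.0.0.0:50075</value></property>
  <property><name>dfs.datanode.ipc.address</name><value>0.0.0.0:50020</value></property>
</configuration>
EOF

# Step 1: clone the config directory for the second datanode.
cp -r "$BASE/conf" "$BASE/conf.dn2"

# Step 2: bump the ports by 1000 * <node number> (node 2 -> +2000).
sed -i 's/:50010/:52010/; s/:50075/:52075/; s/:50020/:52020/' \
    "$BASE/conf.dn2/hdfs-site.xml"

grep ':52' "$BASE/conf.dn2/hdfs-site.xml"
```

The same cloning idea carries over to the init script: copy it, point CONF_DIR at the new directory, and give the PID and lock files their own .dn2 names so the two datanode instances don't trip over each other.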
Check the logs/HDFS UI/running Java processes to make sure that you have achieved what you needed. Don't try this unless your box has a sufficient amount of memory and CPU power. Enjoy!

Wednesday, February 6, 2013

One more Hadoop in the family!

Indeed "may you live in interesting times". Not so long ago I posted the update of my elephants genealogy and it seems to be outdated already. Oh, well - I guess it is an exciting thing to be bothered with - because I love all kinds of elephants ;)

This is the birth of another Apache Hadoop brother! The young dude has definitely been born with a silver spoon in his mouth. It is called Active/Active replication of the NameNode - the very first in the world, to my limited knowledge of the matter. Pretty cool, eh?

WANdisco is releasing their certified version of Hadoop as the base of their own BigData distribution, called WDD. Hence, I need to update the tree again.

And congratulations on the release, guys - the more the merrier!


Saturday, January 5, 2013

Worst boss contest (2012 edition)

Another The-End-of-The-World Year is behind us, and it is time for my annual "High Tech Worst Manager of the Year" contest.

Fortunately, only a single contestant was enrolled in the competition in 2012. And of course he won - again - hands down!

Ladies and gentlemen, the winner: Jolly Chen, Cloudera, CA - taking the category of "The Worst Boss" for 2 years straight!

I wonder why even being so close to noble elephants (Hadoop and all) doesn't help to improve the winner's jolly personality.

Friday, January 4, 2013

What you wanted to know about Hadoop, but were too afraid to ask: genealogy of elephants (New Year 2013).

About a year ago I published the original genealogy of elephants. A number of changes have happened in the field since then, and, as some of my readers have pointed out, the original diagram became stale. I am certainly thankful for the feedback, and I am publishing the new version today. It sounds like a good New Year's resolution too ;)

As you can see, there are quite a few changes in the tree. Say, CDH4 looks like the child of promiscuous parents - not for the first time, apparently - as it inherits characteristics of H1 and H2 at the same time (MR1, if you are wondering); and there are some other changes as well.

Ladies and gentlemen: Genealogy of Elephants II. 

A dear friend of mine's college application...

Tell us about a personal quality, talent, accomplishment, contribution or experience that is important to you. What about this quality or accomplishment makes you proud, and how does it relate to the person you are:
I am a Slavic immigrant often mistaken for an Australian, even when I leave my kangaroo at home. I am a true philanthropist, as any of my sponsors will tell you. Rooms overflow with my kindness, and then I clean them because of how kind I am. Thanks to me, the fifth dentist now also recommends Crest. I once single-handedly won an arm wrestling match. I type 1000 words per minute, and the result makes sense.

My concern with the environment is so great that beavers think twice before felling trees when I am near. I am brave enough to admit that pollution is a problem, which is why you will see me wearing a mask when I take my Hummer out for a drive. My desire to promote diversity and help people feel included is so great that I give tourists directions even when I do not understand what they are asking. Incidentally, I believe that racial intolerance is absolutely unacceptable, and some nations' inability to grasp this is outrageous.    

At night I dress up in clothes I have designed and fashioned and go to fight crime. If I witness crime during the day, I wait until nighttime to confront the culprit at hand, for what would this world come to without punctuality and order? My band's chosen genre is percussive maintenance, and we regularly hold charity gigs at Apple stores. I refrain from slapping people who do not know how to properly use the word “whom,” and I do not sing on the bus.

I take inflation into consideration when giving to the beggars I pass on my way to work every day. I have convinced several skunk families and one hippie to use deodorant. I did not accept the reward Waldo's family offered when I found him hiding in his closet and persuaded him to come out. When he did, I did not judge him. I can slow down time itself – especially in maths class. I have organized a volunteer group for teaching children fire safety. I have often paid people's emergency room bills for them. When my friends come to me with their problems, I encourage them to do what I do not doubt is right. I am someone who can admit to being wrong, even when I was convinced otherwise. I learn from my mistakes every day, multiple times a day. Some say I am an ideal role model for the new generation.

But, most importantly, I am modest, and that is my defining quality.

Certainly, it deserves a scholarship ...
