Tuesday, March 3, 2015

Schema-on-read or schema-on-write?

I was recently asked whether schema-on-read is superior to schema-on-write, and how that relates to "traditional" storage systems like EMC, NetApp, and Teradata seemingly losing ground to commodity-based storage systems. Here are some semi-random thoughts.

First of all, I think schema-on-read/schema-on-write is a fancy way of saying whether the data is stored in an unstructured or a structured way. It all boils down to whether there's a need to store unaltered data or not. If statistics teaches us anything at all, it is that by creatively selecting a subset of data, or by making changes to a data sample or to the model itself, you can prove a correlation between anything imaginable ;) Hence, there are clear benefits to keeping data 'as-is': without any pre-processing, cleaning, dedup'ing, and so on. It allows you to run different models or apply alternative approximations later.
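
To make the distinction concrete, here is a minimal sketch in Python. Everything in it (the SCHEMA dict, the two store classes, the field names) is made up for illustration and isn't taken from any real product: one store coerces records to a schema before keeping them, the other keeps the raw lines untouched and only imposes a schema when somebody reads.

```python
import json

# Hypothetical schema for illustration: field name -> type to coerce to.
SCHEMA = {"user_id": int, "amount": float}


def apply_schema(record, schema=SCHEMA):
    """Coerce a raw dict to the schema; raises on missing or malformed fields."""
    return {name: cast(record[name]) for name, cast in schema.items()}


class SchemaOnWriteStore:
    """Validates on the way in; only structured rows are kept."""

    def __init__(self):
        self.rows = []

    def write(self, raw_line):
        # Bad input is rejected here, and extra fields are dropped for good.
        self.rows.append(apply_schema(json.loads(raw_line)))

    def read(self):
        return list(self.rows)


class SchemaOnReadStore:
    """Keeps every line unaltered; structure is imposed only when reading."""

    def __init__(self):
        self.lines = []

    def write(self, raw_line):
        self.lines.append(raw_line)

    def read(self, schema=SCHEMA):
        rows = []
        for line in self.lines:
            try:
                rows.append(apply_schema(json.loads(line), schema))
            except (KeyError, ValueError, TypeError):
                continue  # this particular reader skips what it cannot interpret
        return rows
```

The read-side store can be asked the same question again with a different schema, because nothing was thrown away on the way in; the write-side store gives you clean rows up front, but whatever it dropped is gone.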

It might appear that the schema-on-read approach is always superior, but there are plenty of cases where it isn't so. All sorts of scientific, engineering, financial, medical, and accounting systems will still enjoy the benefits of data structuring for years to come. And of course, there are good cases for the opposite, unstructured way of storage: marketing, social studies, economic modeling (which is utter nonsense, of course, but people still believe in it for some reason), and so on.

I don't think that "schema-on-write" technology is inherently that much more expensive. In the overall order of things, one might save some on the pre-processing stage by using commodity hardware and open software, but will have to pay more in direct and indirect costs related to more expensive and slower BA & BI solutions.

For all I know, we might be witnessing an end game for EMC/NetApp & co., but not because of the way they pre-process the data before storing it. Their real challenge is the huge change in the software development landscape that has happened over the last 20+ years with GNU, Linux, ASF and other free and open software models. No doubt, these companies have well-developed sales channels and established brands, but it is almost impossible to outsell something that anyone can download from the net for the cost of the bandwidth, and get up and running in a matter of hours or even faster. And there's a whole spectrum of such open systems, so you don't have to lock yourself into any one of them. Now, go and compete with that!

Monday, February 23, 2015

Warning [Rant]: YAML is an incredible piece of turd

I spent, nay wasted, an hour of my time today trying to figure out the reason for the following error message from Puppet Hiera:

vmhost05-hbase3: Error: syntax error on line 30, col -1: `' at /root/bigtop/bigtop-deploy/puppet/manifests/site.pp:17 on node ....
The relevant part of the Hiera site.yaml file is

bigtop::bigtop_yumrepo_uri:  ""
bigtop::jdk_package_name: 'jdk-1.7.0_55'

Firstly, as a former compiler developer, it hurts every bit of my brain when I see an error message like the one above. Huge "compliment" to the Hiera developers - learn how to write code, dammit.

Secondly, after investigating this for literally an hour, I figured out that the separator in uri:  "http was a TAB (ASCII 9) instead of a space.

Seriously, dudes - it's the 21st century. What's the reason to use formats and parsers that fail so badly on separator terminals? Just imagine if a Java or Groovy compiler were so picky about tabs vs. spaces. I guarantee half of the development community would be screaming bloody murder right there. Yet with the frigging YAML POS it is just ok ;(
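
If you'd rather not lose an hour the same way, a few lines of Python will point at every TAB hiding in a file. This is just a sketch: find_tabs is a name I made up, and it flags every TAB it sees without trying to decide whether a given YAML parser would actually choke on it.

```python
def find_tabs(text):
    """Return (line_number, column) pairs, both 1-based, for every TAB in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\t":
                hits.append((line_no, col))
    return hits
```

Feed it the contents of the suspect file, e.g. find_tabs(open("site.yaml").read()), and go look at the reported positions.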


Monday, December 22, 2014

How to mount RAID1 volume on Ubuntu

If you ever need to mount an encrypted partition from a RAID1 NAS on your Ubuntu system (like a laptop or a different server), here's a simple three-step instruction. First, figure out which partition needs to be mounted (you can do it by running parted or similar); for the sake of the example it will be /dev/sdd2. And now:

% sudo mdadm --assemble --run /dev/md0 /dev/sdd2
% sudo cryptsetup -v luksOpen /dev/md0 mapperpoint
% sudo mount /dev/mapper/mapperpoint /mnt/external/

If you need to check the state of the drive while it is connected via a USB enclosure, run
% sudo smartctl -aH -d sat /dev/sdd
The only trick is to add the -d sat device type.

Or, to simplify the whole thing, just run Disk Utility and click the "Start RAID" button ;)

Friday, November 28, 2014

Finally upgrading from Debian Lenny

If you, like me, were putting off an upgrade from Debian 5.0 Lenny, you might find yourself blocked out, because the Lenny dist doesn't exist anymore anywhere on the US mirrors. However, I needed to do one last update before getting on with dist-upgrade.

Luckily enough, I was able to find a mirror in Germany which still has the Lenny dist around. So, if you find yourself in my shoes, edit /etc/apt/sources.list on your system and replace

then do the usual update and then an upgrade. Good luck!

Monday, July 7, 2014

Mark Twain and data science

When I look at data science nowadays, it reminds me of
In the space of one hundred and seventy-six years the Mississippi has shortened itself two hundred and forty-two miles. Therefore ... in the Old Silurian Period the Mississippi River was upward of one million three hundred thousand miles long ... seven hundred and forty-two years from now the Mississippi will be only a mile and three-quarters long. ... There is something fascinating about science. One gets such wholesome returns of conjecture out of such a trifling investment of fact.
                 Mark Twain
I don't know why...

Sunday, June 15, 2014

Hadoop genealogy simplified

I have decided to simplify the elephant genealogy tree by separating the pre-Hadoop 2.x part out of it. The new supported version will reflect only Hadoop 2.x. The last updated full version of the diagram is available to anyone from my GitHub workspace under the tag WDD4.

Tuesday, April 22, 2014

Smart-ass recruiters out there....

A spam email from a recruiter (the target company name isn't mentioned _anywhere_):

   I came across your resume on Dice and would like to talk to you about
   a Hadoop Engineer position with our client, which is a Fortune
   10 company based in Cupertino, CA. This is the largest and most
   valuable consumer electronics/technology company in the world today
   which makes cutting edge smart phones, personal computers & music

Now I wonder which one of us is an idiot?