Friday, February 22, 2013

Multi-nodes Hadoop cluster on a single host

If you are running Hadoop for experimental or other purposes, you might need to quickly spawn a 'poor man's Hadoop': a cluster with multiple nodes within the same physical or virtual box. A typical use case is working on your laptop without access to the company's data center; another is running low on the credit card, so you can't pay for some EC2 instances.

Stop right here if you are well-versed in the Hadoop development environment, tarballs, Maven, and all those shenanigans. Otherwise, keep on reading...

I will be describing a Hadoop cluster installation using standard Unix packaging like .deb or .rpm, produced by the great Hadoop stack platform called Bigtop. If you aren't familiar with Bigtop yet, read about its history and conceptual ideas.

Let's assume you installed the Bigtop 0.5.0 release (or a part of it). Or you might go ahead - shameless plug warning - and use a free offspring of Bigtop just introduced by WANdisco. Either way you'll end up with the following structure:
/etc/hadoop/conf
/etc/init.d/hadoop*
/usr/lib/hadoop
/usr/lib/hadoop-hdfs
/usr/lib/hadoop-yarn
Your mileage might vary if you install more components besides Hadoop. A normal bootstrap process will start a Namenode, a Datanode, perhaps a SecondaryNamenode, and some YARN jazz like the resource manager, node manager, etc. My example will cover only the HDFS specifics, because YARN's node manager would be a copy-cat, and I leave it as an exercise for the readers.
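
In case you need to (re)start that baseline by hand, the packages ship init scripts that can be driven through service. Below is a minimal sketch assuming the standard Bigtop service names (hadoop-hdfs-namenode, hadoop-hdfs-secondarynamenode, hadoop-hdfs-datanode, and the YARN counterparts); check /etc/init.d for what is actually installed on your box.

  # start the HDFS daemons of the stock single-node setup
  for svc in hadoop-hdfs-namenode hadoop-hdfs-secondarynamenode hadoop-hdfs-datanode; do
    sudo service $svc start
  done

  # and the YARN jazz, if you installed it
  sudo service hadoop-yarn-resourcemanager start
  sudo service hadoop-yarn-nodemanager start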

Now, the trick is to add more Datanodes. With a dev setup using tarballs and such, you would just clone and change some configuration parameters, and then run a bunch of Java processes like:
  hadoop-daemon.sh --config <cloned config dir> start datanode

This won't work in the case of a packaged installation because of the higher level of complexity involved. Here is what needs to be done (a consolidated shell sketch follows the list):
  1. Clone the config directory: cp -r /etc/hadoop/conf /etc/hadoop/conf.dn2
  2. In the cloned copy of hdfs-site.xml, change or add new values for:
       dfs.datanode.data.dir
       dfs.datanode.address
       dfs.datanode.http.address
       dfs.datanode.ipc.address

     (An easy way to modify the port numbers is to add 1000*<node number> to the default value. So, port 50020 will become 52020, etc.)
  3. Go to /etc/init.d and clone hadoop-hdfs-datanode into hadoop-hdfs-datanode.dn2
  4. In the cloned init script, add the following:
       export HADOOP_PID_DIR="/var/run/hadoop-hdfs.dn2"

     and modify
       CONF_DIR="/etc/hadoop/conf.dn2"
       PIDFILE="/var/run/hadoop-hdfs.dn2/hadoop-hdfs-datanode.pid"
       LOCKFILE="$LOCKDIR/hadoop-datanode.dn2"

  5. Create the directory pointed to by dfs.datanode.data.dir and make hdfs:hdfs its owner.
  6. Run /etc/init.d/hadoop-hdfs-datanode.dn2 start to fire up the second datanode.
  7. Repeat steps 1 through 6 if you need more nodes running.
  8. If you need to do this on a regular basis, spare yourself the carpal tunnel and learn Puppet.
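To spell the above out, here is a minimal shell sketch of steps 1 through 6 for a second datanode. The data directory path, the port values, and the extra PID directory handling are my own assumptions; the edits to hdfs-site.xml and to the cloned init script are left as comments, since their exact contents differ between releases.

  # hypothetical helper for node number 2; run as root or via sudo
  N=2

  # step 1: clone the config directory
  cp -r /etc/hadoop/conf /etc/hadoop/conf.dn$N

  # step 2: edit /etc/hadoop/conf.dn$N/hdfs-site.xml by hand, e.g.:
  #   dfs.datanode.data.dir      -> /data/dn2/hdfs/data  (example path)
  #   dfs.datanode.address       -> 0.0.0.0:52010
  #   dfs.datanode.http.address  -> 0.0.0.0:52075
  #   dfs.datanode.ipc.address   -> 0.0.0.0:52020

  # steps 3-4: clone the init script, then edit it as described above
  cp /etc/init.d/hadoop-hdfs-datanode /etc/init.d/hadoop-hdfs-datanode.dn$N
  #   add:    export HADOOP_PID_DIR="/var/run/hadoop-hdfs.dn$N"
  #   change: CONF_DIR, PIDFILE and LOCKFILE as shown in step 4

  # step 5: create the data dir and hand it over to the hdfs user
  mkdir -p /data/dn$N/hdfs/data
  chown -R hdfs:hdfs /data/dn$N/hdfs/data

  # the new PID dir may need to exist up front
  mkdir -p /var/run/hadoop-hdfs.dn$N
  chown hdfs:hdfs /var/run/hadoop-hdfs.dn$N

  # step 6: fire up the extra datanode
  /etc/init.d/hadoop-hdfs-datanode.dn$N start
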
Check the logs, the HDFS web UI, and the running Java processes to make sure you have achieved what you needed. Don't try this unless your box has a sufficient amount of memory and CPU power. Enjoy!
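
A few ways to verify, assuming the packaged layout keeps HDFS logs under /var/log/hadoop-hdfs and the namenode web UI on its default port 50070:

  sudo -u hdfs jps                          # one DataNode JVM per clone should show up
  sudo -u hdfs hdfs dfsadmin -report        # the number of live datanodes should match
  tail -f /var/log/hadoop-hdfs/*datanode*   # watch the new datanode register itself
  # and the namenode web UI: http://localhost:50070/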
