Friday, February 22, 2013

Multi-node Hadoop cluster on a single host

If you are running Hadoop for experimental or other purposes, you might face the need to quickly spin up a 'poor man's Hadoop': a cluster with multiple nodes within the same physical or virtual box. A typical use case is working on your laptop without access to the company's data center; another is running low on the credit card, so you can't pay for EC2 instances.

Stop right here if you are well-versed in the Hadoop development environment, tarballs, Maven, and all those shenanigans. Otherwise, keep on reading...

I will be describing a Hadoop cluster installation using standard Unix packaging like .deb or .rpm, produced by the great Hadoop platform stack called Bigtop. If you aren't familiar with Bigtop yet, read about its history and conceptual ideas.

Let's assume you have installed the Bigtop 0.5.0 release (or a part of it). Or you might go ahead - shameless plug warning - and use the free offspring of Bigtop just introduced by WANdisco. Either way, you'll end up with the following structure:
/etc/hadoop/conf
/etc/init.d/hadoop*
/usr/lib/hadoop
/usr/lib/hadoop-hdfs
/usr/lib/hadoop-yarn
Your mileage may vary if you install more components besides Hadoop. The normal bootstrap process will start a NameNode, a DataNode, perhaps a SecondaryNameNode, and some YARN jazz like the ResourceManager, NodeManager, etc. My example will cover only the HDFS specifics, because the YARN side would be a copy-cat and I leave it as an exercise for the readers.
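
For reference, starting the stock daemons on a Bigtop-packaged box looks roughly like this (the exact service names may vary with the Bigtop version and the components you installed):
  sudo service hadoop-hdfs-namenode start
  sudo service hadoop-hdfs-datanode start
  sudo service hadoop-hdfs-secondarynamenode start
  # and the YARN jazz, if installed:
  sudo service hadoop-yarn-resourcemanager start
  sudo service hadoop-yarn-nodemanager start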

Now, the trick is to add more DataNodes. With a dev setup using tarballs and such, you would just clone the configuration, change some parameters, and then run a bunch of Java processes like:
  hadoop-daemon.sh --config <cloned config dir> start datanode

This won't work in the case of a packaged installation because of the higher level of complexity involved. Here is what needs to be done (a consolidated shell sketch follows the list):
  1. Clone the config directory: cp -r /etc/hadoop/conf /etc/hadoop/conf.dn2
  2. In the cloned copy of hdfs-site.xml, change or add new values for:
       dfs.datanode.data.dir
       dfs.datanode.address
       dfs.datanode.http.address
       dfs.datanode.ipc.address

     (An easy way to modify the port numbers is to add 1000*<node number> to the default value, so port 50020 becomes 52020, etc.)
  3. Go to /etc/init.d and clone hadoop-hdfs-datanode (say, as hadoop-hdfs-datanode.dn2).
  4. In the cloned init script add the following:
       export HADOOP_PID_DIR="/var/run/hadoop-hdfs.dn2"

     and modify
       CONF_DIR="/etc/hadoop/conf.dn2"
       PIDFILE="/var/run/hadoop-hdfs.dn2/hadoop-hdfs-datanode.pid"
       LOCKFILE="$LOCKDIR/hadoop-datanode.dn2"

  5. Create the new dfs.datanode.data.dir and make hdfs:hdfs the owner of it.
  6. Run /etc/init.d/hadoop-hdfs-datanode.dn2 start to fire up the second datanode.
  7. Repeat steps 1 through 6 if you need more nodes running.
  8. If you need to do this on a regular basis, spare yourself the carpal tunnel and learn Puppet.
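
To make the recipe concrete, here is a minimal shell sketch of steps 1 through 6 for a second DataNode. It assumes the Bigtop layout shown above and the stock port numbers; the data directory /data/dn2 and the .dn2 suffix are just illustrative names, so adjust to taste.
  # 1. Clone the config directory
  sudo cp -r /etc/hadoop/conf /etc/hadoop/conf.dn2

  # 2. Edit /etc/hadoop/conf.dn2/hdfs-site.xml and set, for example:
  #      dfs.datanode.data.dir     -> file:///data/dn2  (illustrative path)
  #      dfs.datanode.address      -> 0.0.0.0:52010     (50010 + 1000*2)
  #      dfs.datanode.http.address -> 0.0.0.0:52075     (50075 + 1000*2)
  #      dfs.datanode.ipc.address  -> 0.0.0.0:52020     (50020 + 1000*2)

  # 3-4. Clone the init script, then apply the HADOOP_PID_DIR/CONF_DIR/PIDFILE/LOCKFILE
  #      changes from step 4 inside the copy
  sudo cp /etc/init.d/hadoop-hdfs-datanode /etc/init.d/hadoop-hdfs-datanode.dn2

  # 5. Create the data dir (and the PID dir implied by the PIDFILE setting),
  #    owned by hdfs:hdfs
  sudo mkdir -p /data/dn2 /var/run/hadoop-hdfs.dn2
  sudo chown -R hdfs:hdfs /data/dn2 /var/run/hadoop-hdfs.dn2

  # 6. Fire up the second DataNode
  sudo /etc/init.d/hadoop-hdfs-datanode.dn2 start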
Check the logs, the HDFS UI, and the running Java processes to make sure you have achieved what you needed. Don't try this unless your box has a sufficient amount of memory and CPU power. Enjoy!
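
A quick sanity check, assuming the standard client tools are on the PATH:
  # each DataNode should show up as a separate Java process
  ps aux | grep -i "[d]atanode"
  # the report should list every live DataNode with its own transfer address (50010, 52010, ...)
  sudo -u hdfs hdfs dfsadmin -report
  # or point a browser at the NameNode UI, usually http://localhost:50070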
