Sunday, October 5, 2014

Hadoop 1.0.3 Installation on Ubuntu

I will describe the required steps to install Hadoop 1.0.3 on a single node in pseudo-distributed mode. At the end of this post, you will be able to browse HDFS (Hadoop Distributed File System) and run MapReduce jobs on your single-node Hadoop cluster.

Hadoop 1.0.3 is used and will be installed on Ubuntu 12.04 LTS.

Prerequisites

1. Java

To install Hadoop, you need Java 1.5 (Java 5) or above. I will continue with Java 1.7.

> sudo apt-get install openjdk-7-jdk

If you have multiple Java versions installed on your machine, you can select the new Java 1.7 by typing:
> update-alternatives --config java
You can then select your desired java version.

You can check the current Java version by typing:
> java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.1) (7u65-2.5.1-4ubuntu1~0.12.04.2)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

2. SSH

Hadoop uses SSH to connect to and manage its nodes. This also applies to a single-node setup.

The OpenSSH client is included in Ubuntu by default, but you should also have the OpenSSH server installed.
> dpkg --get-selections | grep -v deinstall | grep openssh

If the list does not contain the client or the server, you can install the missing one:
> sudo apt-get install openssh-client
> sudo apt-get install openssh-server
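If you want to confirm that the SSH daemon is up before connecting, you can check its status (this assumes Ubuntu's standard service name, ssh):
> sudo service ssh status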

Now, you should connect to localhost:
> ssh localhost

This will ask for your password. Hadoop needs to establish this connection without a password.

To enable this:
Generate a public/private key pair:
> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Authorize the key by adding it to the list of authorized keys:
> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

You should now be able to connect without a password:
> ssh localhost

3. Disable IPv6

To disable IPv6, open /etc/sysctl.conf and add the following lines to the end of the file:

> vi /etc/sysctl.conf

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

You should restart your machine for changes to take effect.
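If you prefer not to reboot immediately, you can also reload the settings and verify them; a value of 1 means IPv6 is disabled (assuming the sysctl keys added above):
> sudo sysctl -p
> cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1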

4. Dedicated User for Hadoop

Although it is not necessary, you can create a dedicated user for Hadoop. This will help you separate Hadoop management from other applications.

Create a user named hadoopuser and assign it to a group named hadoopgroup. You can find more detail about creating users and groups in this post.
> sudo groupadd hadoopgroup
> sudo useradd hadoopuser -m
> sudo usermod -aG hadoopgroup hadoopuser
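If you will perform the remaining steps as this user, it is also convenient to give it a password and switch to it (optional):
> sudo passwd hadoopuser
> su - hadoopuser
Note that the passwordless SSH key from the prerequisites must belong to whichever user will run the Hadoop daemons.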

Installation

1. Get Hadoop

You can get your desired Hadoop version from the Apache download mirrors.
I will download Hadoop 1.0.3:
> cd /home/hadoopuser
> wget http://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz

Extract the Hadoop package under the home directory and give hadoopuser ownership of it:
> cd /home/hadoopuser
> sudo tar -xzvf hadoop-1.0.3.tar.gz
> sudo chown -R hadoopuser:hadoopgroup /home/hadoopuser/hadoop-1.0.3
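You can verify that the extracted directory is now owned by hadoopuser:
> ls -ld /home/hadoopuser/hadoop-1.0.3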

2. Set your environment variables

I will set the HADOOP_HOME environment variable. You can find more detail about setting environment variables in this post.

I will make HADOOP_HOME accessible system-wide, not per user. To do this, create a system_env.sh file under the /etc/profile.d folder:
> vi /etc/profile.d/system_env.sh

export HADOOP_HOME=/home/hadoopuser/hadoop-1.0.3
export PATH=$PATH:$HADOOP_HOME/bin
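Files under /etc/profile.d are sourced at login, so the variables will not appear in your current shell until you log in again. To pick them up immediately and check that the hadoop command is on your PATH, you can run:
> source /etc/profile.d/system_env.sh
> echo $HADOOP_HOME
> hadoop version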

3. Configuration

You can configure the following files as shown below as a starting point.

$HADOOP_HOME/conf/hadoop-env.sh

Set the JAVA_HOME variable in this file:
> vi /home/hadoopuser/hadoop-1.0.3/conf/hadoop-env.sh

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre
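The exact JVM directory depends on your architecture and Java package (for example, it is java-7-openjdk-i386 on 32-bit machines); you can list the installed JVMs to find the right path:
> ls /usr/lib/jvm/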

$HADOOP_HOME/conf/core-site.xml

<configuration>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:10000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoopuser/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
</configuration>
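The directory given in hadoop.tmp.dir must exist and be writable by hadoopuser before HDFS is formatted; assuming the path above, you can create it with:
> mkdir -p /home/hadoopuser/tmp
> sudo chown -R hadoopuser:hadoopgroup /home/hadoopuser/tmp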

$HADOOP_HOME/conf/mapred-site.xml

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:10001</value>
        <description>The host and port that the MapReduce job tracker runs
        at.  If "local", then jobs are run in-process as a single map
          and reduce task.
        </description>
    </property>
</configuration>

$HADOOP_HOME/conf/hdfs-site.xml

<configuration>
<property>
   <name>dfs.webhdfs.enabled</name>
   <value>true</value>
</property>
<property>
   <name>dfs.permissions</name>
   <value>false</value>
</property>
</configuration>
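Optionally, since a single-node cluster has only one DataNode to hold block replicas, you may also want to set the replication factor to 1 in the same file:

<property>
   <name>dfs.replication</name>
   <value>1</value>
</property>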

4. Format HDFS

Before using your Hadoop cluster, you should format HDFS. This is done once, when the cluster is first set up. If you format an existing HDFS, all data stored on it will be removed.
> hadoop namenode -format
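With hadoop.tmp.dir set as above, the NameNode metadata is written under the dfs/name subdirectory by default, so listing it is a quick way to confirm that the format succeeded (the path assumes the core-site.xml shown earlier):
> ls /home/hadoopuser/tmp/dfs/name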

5. Start the Cluster

Hadoop provides several control scripts that enable you to start and stop the Hadoop daemons.

To start the Hadoop cluster, run:
> /home/hadoopuser/hadoop-1.0.3/bin/start-all.sh

This will start all 5 daemons: NameNode, SecondaryNameNode, JobTracker, TaskTracker and DataNode. You can check whether these daemons are running by typing:
> jps
or
> ps aux | grep hadoop
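If everything started correctly, jps should list all five daemons, similar to the following (process ids are illustrative and will differ):
> jps
2287 NameNode
2422 DataNode
2563 SecondaryNameNode
2641 JobTracker
2789 TaskTracker
2850 Jps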

6. Explore the Cluster

Hadoop provides several web interfaces to monitor your cluster. You can browse these interfaces.
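With the default Hadoop 1.x ports, they are available at:
NameNode (HDFS) web UI: http://localhost:50070
JobTracker (MapReduce) web UI: http://localhost:50030
TaskTracker web UI: http://localhost:50060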