Wednesday, October 22, 2014

Hadoop Job Scheduling - Fair Scheduler

The default job scheduler for Hadoop uses a FIFO queue. The job submitted first gets all the resources it needs, which can make other jobs wait for a long time. To better utilize the resources of your cluster, you can change the Hadoop job scheduler.

This setup was tested with Hadoop 1.0.3 on Ubuntu 12.04.

1. Default Scheduler

The default scheduler uses a FIFO queue. This means that the job submitted first gets all the resources it needs, and jobs submitted later get the remaining resources until the cluster runs at full capacity. Jobs that do not get a chance to start have to wait for the running jobs to finish.

2. Why Need Another Scheduler?

Default job scheduling can hurt your applications in several ways.

  • You can have ad hoc queries, for example Hive queries, that are expected to finish in less than a minute.
  • You can have multiple report jobs, for example implemented as Pig/Hive scripts, that are waiting to be viewed by users.
  • You can have scheduled recurring jobs, for example Oozie workflows, that need to run every hour/day and update a data source in your system.
  • These jobs can be submitted by different users, and you may want to prioritize them according to the user.

With the default scheduler, these cases cannot be handled properly.

Hadoop lets us use different scheduling algorithms; we will look into two of them:

  • Fair Scheduler
  • Capacity Scheduler

These two schedulers propose slightly different solutions.

3. Fair Scheduler

In the fair scheduler, every user has a different pool of jobs. However, you do not have to assign a new pool to each user: when submitting a job, a queue name is specified (mapred.job.queue.name), and this queue name can also be used to form job pools.
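For example, a job can be directed to a pool by setting this queue name at submission time. A minimal sketch, using the examples jar that ships with Hadoop 1.0.3 and a made-up queue name "reports" (input and output are placeholder HDFS paths):

> hadoop jar $HADOOP_HOME/hadoop-examples-1.0.3.jar wordcount -Dmapred.job.queue.name=reports input output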

Cluster resources are fairly shared between these pools. User Hadooper1 can submit a single job and user Hadooper2 can submit many jobs; on average they still get equal shares of the cluster. Beyond this, custom pools can be created by defining minimum map and reduce slots and setting a weight for the pool.

Jobs in a pool can be scheduled using fair scheduling or first-in-first-out (FIFO) scheduling. If fair scheduling is selected for the jobs inside a pool, these jobs get an equal share of the pool's resources over time. Otherwise, the job submitted first is served first.

The fair scheduler also provides preemption of jobs in other pools. This means that if a pool does not get its minimum share (or half of its fair share) for a configurable time period, the scheduler can kill tasks in other pools to make room for it.
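These pool settings live in the allocation file (fair-scheduler.xml, see section 5). As a rough sketch, with a made-up pool name and made-up numbers (the full list of elements is in the fair scheduler documentation linked under Further Reading):

<?xml version="1.0"?>
<allocations>
  <pool name="reports">
    <!-- slots guaranteed to this pool -->
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <!-- this pool gets twice the share of a default pool -->
    <weight>2.0</weight>
    <!-- how jobs inside the pool are scheduled: fair or fifo -->
    <schedulingMode>fair</schedulingMode>
    <!-- seconds below the minimum share before tasks in other pools are preempted -->
    <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
  </pool>
</allocations>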

4. Capacity Scheduler

The capacity scheduler also aims to give users a fair amount of cluster resources. Jobs are organized into queues: each queue is guaranteed a share of the cluster capacity, and within a queue jobs are stored in FIFO order.

The fair scheduler gives the jobs inside a pool a fair share of that pool's resources (although a pool can also be configured as a FIFO queue). In contrast, the capacity scheduler always keeps the jobs in a queue in FIFO order, so the first submitted job can get all the resources allotted to that queue.
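Enabling the capacity scheduler is similar in spirit to enabling the fair scheduler (section 5). A rough sketch, assuming the hadoop-capacity-scheduler jar under HADOOP_HOME/lib is available and a made-up queue named reports; in mapred-site.xml:

<property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>
<property>
    <name>mapred.queue.names</name>
    <value>default,reports</value>
</property>

and in conf/capacity-scheduler.xml, the guaranteed share of the queue as a percentage of the cluster:

<property>
    <name>mapred.capacity-scheduler.queue.reports.capacity</name>
    <value>30</value>
</property>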

5. Enabling Fair Scheduler

Quick Setup:
To enable the fair scheduler in Hadoop 1.0.3, you just need to add the following to your mapred-site.xml.
The hadoop-fairscheduler-1.0.3.jar is already under the HADOOP_HOME/lib directory by default.

<property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
    <name>mapred.fairscheduler.poolnameproperty</name>
    <value>mapred.job.queue.name</value>
</property>

In this configuration, the map-reduce queue name is used as the pool name.
To view the fair scheduler's status page, browse to http://<jobtracker>:50030/scheduler (50030 is the JobTracker web UI port).

Advanced Setup:
You can customize the fair scheduler in two places:

  • HADOOP_CONF_DIR/mapred-site.xml
  • HADOOP_CONF_DIR/fair-scheduler.xml (the allocation file)
The Hadoop documentation provides a good explanation of these configuration parameters.
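For instance, two mapred-site.xml settings that are commonly changed are the location of the allocation file and the preemption switch; a sketch (the file path is just an example):

<property>
    <name>mapred.fairscheduler.allocation.file</name>
    <value>/home/hadoopuser/hadoop-1.0.3/conf/fair-scheduler.xml</value>
</property>
<property>
    <name>mapred.fairscheduler.preemption</name>
    <value>true</value>
</property>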

Further Reading

1. http://hadoop.apache.org/docs/r1.0.4/fair_scheduler.html
2. http://blog.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/









Hadoop Versions Explained

There is confusion around Hadoop versions. You can see versions like 0.20+, 1.0+ and 2.0+, and there are the old MR1 and the new MR2 with YARN. You can see the list of Hadoop releases on the Apache releases page, which spans from the 0.23+ releases to the 1.0+ and 2.0+ releases.

There are a few blog posts that try to explain the evolution of the Hadoop version tree. The most complete ones are here and here. The author has multiple posts that reflect new versions, and we can expect to see more updates on his blog. Other good posts that explain the history of the versions are here from Cloudera's blog and here from Hortonworks's blog.

To summarize the version tree:

  • Version 0.20.205 was renamed Hadoop 1.0.0, and the later 1.0+ releases continue from there. This line provides the old MR1 API.
  • Version 0.23 serves as the basis for the Hadoop 2.0+ releases. This line provides the new MR2 API and YARN.
  • Version 0.23 also continues to get releases in its own tree. As I understand it, this branch only gets point releases and does not implement new features.
I hope this helps clear up the confusion.







Sunday, October 5, 2014

Hadoop 1.0.3 Installation on Ubuntu

I will describe the steps required to install Hadoop 1.0.3 on a single node in pseudo-distributed mode. At the end of this post, you will be able to browse HDFS (the Hadoop Distributed File System) and run map-reduce jobs on your single-node Hadoop cluster.

Hadoop 1.0.3 is used and installed on Ubuntu 12.04 LTS.

Prerequisites

1. Java

To install Hadoop, you must have Java 1.5+ (Java 5 or above). I will continue with Java 1.7.

> sudo apt-get install openjdk-7-jdk

If you have multiple Java versions installed on your machine, you can switch to Java 1.7 by typing:
> update-alternatives --config java
You can then select your desired java version.

You can check the current Java version by typing:
> java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.1) (7u65-2.5.1-4ubuntu1~0.12.04.2)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

2. SSH

Hadoop uses SSH to connect to and manage its nodes. This also applies to a single-node setup.

The OpenSSH client is included in Ubuntu by default. However, you should also have the OpenSSH server installed. You can check which packages are present:
> dpkg --get-selections | grep -v deinstall | grep openssh

If the list does not contain the client or the server, you can install the missing one:
> sudo apt-get install openssh-client
> sudo apt-get install openssh-server

Now, you should connect to localhost:
> ssh localhost

This will ask for your password. Hadoop needs to be able to establish connections without entering a password.

To enable this:
Generate public and private keys
> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Authorize the key by adding it to the list of authorized keys
> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

You should now be able to connect without a password:
> ssh localhost

3. Disable IPv6

To disable IPv6, open /etc/sysctl.conf and add the following lines to the end of the file:

> vi /etc/sysctl.conf

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

You should restart your machine for changes to take effect.
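After the reboot, you can check whether IPv6 is really disabled; an output of 1 means it is off:

> cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1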

4. Dedicated User for Hadoop

Although it is not necessary, you can create a dedicated user for Hadoop. This will help you separate Hadoop management from other applications.

Create a user named hadoopuser and assign it to a group named hadoopgroup. You can get more detail about creating users and groups in this post.
> sudo groupadd hadoopgroup
> sudo useradd hadoopuser -m
> sudo usermod -aG hadoopgroup hadoopuser

Installation

1. Get Hadoop

You can get your desired Hadoop version from the Apache download mirrors. I will download Hadoop 1.0.3:
> cd /home/hadoopuser
> wget http://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/hadoop-1.0.3.tar.gz

Extract the Hadoop package under the home directory:
> cd /home/hadoopuser
> sudo tar -xzvf hadoop-1.0.3.tar.gz
> sudo chown -R hadoopuser:hadoopgroup /home/hadoopuser/hadoop-1.0.3

2. Set your environment variables

I will set the HADOOP_HOME environment variable. You can get more detail about setting environment variables in this post.

I will make HADOOP_HOME accessible system-wide, not per user. To do this, first create a system_env.sh under the /etc/profile.d folder and add the lines below:
> vi /etc/profile.d/system_env.sh

export HADOOP_HOME=/home/hadoopuser/hadoop-1.0.3
export PATH=$PATH:$HADOOP_HOME/bin
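Files under /etc/profile.d are sourced at login, so log out and back in (or source the file manually) and verify the setup:

> echo $HADOOP_HOME
/home/hadoopuser/hadoop-1.0.3
> hadoop version
Hadoop 1.0.3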

3. Configuration

You can use the following configuration files, as shown below, as a starting point.

$HADOOP_HOME/conf/hadoop-env.sh

Set JAVA_HOME variable in this file:
> vi /home/hadoopuser/hadoop-1.0.3/conf/hadoop-env.sh

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre

$HADOOP_HOME/conf/core-site.xml

<configuration>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:10000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoopuser/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
</configuration>

$HADOOP_HOME/conf/mapred-site.xml

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:10001</value>
        <description>The host and port that the MapReduce job tracker runs
        at.  If "local", then jobs are run in-process as a single map
          and reduce task.
        </description>
    </property>
</configuration>

$HADOOP_HOME/conf/hdfs-site.xml

<configuration>
<property>
   <name>dfs.webhdfs.enabled</name>
   <value>true</value>
</property>
<property>
   <name>dfs.permissions</name>
   <value>false</value>
</property>
</configuration>

4. Format HDFS

To start using your Hadoop cluster, you should first format HDFS. This is done once, when the cluster is first set up. If you format an existing HDFS, the data stored on it will be removed.
> hadoop namenode -format

5. Start the Cluster

Hadoop provides several control scripts that enable you to start and stop the Hadoop daemons.

To start Hadoop cluster, run:
> /home/hadoopuser/hadoop-1.0.3/bin/start-all.sh

This will start all 5 daemons: NameNode, SecondaryNameNode, JobTracker, TaskTracker and DataNode. You can check whether these daemons are running by typing:
> jps
or
> ps aux | grep hadoop

6. Explore the Cluster

Hadoop provides several web interfaces to monitor your cluster. You can browse these interfaces to inspect HDFS and the state of your map-reduce jobs.
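With the default ports of Hadoop 1.0.3, they are reachable on the local machine at:

  • http://localhost:50070 - NameNode web UI (browse HDFS, check DataNodes)
  • http://localhost:50030 - JobTracker web UI (running and completed map-reduce jobs)
  • http://localhost:50060 - TaskTracker web UI (tasks running on this node)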