Wednesday, October 22, 2014

Hadoop Job Scheduling - Fair Scheduler

The default job scheduler for Hadoop uses a FIFO queue. The job submitted first gets all the resources it needs, which can make other jobs wait for a long time. To better utilize your cluster's resources, you can change the Hadoop job scheduler.

The setup below was tested with Hadoop 1.0.3 on Ubuntu 12.04.

1. Default Scheduler

The default scheduler uses a FIFO queue. This means that the job submitted first gets all the resources it needs, and jobs submitted later get the remaining resources until the cluster runs at full capacity. Jobs that do not get the chance to start have to wait for running jobs to finish.

2. Why Do You Need Another Scheduler?

Default job scheduling can hurt your applications in several ways.

  • You can have ad hoc queries, for example Hive queries, that are expected to finish in less than a minute.
  • You can have multiple report jobs, for example implemented as Pig/Hive scripts, that are waiting to be viewed by users.
  • You can have scheduled recurring jobs, for example Oozie workflows, that need to run every hour or day and update a data source in your system.
  • These jobs can be submitted by different users, and you may want to prioritize them according to the user.

With the default scheduler, these cases cannot be handled properly.

Hadoop supports pluggable scheduling algorithms; we will look into two of them:

  • Fair Scheduler
  • Capacity Scheduler

These two schedulers propose slightly different solutions.

3. Fair Scheduler

In the fair scheduler, every user has a separate pool of jobs. However, you don't have to assign a new pool to each user: when submitting a job, a queue name is specified (mapred.job.queue.name), and this queue name can also be used to create job pools.
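As a sketch of how a job ends up in a pool, the queue-name property can be set at submission time. The command below requires a running Hadoop 1.0.3 cluster; the pool name "reports" and the input/output paths are made-up examples.

```shell
# Submit the bundled wordcount example into the "reports" pool by
# setting the queue-name property that the scheduler maps to a pool.
hadoop jar $HADOOP_HOME/hadoop-examples-1.0.3.jar wordcount \
    -Dmapred.job.queue.name=reports \
    /user/hadooper1/input /user/hadooper1/output
```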

Cluster resources are fairly shared between these pools. User Hadooper1 can submit a single job and user Hadooper2 can submit many jobs; they get an equal share of the cluster on average. However, custom pools can be created by defining minimum map and reduce slots and setting a weight for the pool.
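As an illustration, such a custom pool might be declared in the allocation file (fair-scheduler.xml). The pool name "reports" and the slot counts below are made-up examples.

```xml
<?xml version="1.0"?>
<allocations>
  <!-- "reports" is a hypothetical pool name; jobs whose queue name
       maps to this pool are guaranteed the minimum slots below. -->
  <pool name="reports">
    <minMaps>10</minMaps>       <!-- guaranteed map slots -->
    <minReduces>5</minReduces>  <!-- guaranteed reduce slots -->
    <weight>2.0</weight>        <!-- twice the share of a default pool -->
  </pool>
</allocations>
```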

Jobs in a pool can be scheduled using fair scheduling or first-in-first-out (FIFO) scheduling. If fair scheduling is selected for the jobs inside a pool, these jobs get an equal share of resources over time; otherwise, the first submitted job is served first.
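The per-pool scheduling policy is chosen in the same allocation file; a minimal sketch, with "adhoc" as an illustrative pool name:

```xml
<?xml version="1.0"?>
<allocations>
  <pool name="adhoc">
    <!-- jobs inside this pool are served first-come, first-served -->
    <schedulingMode>fifo</schedulingMode>
  </pool>
  <!-- default mode for pools that do not specify one -->
  <defaultPoolSchedulingMode>fair</defaultPoolSchedulingMode>
</allocations>
```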

The fair scheduler supports preemption of jobs in other pools. This means that if a pool does not get half of its fair share for a configurable time period, it can kill tasks in other pools to reclaim capacity.
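Preemption timeouts are also set in fair-scheduler.xml (preemption itself must additionally be switched on via mapred.fairscheduler.preemption in mapred-site.xml). The pool name and timeout values below are arbitrary examples.

```xml
<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <minMaps>20</minMaps>
    <!-- preempt tasks elsewhere if this pool stays below its
         minimum share for 60 seconds -->
    <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
  </pool>
  <!-- preempt if any pool stays below half its fair share
       for 10 minutes -->
  <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
</allocations>
```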

4. Capacity Scheduler

The capacity scheduler gives users a fair amount of cluster resources. Each user is given a FIFO queue for storing jobs.

The fair scheduler gives a fair share of pool resources to the jobs inside a pool (although a pool can be configured as a FIFO queue). In contrast, the capacity scheduler maintains a FIFO queue for each user's jobs: the first submitted job can get all the resources given to that user.

5. Enabling Fair Scheduler

Quick Setup:
To enable the fair scheduler in Hadoop 1.0.3, you just need to add the following to your mapred-site.xml.
The hadoop-fairscheduler-1.0.3.jar is under the HADOOP_HOME/lib directory by default, so no extra jar needs to be installed.

<property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
    <name>mapred.fairscheduler.poolnameproperty</name>
    <value>mapred.job.queue.name</value>
</property>

In this configuration, the MapReduce queue name is used as the pool name.
To view the fair scheduler status, visit http://jobtracker:50030/scheduler (the scheduler page is served by the JobTracker web UI).

Advanced Setup:
You can customize fair scheduler in two places.

  • HADOOP_CONF_DIR/mapred-site.xml
  • HADOOP_CONF_DIR/fair-scheduler.xml (Allocation file)
The Hadoop documentation provides a good explanation of these configuration parameters.
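For example, mapred-site.xml can point the scheduler at a custom allocation-file location and turn on preemption; the file path below is illustrative.

```xml
<property>
    <name>mapred.fairscheduler.allocation.file</name>
    <value>/etc/hadoop/conf/fair-scheduler.xml</value>
</property>
<property>
    <!-- required for the preemption behavior described in section 3 -->
    <name>mapred.fairscheduler.preemption</name>
    <value>true</value>
</property>
```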

Further Reading

1. http://hadoop.apache.org/docs/r1.0.4/fair_scheduler.html
2. http://blog.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/
