It was tested with Hadoop 1.0.3 on Ubuntu 12.04.
1. Default Scheduler
Default scheduler uses a FIFO queue. This means that the job submitted first gets all resources it needs and the ones submitted later will get remaining resources till the cluster runs at full capacity. Other jobs that do not have the chance to start, have to wait for running jobs to finish.2. Why Need Another Scheduler?
Default job scheduling can hurt your applications in several levels.- You can have ad hoc queries, for example Hive queries, that are expected to finish in less than a minute.
- You can have multiple report jobs, for example implemented as Pig/Hive scripts, that are waiting to be viewed by users.
- You can have scheduled recurring jobs, for example Oozie workflows, that needs to be run every hour/day and update a data source in your system.
- These jobs can be submitted from different users and you may want to prioritize according to the user.
With default scheduler, these cases cannot be satisfied properly.
We can use different scheduling algorithms, we will look into two of them:
- Fair Scheduler
- Capacity Scheduler
These two schedulers proposes slightly different solutions.
3. Fair Scheduler
In fair scheduler, every user has a different pool of jobs. However, you don't have to assign a new pool to each user. When submitting jobs, a queue name is specified (mapred.job.queue.name), this queue name can also be used to create job pools.Cluster resources are fairly shared between these pools. User Hadooper1 can submit a single job and user Hadooper2 can submit many jobs. They get equal share of cluster on average. However, custom pools can be created by defining minimum map and reduce slots and setting a weighting for the pool.
Jobs in a pool can be scheduled using fair scheduling or first-in-first-out (FIFO) scheduling . If fair scheduling is selected for jobs inside the pool, these jobs should get equal share of resources over time. Otherwise, first submitted will be served first.
Fair scheduler provides preemption of the jobs in other pools. This means that if a pool does not get half of its fair share for a configurable time period, it can kill tasks in other pools.
4. Capacity Scheduler
Capacity scheduler gives users fair amount of cluster resources. Each user is given a FIFO queue for storing jobs.Fair scheduler gives fair share of pool resources to the jobs inside the pool (although this pool can be configured as a FIFO queue). In contrast, capacity scheduler maintains a FIFO queue for the user's jobs. First submitted job can get all the resources given to that user.
5. Enabling Fair Scheduler
Quick Setup:To enable fair scheduler in Hadoop 1.0.3, you just need to add following to your mapred-site.xml.
The hadoop-fairscheduler-1.0.3.jar is under the HADOOP_HOME/lib directory by default.
<property>
<name> mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
<name>mapred.fairscheduler.poolnameproperty</name>
<value>mapred.job.queue.name</value>
</property>
In this configuration, map-reduce queue name is used as pool name.
To view the fair-scheduler, you can view http://tasktracker:50030/scheduler.
Advanced Setup:
You can customize fair scheduler in two places.
2. http://blog.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/
Advanced Setup:
You can customize fair scheduler in two places.
- HADOOP_CONF_DIR/mapred-site.xml
- HADOOP_CONF_DIR/fair-scheduler.xm (Allocation file)
Hadoop documentation provides good explanation of these configuration parameters.
Further Reading
1. http://hadoop.apache.org/docs/r1.0.4/fair_scheduler.html2. http://blog.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/