Tests were done with Hadoop 1.0.3 on an Ubuntu 12.04 machine.
1. Number of Map Tasks
When you submit a MapReduce job (or a Pig/Hive job), Hadoop first calculates the input splits. Each input split size generally equals the HDFS block size; for example, a 1GB file yields 16 input splits when the block size is 64MB. However, the split size can be configured to be smaller or larger than the HDFS block size. Calculation of input splits is done by org.apache.hadoop.mapred.FileInputFormat, and a map task is started for each input split. First, let's investigate the properties that govern the input split size:
mapred.min.split.size
Default value: 1
Description: The minimum size chunk that map input should be split into. Note that some file formats may have minimum split sizes that take priority over this setting.
mapred.max.split.size
Default value: This property cannot be set in Hadoop 1.0.3; the maximum split size is calculated in code. In later versions, its default value is Long.MAX_VALUE, that is 9223372036854775807.
Description: The largest valid size, in bytes, for a file split.
dfs.block.size
Default value: 64 MB, that is 67108864
Description: The default block size for new files.
If you are using a newer Hadoop version, some of the above properties are deprecated; you can check the deprecated properties list in the Hadoop documentation.
Configure the properties:
You can set mapred.min.split.size in mapred-site.xml
<property>
<name>mapred.min.split.size</name>
<value>127108864</value>
</property>
You can set dfs.block.size in hdfs-site.xml
<property>
<name>dfs.block.size</name>
<value>67108864</value>
</property>
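If you prefer to set the split size per job rather than cluster-wide, the same property can be set on the job configuration in your driver code. Below is a minimal sketch for the old (mapred) API of Hadoop 1.0.3; SplitSizeConfig is just an illustrative driver class name.

import org.apache.hadoop.mapred.JobConf;

public class SplitSizeConfig {
    public static void main(String[] args) {
        // Illustrative driver class; in a real job this would be your job's main class.
        JobConf conf = new JobConf(SplitSizeConfig.class);

        // Ask for splits of at least 128MB (134217728 bytes) instead of the 1-byte default.
        conf.setLong("mapred.min.split.size", 134217728L);

        // dfs.block.size applies to files *written* by the job, not to existing input,
        // so for split-size tuning it is normally left to hdfs-site.xml.
        // ... set input/output paths, mapper, reducer, etc. and submit the job here.
    }
}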
Split Size Calculation Examples:
Calculation of the input split size is done in FileInputFormat as:
Math.max(minSize, Math.min(maxSize, blockSize));
mapred.min.split.size | mapred.max.split.size    | dfs.block.size | Input Split Size | Input Splits (1GB file)
1 (default)           | Long.MAX_VALUE (default) | 64MB (default) | 64MB             | 16
1 (default)           | Long.MAX_VALUE (default) | 128MB          | 128MB            | 8
128MB                 | Long.MAX_VALUE (default) | 64MB           | 128MB            | 8
1 (default)           | 32MB                     | 64MB           | 32MB             | 32
Configuring an input split size larger than the block size decreases data locality and can degrade performance.
According to the table above, a 1GB file will be divided into 16, 8, 8 and 32 input splits respectively.
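If you want to double check these numbers, the stand-alone sketch below reproduces the formula and the split counts from the table. computeSplitSize and countSplits are illustrative helper names, not Hadoop API methods.

public class SplitSizeDemo {
    // Same formula FileInputFormat uses: max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits for a file, ignoring the small slack factor
    // FileInputFormat applies to the last split.
    static long countSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;   // ceiling division
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long oneGB = 1024L * MB;

        // Row 1 of the table: all defaults -> 64MB splits, 16 splits for a 1GB file.
        long split = computeSplitSize(1, Long.MAX_VALUE, 64 * MB);
        System.out.println(split / MB + "MB -> " + countSplits(oneGB, split) + " splits");

        // Row 3: mapred.min.split.size = 128MB wins over the 64MB block size -> 8 splits.
        split = computeSplitSize(128 * MB, Long.MAX_VALUE, 64 * MB);
        System.out.println(split / MB + "MB -> " + countSplits(oneGB, split) + " splits");
    }
}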
What if input files are too small?
FileInputFormat splits files that are larger than the split size. But what if our input files are very small? In that case, FileInputFormat creates one input split per file. For example, if you have 100 files of 10KB each and the input split size is 64MB, there will be 100 input splits. The total input is only about 1MB, yet we end up with 100 input splits and 100 map tasks. This is known as the Hadoop small files problem. You can look at CombineFileInputFormat for a solution (a sketch follows below). Apache Pig combines small input files into one map by default.
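As one possible fix, newer Hadoop releases (2.x and later, not 1.0.3) ship CombineTextInputFormat in the new (mapreduce) API, which packs many small files into each split. A minimal sketch of the job setup, assuming a Job instance that already has its mapper, reducer and input paths configured:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFilesJobSetup {
    // Hypothetical helper: the caller passes in an otherwise fully configured Job.
    static void useCombinedSplits(Job job) {
        // Pack many small files into each split instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);

        // Cap each combined split at 64MB so splits stay roughly block-sized.
        CombineTextInputFormat.setMaxInputSplitSize(job, 67108864L);
    }
}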
So the number of map tasks depends on
- Total size of input
- Input split size
- Structure of input files (small files problem)