Saturday, June 21, 2014

Hadoop - Small Files Problem

In this post I will explain the Hadoop small files problem and mention some approaches for dealing with it.

Hadoop 1.0.3 is used on Ubuntu 12.04


Hadoop is designed to work with large data sets, and HDFS is efficient with large volumes of data. If there are lots of files that are significantly smaller than the HDFS block size (64 MB by default), both HDFS and Hadoop MapReduce suffer.

1. Implications for HDFS

HDFS keeps the metadata of every file, directory and block in the namenode's memory. As a rule of thumb, a file that occupies a single block takes roughly 300 bytes of namenode memory (about 150 bytes each for the file object and the block object). With 10 million such files, that is 10,000,000 x 300 bytes, or about 3 GB of memory. If the number of files keeps growing over time, the namenode eventually cannot sustain its operation.
In addition, HDFS is designed for streaming large amounts of data. Accessing many small files adds overhead: each file requires a separate open and seek, and the files under a directory may have to be gathered from many different datanodes.

2. Implications for MapReduce

When a job is submitted, Hadoop first computes the input splits using the FileInputFormat class. If the files are smaller than the split size (which generally equals the HDFS block size), FileInputFormat creates one split per file, and a map task is created for each split. For example, if you have 100,000 files of 10 KB each (about 1 GB in total), you will get 100,000 map tasks. Each map task processes a tiny amount of data, and the overhead of creating and scheduling that many tasks dominates the job. If the same data were stored in 64 MB files, there would only be 16 splits and map tasks.
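You can observe this split behavior on your own data without submitting a job. The following is a minimal sketch against the old (org.apache.hadoop.mapred) API of Hadoop 1.0.3; the class name SplitCounter is made up for illustration, and the input directory is taken from the command line.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Prints how many input splits (and therefore map tasks) a directory would produce.
public class SplitCounter {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SplitCounter.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0])); // input directory on HDFS

        TextInputFormat inputFormat = new TextInputFormat();
        inputFormat.configure(conf);

        // The second argument is only a hint; files smaller than the block size
        // still end up as one split each.
        InputSplit[] splits = inputFormat.getSplits(conf, 1);
        System.out.println("Number of splits: " + splits.length);
    }
}

Run against a directory of 100,000 small files, this should report 100,000 splits; against the same data stored as 64 MB files, it should report 16.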

3. Some Options

Sequence Files: Small files can be packed into a SequenceFile, typically using the file name as the key and the file contents as the value (see the sketch after this list). A SequenceFile is splittable, so MapReduce can process it in larger chunks.
Har Files: Hadoop archive (HAR) files can be used to group multiple files into one, so the namenode only has to store the archive's metadata instead of an entry for every small file. However, as MapReduce input, HAR files do not offer any improvement, because each small file inside the archive is still processed separately.
CombineFileInputFormat: This class packs multiple small files into a single split, so one map task processes many files. It reduces the number of map tasks, but it does not offer any improvement on the HDFS side.
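As an illustration of the first option, here is a minimal sketch that packs a local directory of small files into one SequenceFile on HDFS using the Hadoop 1.x SequenceFile.createWriter API. The class name SmallFilesToSequenceFile and the command-line arguments (local input directory, HDFS output path) are my own choices, and reading the file contents with java.nio.file.Files assumes Java 7 or later.

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs the small files in a local directory into a single SequenceFile on HDFS,
// using the file name as the key and the raw file bytes as the value.
public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path output = new Path(args[1]); // HDFS output path, e.g. /user/hadoop/small.seq
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, output, Text.class, BytesWritable.class);
        try {
            for (File file : new File(args[0]).listFiles()) { // local input directory
                byte[] content = Files.readAllBytes(file.toPath());
                writer.append(new Text(file.getName()), new BytesWritable(content));
            }
        } finally {
            writer.close();
        }
    }
}

A MapReduce job can then read the result with SequenceFileInputFormat, and the splits follow the SequenceFile's internal boundaries rather than the original file boundaries, so one map task handles many of the original small files.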
