
Hadoop - Number of Map and Reduce Tasks




2001 views · asked May 18, 2015 09:42 PM

Hadoop - Number of Map and Reduce Tasks


           

1 Answer



 
answered by experts

1. Number of Map Tasks

When you submit a MapReduce job (or a Pig/Hive job), Hadoop first calculates the input splits; the size of each input split generally equals the HDFS block size. For example, a 1GB file will be divided into 16 input splits if the block size is 64MB. However, the split size can be configured to be less or more than the HDFS block size. The calculation of input splits is done by org.apache.hadoop.mapred.FileInputFormat. For each of these input splits, a map task is started.


First, let's investigate properties that govern the input split size:
mapred.min.split.size 
Default value: 1
Description: The minimum size chunk that map input should be split into. Note that some file formats may have minimum split sizes that take priority over this setting.

mapred.max.split.size 
Default value: This property cannot be set in Hadoop 1.0.3; the maximum split size is computed in code. In later versions, its default value is Long.MAX_VALUE, that is 9223372036854775807.
Description: The largest valid size, in bytes, for a file split.

dfs.block.size
Default value: 64 MB, that is 67108864
Description: The default block size for new files.

If you are using a newer Hadoop version, some of the above properties are deprecated and have been renamed (for example, mapred.min.split.size became mapreduce.input.fileinputformat.split.minsize, and dfs.block.size became dfs.blocksize). You can check the deprecated properties list in the Hadoop documentation.


Configure the properties:
You can set mapred.min.split.size in mapred-site.xml:

  <property>
    <name>mapred.min.split.size</name>
    <value>127108864</value>
  </property>

You can set dfs.block.size in hdfs-site.xml:

  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>

Split Size Calculation Examples:
The input split size is calculated in FileInputFormat as:
  Math.max(minSize, Math.min(maxSize, blockSize));

mapred.min.split.size | mapred.max.split.size    | dfs.block.size | Input Split Size | Input Splits (1GB file)
1 (default)           | Long.MAX_VALUE (default) | 64MB (default) | 64MB             | 16
1 (default)           | Long.MAX_VALUE (default) | 128MB          | 128MB            | 8
128MB                 | Long.MAX_VALUE (default) | 64MB           | 128MB            | 8
1 (default)           | 32MB                     | 64MB           | 32MB             | 32


Configuring an input split size larger than the block size decreases data locality and can degrade performance.
According to the table above, a 1GB file yields 16, 8, 8, and 32 input splits, respectively.
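The formula and table above can be reproduced with a small standalone sketch (plain Java, not Hadoop code; the class name and the hard-coded sizes are illustrative):

```java
// Mirrors the split size formula used by FileInputFormat:
// Math.max(minSize, Math.min(maxSize, blockSize))
public class SplitSizeDemo {
    public static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long fileSize = 1024 * MB; // a 1GB input file

        // Row 1 of the table: all defaults, 64MB block size
        long split = computeSplitSize(1, Long.MAX_VALUE, 64 * MB);
        System.out.println(split / MB + "MB -> " + fileSize / split + " splits"); // 64MB -> 16 splits

        // Row 3: mapred.min.split.size = 128MB pushes the split above the block size
        split = computeSplitSize(128 * MB, Long.MAX_VALUE, 64 * MB);
        System.out.println(split / MB + "MB -> " + fileSize / split + " splits"); // 128MB -> 8 splits

        // Row 4: mapred.max.split.size = 32MB caps the split below the block size
        split = computeSplitSize(1, 32 * MB, 64 * MB);
        System.out.println(split / MB + "MB -> " + fileSize / split + " splits"); // 32MB -> 32 splits
    }
}
```

Note how min wins over the block size in row 3 and max wins in row 4, which is exactly why oversized splits hurt data locality: one split then spans more than one HDFS block.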

What if input files are too small?
FileInputFormat splits files that are larger than the split size. But what if our input files are too small? In that case, FileInputFormat creates one input split per file. For example, if you have 100 files of 10KB each and the input split size is 64MB, there will be 100 input splits. The total input is only 1MB, but we still get 100 input splits and therefore 100 map tasks. This is known as the Hadoop small files problem. Look at CombineFileInputFormat for a solution; Apache Pig combines small input files into one map task by default.
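The small files behavior can be modeled with a toy calculation (plain Java, not Hadoop code; it assumes the simple rule that every file yields at least one split and larger files yield ceil(size / splitSize) splits):

```java
import java.util.Arrays;

public class SmallFilesDemo {
    // Toy model of FileInputFormat: each file becomes at least one split,
    // and a file larger than the split size becomes ceil(size / splitSize) splits.
    public static long countSplits(long[] fileSizes, long splitSize) {
        long splits = 0;
        for (long size : fileSizes) {
            splits += Math.max(1, (size + splitSize - 1) / splitSize);
        }
        return splits;
    }

    public static void main(String[] args) {
        long KB = 1024, MB = 1024 * KB;

        long[] smallFiles = new long[100];
        Arrays.fill(smallFiles, 10 * KB); // 100 files of 10KB each (1MB total)
        // Despite only 1MB of input, we get 100 splits -> 100 map tasks.
        System.out.println(countSplits(smallFiles, 64 * MB)); // prints 100

        long[] oneBigFile = { 1024 * MB }; // a single 1GB file
        System.out.println(countSplits(oneBigFile, 64 * MB)); // prints 16
    }
}
```

This makes the contrast concrete: 1MB spread over 100 files costs 100 map tasks, while a single 1GB file costs only 16.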

So the number of map tasks depends on
- Total size of input
- Input split size
- Structure of input files (small files problem)

2. Number of Reducer Tasks

The number of reduce tasks is determined by the mapred.reduce.tasks property in the JobConf, which is set by the setNumReduceTasks() method; Hadoop simply creates that number of reduce tasks to be run.
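A driver sketch using the old mapred API (the same JobConf and setNumReduceTasks() named above) might look like this. The class name, job name, and reducer count are placeholders, and it needs Hadoop on the classpath, so treat it as a configuration sketch rather than a runnable program:

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyJobDriver.class);
        conf.setJobName("reduce-count-demo");

        // Equivalent to setting mapred.reduce.tasks=4 in the JobConf:
        conf.setNumReduceTasks(4);

        // ... set mapper/reducer classes and input/output paths here ...
        JobClient.runJob(conf);
    }
}
```

Unlike the number of map tasks, which Hadoop derives from the input splits, this value is entirely up to you: set it to 0 for a map-only job.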