
MapReduce - Hadoop WordCount Explanation




asked by marvit on November 18, 2014 08:59 PM

Hadoop WordCount Explanation


           

2 Answers



 
answered by marvit

Detailed explanation of the MapReduce WordCount program. First, the mapper:

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

    // Get the String representation of the Text value. Text is Hadoop's
    // serializable string wrapper, which is better suited to distributed
    // processing, but plain String is easier to work with here.
    String line = value.toString();

    // A tokenizer divides the line into individual words. StringTokenizer is
    // a legacy class (not formally deprecated, but String.split() is
    // recommended for new code):
    // String[] tokens = line.split("\\s+");
    StringTokenizer tokenizer = new StringTokenizer(line);

    // hasMoreTokens() returns a boolean indicating whether any words remain.
    // If split() were used instead, this could be a for-each loop:
    // for (String token : tokens) {
    //     word.set(token);
    //     context.write(word, one);
    // }
    while (tokenizer.hasMoreTokens()) {
        // word is a Text field defined on the mapper class. Since the Text
        // type is what Hadoop serializes between nodes, we convert each
        // String token back into a Text before emitting it.
        word.set(tokenizer.nextToken());

        // Context lets you pass key-value pairs forward. After the pairs are
        // written, the shuffle groups them by key, and each key with its list
        // of values is handed to the reducer. Note that the write belongs
        // inside the loop so that every word is emitted, not just the last one.
        context.write(word, one);
    }
}
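To make the shuffle concrete, here is a minimal plain-Java sketch that simulates this data flow for one sample line, outside Hadoop entirely (the class name WordCountFlowDemo and the sample input are our own for illustration; requires Java 9+ for Map.entry):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

public class WordCountFlowDemo {
    public static void main(String[] args) {
        String line = "hello world hello"; // sample input line

        // Map phase: emit a (word, 1) pair for every token, as the mapper does
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            emitted.add(Map.entry(tokenizer.nextToken(), 1));
        }
        System.out.println("mapper output: " + emitted);
        // mapper output: [hello=1, world=1, hello=1]

        // Shuffle: the framework groups the emitted pairs by key
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : emitted) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        System.out.println("reducer input: " + grouped);
        // reducer input: {hello=[1, 1], world=[1]}

        // Reduce phase: sum the list of values for each key
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int count : entry.getValue()) {
                sum += count;
            }
            System.out.println(entry.getKey() + "\t" + sum); // hello 2, world 1
        }
    }
}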

Driver main method:

public static void main(String[] args) throws Exception {
    // Create a JobConf object and assign a job name for identification purposes
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // Set the output key and value types for both map and reduce. If the map
    // output types differ from the reduce output types, there are separate
    // setMapOutputKeyClass()/setMapOutputValueClass() methods for the map side.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // Provide the mapper and reducer class names, and set the combiner class.
    // The reducer can double as the combiner here because summation is
    // associative and commutative.
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    // The default input format, TextInputFormat, loads data as
    // (LongWritable, Text) pairs; the long value is the byte offset of the
    // line within the file.
    conf.setInputFormat(TextInputFormat.class);

    // The default output format, TextOutputFormat, writes (key, value) pairs
    // on individual lines of a text file.
    conf.setOutputFormat(TextOutputFormat.class);

    // The HDFS input and output directories are fetched from the command line
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submit the job to MapReduce; this returns only after the job has completed
    JobClient.runJob(conf);
}
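Once the classes are compiled and packaged into a jar (the jar name below is assumed for illustration), the job is launched against HDFS paths from the command line, for example: hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output. Note that the output directory must not already exist; FileOutputFormat refuses to overwrite it and the job fails on submission.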

 
answered by marvit

WordMapper.java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Break the line into words for processing
        StringTokenizer wordList = new StringTokenizer(value.toString());
        while (wordList.hasMoreTokens()) {
            word.set(wordList.nextToken());
            context.write(word, one);
        }
    }
}

SumReducer.java

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable totalWordCount = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted for this word by all mappers (and combiners)
        int wordCount = 0;
        for (IntWritable value : values) {
            wordCount += value.get();
        }
        // Reuse a single IntWritable rather than allocating one per key
        totalWordCount.set(wordCount);
        context.write(key, totalWordCount);
    }
}
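Because addition is associative and commutative, this same reducer can optionally be registered as a combiner in the driver with job.setCombinerClass(SumReducer.class); the combiner pre-aggregates counts on the map side, cutting down the volume of data shuffled across the network. This mirrors what the first answer's driver does with conf.setCombinerClass(Reduce.class).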

WordCount.java (Driver)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("usage: [input] [output]");
            System.exit(-1);
        }

        Job job = Job.getInstance(new Configuration());

        // Output key and value types produced by the reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Tell Hadoop which jar to ship to the cluster
        job.setJarByClass(WordCount.class);

        // Block until the job finishes and exit with its status;
        // job.submit() alone would return without waiting for completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
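Assuming an input file containing the single line hello world hello, the output directory will hold a reducer output file (typically part-r-00000 under Hadoop's default naming scheme) in which TextOutputFormat has written each key and its count separated by a tab:

hello	2
world	1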
