In the MapReduce word count example, we find the frequency of each word in a text file. The Mapper emits each word as a key with a count of 1 as its value, and the Reducer aggregates the values of each common key into a total count. So everything is represented in the form of key-value pairs.
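As a sketch of how the pairs flow through the job, consider a hypothetical one-line input (this sample is only an illustration, not part of the program):

Input line:      car bus car
Map output:      (car, 1), (bus, 1), (car, 1)
After shuffle:   (bus, [1]), (car, [1, 1])
Reduce output:   (bus, 1), (car, 2)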
Prerequisites
- Java Installation - Check whether Java is installed or not using the following command.
java -version
- Hadoop Installation - Check whether Hadoop is installed or not using the following command.
hadoop version
If either of them is not installed on your system, follow the link below to install it.
www.BoardInfinity/hadoop-installation
Steps to execute MapReduce word count example
- Create a text file on your local machine and write some text into it.
$ nano data.txt
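For example, suppose data.txt contains the following three lines (this sample content is only an assumption, used again for the expected output at the end):

car bus train
car bus
car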
- Check the text written in the data.txt file.
$ cat data.txt
In this example, we find the frequency of each word that exists in this text file.
- Create a directory in HDFS where the text file will be kept.
$ hdfs dfs -mkdir /test
- Upload the data.txt file to the specific directory on HDFS.
$ hdfs dfs -put /home/codegyani/data.txt /test
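To confirm the upload before running the job, you can list the directory; hdfs dfs -ls is a standard HDFS shell command.

$ hdfs dfs -ls /test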
- Write the MapReduce program using Eclipse.
File: WC_Mapper.java
package com.javatpoint;
import java.io.IOException; import java.util.StringTokenizer; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapred.MapReduceBase; import org.apache.hadoop.mapred.Mapper; import org.apache.hadoop.mapred.OutputCollector; import org.apache.hadoop.mapred.Reporter; public class WC_Mapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,IntWritable>{ private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value,OutputCollector<Text,IntWritable> output, Reporter reporter) throws IOException{ String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()){ word.set(tokenizer.nextToken()); output.collect(word, one); } } }
File: WC_Reducer.java
package com.javatpoint;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reducer: sums the counts emitted for each word.
public class WC_Reducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
File: WC_Runner.java
package com.javatpoint;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

// Driver: configures and submits the word count job.
public class WC_Runner {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WC_Mapper.class);
        // The reducer also serves as the combiner, since its input and output types match.
        conf.setCombinerClass(WC_Reducer.class);
        conf.setReducerClass(WC_Reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
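Note that this program uses the classic org.apache.hadoop.mapred API. It still runs on current Hadoop releases, but new code would typically use the org.apache.hadoop.mapreduce API instead.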
- Create the jar file of this program and name it wordcountdemo.jar.
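If you prefer the command line to Eclipse, a minimal sketch for compiling and packaging looks like this (assuming the three .java files sit in the current directory; hadoop classpath prints the classpath of your Hadoop installation):

$ mkdir classes
$ javac -classpath "$(hadoop classpath)" -d classes WC_Mapper.java WC_Reducer.java WC_Runner.java
$ jar -cvf wordcountdemo.jar -C classes .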
- Run the jar file, passing the input file and an output directory as arguments.
hadoop jar /home/codegyani/wordcountdemo.jar com.javatpoint.WC_Runner /test/data.txt /r_output
- The output is stored in /r_output/part-00000.
- Now execute the following command to see the output.
hdfs dfs -cat /r_output/part-00000
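With the sample data.txt assumed above, the output would look like this (each line is a word and its count, tab-separated; the words come out sorted because MapReduce sorts by key):

bus	2
car	3
train	1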