Implementing word count statistics with MapReduce

Keywords: Programming, Hadoop, Network, Java

Inherit Mapper and specify its generic type parameters

public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable>

LongWritable -> starting offset of the input line

Text -> input line of text

Text -> output word

LongWritable -> output count

Of the four generic parameters, the first two specify the Mapper's input types (the type of the input key and the type of the input value), and the last two specify its output key and value types.

The input and output of both map and reduce are encapsulated as key-value pairs.

By default, the framework passes each line of the input to the mapper as a key-value pair: the key is the starting byte offset of the line within the file, and the value is the content of the line.
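For example (illustrative input, not from the original), if the input file contains the two lines below, the framework calls map twice with roughly these arguments, the key being the byte offset at which each line starts:

hello world     ->  map(0, "hello world")
hello hadoop    ->  map(12, "hello hadoop")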

Serialization:

In order to transmit key-value data across the network, the types must be serializable. Java has built-in serialization, but it writes redundant metadata, which is a drawback when MapReduce analyzes massive amounts of data. Hadoop therefore provides its own, more compact serialization mechanism (the Writable types such as LongWritable and Text).
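As a minimal sketch (not part of the original tutorial), a Hadoop-serializable type implements the Writable interface and writes/reads its raw fields directly, without the extra metadata Java serialization would add. The class name CountWritable below is made up for this illustration:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

//Hypothetical custom value type serialized via Hadoop's Writable instead of java.io.Serializable
public class CountWritable implements Writable {

	private long count;

	public CountWritable() {}                  //Hadoop needs a no-arg constructor

	public CountWritable(long count) { this.count = count; }

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeLong(count);                  //only the raw field bytes go onto the wire
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		count = in.readLong();                 //read fields back in the same order they were written
	}

	public long get() { return count; }
}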

Inherit Mapper and override the map method

The MapReduce framework calls this method once for every line of input it reads.

The business logic goes in the method body; the framework has already passed in the data to be processed through the method's key and value parameters.

 

@Override
protected void map(LongWritable key, Text value, Context context)
		throws IOException, InterruptedException {

}

 

Implement the business logic:

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable>{
	
	//The MapReduce framework calls this method once for every line of input it reads
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		//The business logic goes in the method body; the framework passes the data in through the parameters
		//key is the starting offset of this line, value is the text content of this line
		
		//Convert the contents of this line to a String
		String line = value.toString();
		
		//Split the line on the space separator
		//(StringUtils is assumed to be Commons Lang's here; line.split(" ") would also work)
		String[] words = StringUtils.split(line, " ");
		
		//Traverse the word array and output each word as a k-v pair: k = word, v = 1
		for(String word : words){
			
			context.write(new Text(word), new LongWritable(1));
			
		}
	}
}

 

Inherit Reducer and implement the reduce method

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable>{
	
	//After the map phase, the framework caches all k-v pairs, groups them by key,
	//and calls the reduce method once per group <key, values{}>
	//e.g. <hello, {1,1,1,1,1,1.....}>
	@Override
	protected void reduce(Text key, Iterable<LongWritable> values, Context context)
			throws IOException, InterruptedException {

		long count = 0;
		//Traverse the list of values and add them up
		for(LongWritable value : values){
			
			count += value.get();
		}
		
		//Output the total count for this word
		context.write(key, new LongWritable(count));
		
	}

}
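To make the whole flow concrete (illustrative data, not from the original): for a single input line "hello world hello", the mapper emits

hello 1
world 1
hello 1

The framework then groups these pairs by key and calls reduce once per group:

<hello, {1, 1}>  ->  writes  hello  2
<world, {1}>     ->  writes  world  1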

 

 

After the map and reduce code is written, a driver class is needed to describe the whole job:

It specifies which Mapper and which Reducer to use, and the project also has to be packaged into a jar so the cluster can distribute the map and reduce tasks.

The complete processing of one piece of business logic is called a job. The driver tells the cluster which jar to use, which Mapper and Reducer to run, the path of the data to process, and where to write the output.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Describes a specific job:
 * which class the job uses as the Mapper and which as the Reducer,
 * the path of the data the job should process,
 * and the path where the job's output will be written.
 * ....
 * @author duanhaitao@itcast.cn
 *
 */
public class WCRunner {

	public static void main(String[] args) throws Exception {
		
		Configuration conf = new Configuration();
		
		Job wcjob = Job.getInstance(conf);
		
		//Set which jar contains the classes used by this job
		wcjob.setJarByClass(WCRunner.class);
		
		//Mapper and Reducer classes used by this job
		wcjob.setMapperClass(WCMapper.class);
		wcjob.setReducerClass(WCReducer.class);
		
		//Specify the k-v types of the reduce output
		wcjob.setOutputKeyClass(Text.class);
		wcjob.setOutputValueClass(LongWritable.class);
		
		//Specify the k-v types of the map output
		wcjob.setMapOutputKeyClass(Text.class);
		wcjob.setMapOutputValueClass(LongWritable.class);
		
		//Specify the path of the input data to be processed
		FileInputFormat.setInputPaths(wcjob, new Path("hdfs://weekend110:9000/wc/srcdata/"));
		
		//Specify the path where the results will be written
		FileOutputFormat.setOutputPath(wcjob, new Path("hdfs://weekend110:9000/wc/output3/"));
		
		//Submit the job to the cluster and wait for it to finish
		wcjob.waitForCompletion(true);
			
	}
			
}
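The return value of waitForCompletion indicates whether the job succeeded. A common variant (an addition here, not in the original code) passes it to the process exit code:

	boolean success = wcjob.waitForCompletion(true);
	System.exit(success ? 0 : 1);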

 

Package the project as a jar and upload it to the Hadoop cluster.

Start HDFS and YARN.

Run the job with: hadoop jar <jar file> <main class>
