Developing an Inverted Index with MapReduce

Keywords: Hadoop, Apache, Java, Linux

An inverted index is a variant of word-frequency counting: it is still a word count, but each count also records the file the word came from. Inverted indexes are widely used in full-text retrieval. The final result maps each word to the set of documents containing it, together with the number of occurrences in each. Take the following data as an example:

file1.txt

hdfs hadoop mapreduce
hdfs bigdata
hadoop mapreduce

file2.txt

mapreduce hdfs
hadoop bigdata mapreduce
hdfs hadoop
hdfs mapreduce
bigdata hadoop

file3.txt

bigdata hadoop mapreduce hdfs
hadoop hdfs mapreduce
bigdata hadoop

The final result is:

bigdata	file3.txt:2;file2.txt:2;file1.txt:1;
hadoop	file1.txt:2;file3.txt:3;file2.txt:3;
hdfs	file2.txt:3;file1.txt:2;file3.txt:2;
mapreduce	file3.txt:2;file1.txt:2;file2.txt:3;
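
This layout is exactly what full-text retrieval needs: given a word, the documents containing it (with counts) can be fetched in a single lookup. Here is a minimal plain-Java sketch of querying such an index; the class name and the in-memory map are illustrative (contents taken from the result above, Java 9+ assumed for Map.of):

import java.util.List;
import java.util.Map;

public class IndexLookup {
	public static void main(String[] args) {
		// Hypothetical in-memory view of the inverted index above:
		// word -> list of "file:count" postings.
		Map<String, List<String>> index = Map.of(
				"bigdata", List.of("file3.txt:2", "file2.txt:2", "file1.txt:1"),
				"hadoop", List.of("file1.txt:2", "file3.txt:3", "file2.txt:3"));
		// Retrieval is a single map lookup:
		System.out.println(index.get("hadoop"));
		// prints [file1.txt:2, file3.txt:3, file2.txt:3]
	}
}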

The design of an inverted index is similar to that of word-frequency counting, but not identical: here the statistics must also record the file each word appears in. To achieve this, we first count words together with the documents they appear in. So in map we do the same as in word counting, except that the output key is not the word alone but word + ":" + file, producing pairs like:

<hdfs:file1.txt , 1>
<hdfs:file2.txt , 1>
<hdfs:file3.txt , 1>
<hadoop:file1.txt , 1>
<mapreduce:file3.txt , 1>

Before the reduce phase, we run a combine step. Its input has the form <word:document, <list of counts>>:

<hdfs:file1.txt , <1,1>>
<hadoop:file1.txt , <1,1>>
<hdfs:file2.txt , <1,1,1>>
<hdfs:file3.txt , <1,1>>

In the combine step we reshape each pair and output it as <word, document:count>:

<hdfs , file1.txt:2>
<hdfs , file2.txt:3>
<hdfs , file3.txt:2>

In the reduce phase, we simply splice these values together for each word. To sum up the whole process: map emits <word:file, 1>, combine turns these into <word, file:count>, and reduce concatenates the file:count entries of each word into one line.
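
To make the three steps concrete before reading the Hadoop code, here is a small plain-Java simulation of the same pipeline on the sample data (no Hadoop required; class and variable names are illustrative, and the file order within each output line may differ from the actual Hadoop run):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class InvertIndexTrace {
	public static void main(String[] args) {
		// The sample files from above: file name -> lines.
		Map<String, String[]> files = new LinkedHashMap<>();
		files.put("file1.txt", new String[] {"hdfs hadoop mapreduce", "hdfs bigdata", "hadoop mapreduce"});
		files.put("file2.txt", new String[] {"mapreduce hdfs", "hadoop bigdata mapreduce",
				"hdfs hadoop", "hdfs mapreduce", "bigdata hadoop"});
		files.put("file3.txt", new String[] {"bigdata hadoop mapreduce hdfs",
				"hadoop hdfs mapreduce", "bigdata hadoop"});
		// Map + combine: count occurrences per "word:file" key.
		Map<String, Integer> perFile = new TreeMap<>();
		for (Map.Entry<String, String[]> f : files.entrySet())
			for (String line : f.getValue())
				for (String word : line.split("\\s+"))
					perFile.merge(word + ":" + f.getKey(), 1, Integer::sum);
		// Reduce: regroup by word and splice the "file:count" entries together.
		Map<String, StringBuilder> index = new TreeMap<>();
		for (Map.Entry<String, Integer> e : perFile.entrySet()) {
			int i = e.getKey().indexOf(':');
			index.computeIfAbsent(e.getKey().substring(0, i), k -> new StringBuilder())
					.append(e.getKey().substring(i + 1)).append(':').append(e.getValue()).append(';');
		}
		index.forEach((word, list) -> System.out.println(word + "\t" + list));
	}
}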

The code for the inverted index is given below:

package com.xxx.hadoop.mapred;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
/**
 * Inverted index
 *
 */
public class InvertIndexApp {

	public static class Map extends Mapper<LongWritable, Text, Text, Text>{
		private Text keyinfo = new Text();
		private Text valueinfo = new Text();
		private FileSplit split;
		@Override
		protected void map(LongWritable key, Text value,
				Context context) throws IOException, InterruptedException {
			// The input split tells us which file this line came from.
			split = (FileSplit) context.getInputSplit();
			StringTokenizer tokenizer = new StringTokenizer(value.toString());
			while(tokenizer.hasMoreTokens()) {
				// Extract the file name from the full HDFS path; this relies on
				// the sample file names starting with "file".
				int splitIndex = split.getPath().toString().indexOf("file");
				// Emit <word:fileName, 1>, e.g. <hdfs:file1.txt, 1>.
				keyinfo.set(tokenizer.nextToken() + ":" + split.getPath().toString().substring(splitIndex));
				valueinfo.set("1");
				context.write(keyinfo, valueinfo);
			}
		}
	}
	
	
	public static class Combine extends Reducer<Text, Text, Text, Text>{
		private Text info = new Text();
		@Override
		protected void reduce(Text key, Iterable<Text> values,
				Context context) throws IOException, InterruptedException {
			// Sum the 1s for one <word:fileName> key.
			int sum = 0;
			for(Text value : values) {
				sum += Integer.parseInt(value.toString());
			}
			// Reshape <word:fileName, sum> into <word, fileName:sum>
			// so the reduce phase groups by word alone.
			int splitIndex = key.toString().indexOf(":");
			info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
			key.set(key.toString().substring(0, splitIndex));
			context.write(key, info);
		}
	}
	
	public static class Reduce extends Reducer<Text, Text, Text, Text>{
		private Text result = new Text();
		@Override
		protected void reduce(Text key, Iterable<Text> values,
				Context context) throws IOException, InterruptedException {
			// Concatenate all fileName:count entries for this word.
			StringBuilder filelist = new StringBuilder();
			for(Text value : values) {
				filelist.append(value.toString()).append(";");
			}
			result.set(filelist.toString());
			context.write(key, result);
		}
	}
	public static void main(String[] args) throws Exception {
		String input = "/user/root/invertindex/input";
		String output = "/user/root/invertindex/output";
		System.setProperty("HADOOP_USER_NAME", "root");
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://192.168.56.202:9000");
		// Delete a stale output directory first, otherwise the job refuses to start.
		FileSystem fs = FileSystem.get(conf);
		if (fs.exists(new Path(output))) {
			fs.delete(new Path(output), true);
		}
		Job job = Job.getInstance(conf);
		job.setJarByClass(InvertIndexApp.class);
		
		job.setMapperClass(Map.class);
		job.setCombinerClass(Combine.class);
		job.setReducerClass(Reduce.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		
		FileInputFormat.addInputPath(job, new Path(input));
		FileOutputFormat.setOutputPath(job, new Path(output));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

}

Prepare the data before running by uploading the three sample files to /user/root/invertindex/input on HDFS.
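
This can be done with hdfs dfs -put, or programmatically. Below is a minimal sketch using the Hadoop FileSystem API, assuming file1.txt through file3.txt exist in the local working directory (the class name is illustrative; paths match the job configuration above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadInput {
	public static void main(String[] args) throws Exception {
		System.setProperty("HADOOP_USER_NAME", "root");
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://192.168.56.202:9000");
		FileSystem fs = FileSystem.get(conf);
		Path dir = new Path("/user/root/invertindex/input");
		fs.mkdirs(dir);
		// Assumes the sample files exist in the local working directory.
		for (String name : new String[] {"file1.txt", "file2.txt", "file3.txt"}) {
			fs.copyFromLocalFile(new Path(name), new Path(dir, name));
		}
		fs.close();
	}
}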

Running the program, the console prints the following:

2019-09-02 09:56:22 [INFO ]  [main]  [org.apache.hadoop.conf.Configuration.deprecation] session.id is deprecated. Instead, use dfs.metrics.session-id
2019-09-02 09:56:22 [INFO ]  [main]  [org.apache.hadoop.metrics.jvm.JvmMetrics] Initializing JVM Metrics with processName=JobTracker, sessionId=
2019-09-02 09:56:22 [WARN ]  [main]  [org.apache.hadoop.mapreduce.JobResourceUploader] Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2019-09-02 09:56:22 [WARN ]  [main]  [org.apache.hadoop.mapreduce.JobResourceUploader] No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2019-09-02 09:56:22 [INFO ]  [main]  [org.apache.hadoop.mapreduce.lib.input.FileInputFormat] Total input paths to process : 3
2019-09-02 09:56:22 [INFO ]  [main]  [org.apache.hadoop.mapreduce.JobSubmitter] number of splits:3
2019-09-02 09:56:22 [INFO ]  [main]  [org.apache.hadoop.mapreduce.JobSubmitter] Submitting tokens for job: job_local1888565320_0001
2019-09-02 09:56:23 [INFO ]  [main]  [org.apache.hadoop.mapreduce.Job] The url to track the job: http://localhost:8080/
2019-09-02 09:56:23 [INFO ]  [main]  [org.apache.hadoop.mapreduce.Job] Running job: job_local1888565320_0001
2019-09-02 09:56:23 [INFO ]  [Thread-3]  [org.apache.hadoop.mapred.LocalJobRunner] OutputCommitter set in config null
2019-09-02 09:56:23 [INFO ]  [Thread-3]  [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] File Output Committer Algorithm version is 1
2019-09-02 09:56:23 [INFO ]  [Thread-3]  [org.apache.hadoop.mapred.LocalJobRunner] OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2019-09-02 09:56:23 [INFO ]  [Thread-3]  [org.apache.hadoop.mapred.LocalJobRunner] Waiting for map tasks
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.LocalJobRunner] Starting task: attempt_local1888565320_0001_m_000000_0
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] File Output Committer Algorithm version is 1
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.yarn.util.ProcfsBasedProcessTree] ProcfsBasedProcessTree currently is supported only on Linux.
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.Task]  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@e65aa68
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] Processing split: hdfs://192.168.56.202:9000/user/root/invertindex/input/file2.txt:0+82
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] (EQUATOR) 0 kvi 26214396(104857584)
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] mapreduce.task.io.sort.mb: 100
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] soft limit at 83886080
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] bufstart = 0; bufvoid = 104857600
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] kvstart = 26214396; length = 6553600
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.LocalJobRunner] 
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] Starting flush of map output
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] Spilling map output
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] bufstart = 0; bufend = 214; bufvoid = 104857600
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] kvstart = 26214396(104857584); kvend = 26214356(104857424); length = 41/6553600
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] Finished spill 0
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.Task] Task:attempt_local1888565320_0001_m_000000_0 is done. And is in the process of committing
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.LocalJobRunner] map
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.Task] Task 'attempt_local1888565320_0001_m_000000_0' done.
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.LocalJobRunner] Finishing task: attempt_local1888565320_0001_m_000000_0
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.LocalJobRunner] Starting task: attempt_local1888565320_0001_m_000001_0
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] File Output Committer Algorithm version is 1
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.yarn.util.ProcfsBasedProcessTree] ProcfsBasedProcessTree currently is supported only on Linux.
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.Task]  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@4012ecfc
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] Processing split: hdfs://192.168.56.202:9000/user/root/invertindex/input/file3.txt:0+67
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] (EQUATOR) 0 kvi 26214396(104857584)
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] mapreduce.task.io.sort.mb: 100
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] soft limit at 83886080
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] bufstart = 0; bufvoid = 104857600
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] kvstart = 26214396; length = 6553600
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.LocalJobRunner] 
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] Starting flush of map output
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] Spilling map output
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] bufstart = 0; bufend = 175; bufvoid = 104857600
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] kvstart = 26214396(104857584); kvend = 26214364(104857456); length = 33/6553600
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] Finished spill 0
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.Task] Task:attempt_local1888565320_0001_m_000001_0 is done. And is in the process of committing
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.LocalJobRunner] map
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.Task] Task 'attempt_local1888565320_0001_m_000001_0' done.
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.LocalJobRunner] Finishing task: attempt_local1888565320_0001_m_000001_0
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.LocalJobRunner] Starting task: attempt_local1888565320_0001_m_000002_0
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] File Output Committer Algorithm version is 1
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.yarn.util.ProcfsBasedProcessTree] ProcfsBasedProcessTree currently is supported only on Linux.
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.Task]  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@157add31
2019-09-02 09:56:23 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] Processing split: hdfs://192.168.56.202:9000/user/root/invertindex/input/file1.txt:0+52
2019-09-02 09:56:24 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] (EQUATOR) 0 kvi 26214396(104857584)
2019-09-02 09:56:24 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] mapreduce.task.io.sort.mb: 100
2019-09-02 09:56:24 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] soft limit at 83886080
2019-09-02 09:56:24 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] bufstart = 0; bufvoid = 104857600
2019-09-02 09:56:24 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] kvstart = 26214396; length = 6553600
2019-09-02 09:56:24 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2019-09-02 09:56:24 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.LocalJobRunner] 
2019-09-02 09:56:24 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] Starting flush of map output
2019-09-02 09:56:24 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] Spilling map output
2019-09-02 09:56:24 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] bufstart = 0; bufend = 136; bufvoid = 104857600
2019-09-02 09:56:24 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] kvstart = 26214396(104857584); kvend = 26214372(104857488); length = 25/6553600
2019-09-02 09:56:24 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.MapTask] Finished spill 0
2019-09-02 09:56:24 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.Task] Task:attempt_local1888565320_0001_m_000002_0 is done. And is in the process of committing
2019-09-02 09:56:24 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.LocalJobRunner] map
2019-09-02 09:56:24 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.Task] Task 'attempt_local1888565320_0001_m_000002_0' done.
2019-09-02 09:56:24 [INFO ]  [LocalJobRunner Map Task Executor #0]  [org.apache.hadoop.mapred.LocalJobRunner] Finishing task: attempt_local1888565320_0001_m_000002_0
2019-09-02 09:56:24 [INFO ]  [Thread-3]  [org.apache.hadoop.mapred.LocalJobRunner] map task executor complete.
2019-09-02 09:56:24 [INFO ]  [main]  [org.apache.hadoop.mapreduce.Job] Job job_local1888565320_0001 running in uber mode : false
2019-09-02 09:56:24 [INFO ]  [main]  [org.apache.hadoop.mapreduce.Job]  map 100% reduce 0%
2019-09-02 09:56:24 [INFO ]  [Thread-3]  [org.apache.hadoop.mapred.LocalJobRunner] Waiting for reduce tasks
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapred.LocalJobRunner] Starting task: attempt_local1888565320_0001_r_000000_0
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] File Output Committer Algorithm version is 1
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.yarn.util.ProcfsBasedProcessTree] ProcfsBasedProcessTree currently is supported only on Linux.
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapred.Task]  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@27512f1a
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapred.ReduceTask] Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@4ac60620
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl] MergerManager: memoryLimit=1265788544, maxSingleShuffleLimit=316447136, mergeThreshold=835420480, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2019-09-02 09:56:24 [INFO ]  [EventFetcher for fetching Map Completion Events]  [org.apache.hadoop.mapreduce.task.reduce.EventFetcher] attempt_local1888565320_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2019-09-02 09:56:24 [INFO ]  [localfetcher#1]  [org.apache.hadoop.mapreduce.task.reduce.LocalFetcher] localfetcher#1 about to shuffle output of map attempt_local1888565320_0001_m_000001_0 decomp: 88 len: 92 to MEMORY
2019-09-02 09:56:24 [INFO ]  [localfetcher#1]  [org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput] Read 88 bytes from map-output for attempt_local1888565320_0001_m_000001_0
2019-09-02 09:56:24 [INFO ]  [localfetcher#1]  [org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl] closeInMemoryFile -> map-output of size: 88, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->88
2019-09-02 09:56:24 [INFO ]  [localfetcher#1]  [org.apache.hadoop.mapreduce.task.reduce.LocalFetcher] localfetcher#1 about to shuffle output of map attempt_local1888565320_0001_m_000002_0 decomp: 88 len: 92 to MEMORY
2019-09-02 09:56:24 [INFO ]  [localfetcher#1]  [org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput] Read 88 bytes from map-output for attempt_local1888565320_0001_m_000002_0
2019-09-02 09:56:24 [INFO ]  [localfetcher#1]  [org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl] closeInMemoryFile -> map-output of size: 88, inMemoryMapOutputs.size() -> 2, commitMemory -> 88, usedMemory ->176
2019-09-02 09:56:24 [INFO ]  [localfetcher#1]  [org.apache.hadoop.mapreduce.task.reduce.LocalFetcher] localfetcher#1 about to shuffle output of map attempt_local1888565320_0001_m_000000_0 decomp: 88 len: 92 to MEMORY
2019-09-02 09:56:24 [INFO ]  [localfetcher#1]  [org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput] Read 88 bytes from map-output for attempt_local1888565320_0001_m_000000_0
2019-09-02 09:56:24 [INFO ]  [localfetcher#1]  [org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl] closeInMemoryFile -> map-output of size: 88, inMemoryMapOutputs.size() -> 3, commitMemory -> 176, usedMemory ->264
2019-09-02 09:56:24 [INFO ]  [EventFetcher for fetching Map Completion Events]  [org.apache.hadoop.mapreduce.task.reduce.EventFetcher] EventFetcher is interrupted.. Returning
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapred.LocalJobRunner] 3 / 3 copied.
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl] finalMerge called with 3 in-memory map-outputs and 0 on-disk map-outputs
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapred.Merger] Merging 3 sorted segments
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapred.Merger] Down to the last merge-pass, with 3 segments left of total size: 234 bytes
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl] Merged 3 segments, 264 bytes to disk to satisfy reduce memory limit
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl] Merging 1 files, 264 bytes from disk
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl] Merging 0 segments, 0 bytes from memory into reduce
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapred.Merger] Merging 1 sorted segments
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapred.Merger] Down to the last merge-pass, with 1 segments left of total size: 250 bytes
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapred.LocalJobRunner] 3 / 3 copied.
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.conf.Configuration.deprecation] mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapred.Task] Task:attempt_local1888565320_0001_r_000000_0 is done. And is in the process of committing
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapred.LocalJobRunner] 3 / 3 copied.
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapred.Task] Task attempt_local1888565320_0001_r_000000_0 is allowed to commit now
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter] Saved output of task 'attempt_local1888565320_0001_r_000000_0' to hdfs://192.168.56.202:9000/user/root/invertindex/output/_temporary/0/task_local1888565320_0001_r_000000
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapred.LocalJobRunner] reduce > reduce
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapred.Task] Task 'attempt_local1888565320_0001_r_000000_0' done.
2019-09-02 09:56:24 [INFO ]  [pool-6-thread-1]  [org.apache.hadoop.mapred.LocalJobRunner] Finishing task: attempt_local1888565320_0001_r_000000_0
2019-09-02 09:56:24 [INFO ]  [Thread-3]  [org.apache.hadoop.mapred.LocalJobRunner] reduce task executor complete.
2019-09-02 09:56:25 [INFO ]  [main]  [org.apache.hadoop.mapreduce.Job]  map 100% reduce 100%
2019-09-02 09:56:25 [INFO ]  [main]  [org.apache.hadoop.mapreduce.Job] Job job_local1888565320_0001 completed successfully
2019-09-02 09:56:25 [INFO ]  [main]  [org.apache.hadoop.mapreduce.Job] Counters: 35
	File System Counters
		FILE: Number of bytes read=4510
		FILE: Number of bytes written=1098080
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=633
		HDFS: Number of bytes written=178
		HDFS: Number of read operations=37
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=10
	Map-Reduce Framework
		Map input records=11
		Map output records=27
		Map output bytes=525
		Map output materialized bytes=276
		Input split bytes=387
		Combine input records=27
		Combine output records=12
		Reduce input groups=4
		Reduce shuffle bytes=276
		Reduce input records=12
		Reduce output records=4
		Spilled Records=24
		Shuffled Maps =3
		Failed Shuffles=0
		Merged Map outputs=3
		GC time elapsed (ms)=8
		Total committed heap usage (bytes)=1485307904
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=201
	File Output Format Counters 
		Bytes Written=178

Viewing the results of the run shows the inverted index listed at the beginning of this article.
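
The output can be inspected with hdfs dfs -cat, or again via the FileSystem API. A minimal sketch (class name illustrative; part-r-00000 is the default output file of the single reduce task):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ShowOutput {
	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", "hdfs://192.168.56.202:9000");
		FileSystem fs = FileSystem.get(conf);
		// Stream the reducer's output file to the console.
		try (FSDataInputStream in = fs.open(new Path("/user/root/invertindex/output/part-r-00000"))) {
			IOUtils.copyBytes(in, System.out, 4096, false);
		}
	}
}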

What differs from an ordinary MapReduce program is that we explicitly specify a combine step between map and reduce. In this step we reshape the <key, value> pairs so that the reduce phase becomes simpler. Ordinary MapReduce programs can use a combiner too, but it is easy to overlook because it is usually just set to the reducer class itself, as in word count; here the combiner must be a separate class, since it changes the key. Note that Hadoop does not guarantee how many times a combiner runs, so relying on it to reshape keys works in this example but is not strictly portable.
