Big Data Learning 06_Hadoop: An Overview of MapReduce



Overview of MapReduce

MapReduce Core


A MapReduce program is divided into at least two phases:

  1. In the first phase, the concurrent MapTask instances run completely in parallel and are independent of one another.
  2. In the second phase, the concurrent ReduceTask instances also run completely in parallel and are independent of one another; however, their input depends on the output of all the MapTask instances from the previous phase.
  3. The MapReduce programming model contains only one Map phase and one Reduce phase; if the business logic is more complex, multiple MapReduce programs have to be run serially, one after another (see the job-chaining sketch below).
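
A minimal sketch of such serial chaining (the class name ChainedJobsDriver, the intermediate path /tmp/step1-out, and the job names are placeholders, and the mapper/reducer setup is elided) simply starts the second job after the first one has finished:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    public class ChainedJobsDriver {
    
    	public static void main(String[] args) throws Exception {
    		Configuration conf = new Configuration();
    
    		// Step 1: the first MapReduce program writes to an intermediate directory
    		Job first = Job.getInstance(conf, "step-1");
    		first.setJarByClass(ChainedJobsDriver.class);
    		// ... setMapperClass / setReducerClass / output types for step 1 ...
    		FileInputFormat.setInputPaths(first, new Path(args[0]));
    		FileOutputFormat.setOutputPath(first, new Path("/tmp/step1-out"));
    
    		// Step 2 starts only after step 1 has finished successfully
    		if (!first.waitForCompletion(true)) {
    			System.exit(1);
    		}
    
    		Job second = Job.getInstance(conf, "step-2");
    		second.setJarByClass(ChainedJobsDriver.class);
    		// ... setMapperClass / setReducerClass / output types for step 2 ...
    		FileInputFormat.setInputPaths(second, new Path("/tmp/step1-out"));
    		FileOutputFormat.setOutputPath(second, new Path(args[1]));
    
    		System.exit(second.waitForCompletion(true) ? 0 : 1);
    	}
    }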

A complete MapReduce program has three types of instance processes at distributed runtime:

  1. MrAppMaster: Responsible for process scheduling and state coordination throughout the program
  2. MapTask: Responsible for the entire data processing flow of the Map phase
  3. ReduceTask: Responsible for the entire data processing process in the Reduce phase

MapReduce Programming Specification

  1. Mapper class

    1. The user-defined Mapper class extends the Mapper parent class provided in the Hadoop jars
    2. Mapper's input data is in the form of KV pairs (KV types can be customized)
    3. Business logic in Mapper is written in the map() method
    4. Mapper's output data is in the form of KV pairs (KV types can be customized)
    5. The MapTask process calls the map() method once for each input <K, V> pair
  2. Reducer class

    1. The user-defined Reducer class extends the Reducer parent class provided in the Hadoop jars
    2. Reducer's input data type corresponds to Mapper's output data type
    3. Reducer's business logic is written in the reduce() method
    4. The ReduceTask process calls the reduce() method once for each group of <K, V> pairs sharing the same key
  3. Driver class
    Equivalent to a client of the YARN cluster: it submits the entire program to the YARN cluster as a Job object that encapsulates the MapReduce program's runtime parameters

  4. MapReduce provides the following built-in data serialization types (a short usage sketch follows the table)

    Java type    Hadoop Writable type
    boolean      BooleanWritable
    byte         ByteWritable
    int          IntWritable
    float        FloatWritable
    long         LongWritable
    double       DoubleWritable
    String       Text
    Map          MapWritable
    array        ArrayWritable
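
    A minimal standalone usage sketch (the class name WritableTypesDemo is only illustrative and is not part of the WordCount program below) showing that the Writable types are mutable wrappers around the corresponding Java values:

      import org.apache.hadoop.io.BooleanWritable;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      
      public class WritableTypesDemo {
      	public static void main(String[] args) {
      		Text word = new Text("hadoop");                   // String  -> Text
      		IntWritable count = new IntWritable(1);           // int     -> IntWritable
      		LongWritable offset = new LongWritable(42L);      // long    -> LongWritable
      		BooleanWritable flag = new BooleanWritable(true); // boolean -> BooleanWritable
      
      		// get()/toString() convert back to plain Java values
      		System.out.println(word + "\t" + count.get() + "\t" + offset.get() + "\t" + flag.get());
      	}
      }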

MapReduce Case Practice 1: WordCount

  1. Environment preparation:

    1. Add the directory containing the Hadoop jars to the system path
    2. Create a new Maven project and add the following dependencies:
    <dependencies>
    	<dependency>
    		<groupId>junit</groupId>
    		<artifactId>junit</artifactId>
    		<version>RELEASE</version>
    	</dependency>
    	<dependency>
    		<groupId>org.apache.logging.log4j</groupId>
    		<artifactId>log4j-core</artifactId>
    		<version>2.8.2</version>
    	</dependency>
    	<dependency>
    		<groupId>org.apache.hadoop</groupId>
    		<artifactId>hadoop-common</artifactId>
    		<version>2.7.7</version>
    	</dependency>
    	<dependency>
    		<groupId>org.apache.hadoop</groupId>
    		<artifactId>hadoop-client</artifactId>
    		<version>2.7.7</version>
    	</dependency>
    	<dependency>
    		<groupId>org.apache.hadoop</groupId>
    		<artifactId>hadoop-hdfs</artifactId>
    		<version>2.7.7</version>
    	</dependency>
    </dependencies>
    
  2. Write a program

    1. Write the Mapper class:
      package cn.maoritian.mapreduce;
      
      import java.io.IOException;
      
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      
      public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      	// Four type parameters of Mapper object KEYIN, VALUEIN, KEYOUT, VALUEOUT
      	// 	KEYIN: Type of input key: LongWritable
      	// 	VALUEIN: Type of input value: Text (line content)
      	// 	KEYOUT: Type of output key: Text (word string)
      	// 	VALUEOUT: Type of output value: IntWritable (occurrences)
      
      	Text k = new Text();
      	IntWritable v = new IntWritable(1);
      	
      	@Override
      	protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      		// 1 Get a row
      		String line = value.toString();
      		// 2 Cut
      		String[] words = line.split(" ");
      		// 3 Output
      		for (String word : words) {
      			k.set(word);
      			context.write(k, v);
      		}
      	}
      }
      
    2. Write the Reducer class:
      package cn.maoritian.mapreduce;
      
      import java.io.IOException;
      
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;
      
      public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      	// Four type parameters of Reducer object KEYIN, VALUEIN, KEYOUT, VALUEOUT
      	// 	KEYIN: Type of input key: Text (word string)
      	// 	VALUEIN: Type of input value: IntWritable (occurrences)
      	// 	KEYOUT: Type of output key: Text (word string)
      	// 	VALUEOUT: Type of output value: IntWritable (summation of occurrences)
      
      	int sum;
      	IntWritable v = new IntWritable();
      
      	@Override
      	protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      			throws IOException, InterruptedException {
      
      		// 1. Accumulative Sum
      		sum = 0;
      		for (IntWritable count : values) {
      			sum += count.get();
      		}
      
      		// 2. Output
      		v.set(sum);
      		context.write(key, v);
      	}
      }
      
    3. Write the Driver class:
      package cn.maoritian.mapreduce;
      
      import java.io.IOException;
      
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      
      public class WordcountDriver {
      
      	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
      		// 1. Get configuration information and encapsulate tasks
      		Configuration configuration = new Configuration();
      		Job job = Job.getInstance(configuration);
      
      		// 2. Set the jar load path to be the path of the driver class
      		job.setJarByClass(WordcountDriver.class);
      
      		// 3. Set up map and reduce classes
      		job.setMapperClass(WordcountMapper.class);
      		job.setReducerClass(WordcountReducer.class);
      
      		// 4. Set map output
      		job.setMapOutputKeyClass(Text.class);
      		job.setMapOutputValueClass(IntWritable.class);
      
      		// 5. Set the final output kv type
      		job.setOutputKeyClass(Text.class);
      		job.setOutputValueClass(IntWritable.class);
      
      		// 6 Set input and output paths
      		FileInputFormat.setInputPaths(job, new Path(args[0]));
      		FileOutputFormat.setOutputPath(job, new Path(args[1]));
      
      		// 7 Submit and Exit
      		boolean result = job.waitForCompletion(true);
      
      		System.exit(result ? 0 : 1);
      	}
      }
      

    Note: there are a number of pitfalls when running this directly on Windows, in particular the missing native winutils.exe binaries; prebuilt ones are available at https://github.com/srccodes/hadoop-common-2.2.0-bin/tree/master/.

  3. Package the program into a jar: right-click the project and choose Run As -> Maven install. After packaging, the generated jar can be found in the target directory; rename it wc.jar and copy it to the Hadoop directory.
    Then use the command below to execute the MapReduce program:

    hadoop jar wc.jar cn.maoritian.mapreduce.WordcountDriver /wcinput/input.txt /wcoutput/	
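
    After the job finishes, the word counts can be inspected directly in HDFS; with the default single ReduceTask the result file is named part-r-00000 (the output path matches the command above):

    hadoop fs -cat /wcoutput/part-r-00000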
    

Hadoop serialization

  1. What does Hadoop serialization mean, and why not use Java's own serialization framework?
    MapReduce's built-in serialization types cover the basic Java types; to serialize a custom Java bean we combine these basic types ourselves.
    Java serialization (Serializable) is a heavyweight framework: every serialized object carries a lot of extra information (checks, stream headers, inheritance information, etc.), which makes it inefficient to transfer across the network. Hadoop therefore provides its own serialization mechanism (Writable). A small size-comparison sketch follows this list.

  2. Implementation of Hadoop Serialization Interface
    The steps to serialize the bean object are as follows:

    1. Implement the Writable interface
    2. Deserialization invokes the no-argument constructor via reflection, so the bean must provide one
      public FlowBean() {	// no-argument constructor of the bean class (FlowBean is used as the example here)
      	super();
      }
      
    3. Override serialization method write()
      @Override
      public void write(DataOutput out) throws IOException {
      	out.writeLong(parameter1);
      	out.writeLong(parameter2);
      	out.writeLong(parameter3);
      }
      
    4. Override the deserialization method readFields()
      @Override
      public void readFields(DataInput in) throws IOException {
      	parameter1 = in.readLong();
      	parameter2 = in.readLong();
      	parameter3 = in.readLong();
      }
      
    5. Note that the fields must be deserialized in exactly the same order in which they were serialized
    6. To display the result in the output file, override the toString() method, separating the fields with \t for later processing
    7. If the custom bean is transferred in the key, it must also implement the Comparable interface (in Hadoop this is usually done via WritableComparable, which combines Writable and Comparable), because the Shuffle process of the MapReduce framework requires that keys be sortable.
      @Override
      public int compareTo(FlowBean o) {
      	// Reverse order, from large to small (compare by total flow)
      	return Long.compare(o.sumFlow, this.sumFlow);
      }
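
The size difference between the two mechanisms can be seen with a small self-contained sketch (illustrative only; the class and field names are placeholders): the same three long fields are written once with Java serialization and once as Writables, and the resulting byte counts are printed for comparison:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    
    import org.apache.hadoop.io.LongWritable;
    
    public class SerializationSizeDemo {
    
    	static class JavaBean implements Serializable {
    		long upFlow = 1, downFlow = 2, sumFlow = 3;
    	}
    
    	public static void main(String[] args) throws IOException {
    		// Java serialization: stream header, class descriptor, etc. are written along with the data
    		ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
    		ObjectOutputStream oos = new ObjectOutputStream(javaBytes);
    		oos.writeObject(new JavaBean());
    		oos.close();
    
    		// Hadoop Writable: only the raw field bytes are written
    		ByteArrayOutputStream hadoopBytes = new ByteArrayOutputStream();
    		DataOutputStream out = new DataOutputStream(hadoopBytes);
    		new LongWritable(1).write(out);
    		new LongWritable(2).write(out);
    		new LongWritable(3).write(out);
    		out.close();
    
    		System.out.println("Java serialization: " + javaBytes.size() + " bytes");
    		System.out.println("Hadoop Writable:    " + hadoopBytes.size() + " bytes");
    	}
    }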
      

MapReduce Case Practice 2: FlowCount

  1. Requirement: count the total upstream traffic, downstream traffic, and total traffic consumed by each mobile phone number.
    Input data format:
    7	13560436666	120.196.100.99	1116		954			200
    (id, mobile number, network IP, upstream traffic, downstream traffic, network status code)
    
    Output data format:
    13560436666		1116		954			2070
    (mobile number, upstream traffic, downstream traffic, total traffic)
    
  2. Requirements analysis: the Map phase extracts the mobile number (as the key) and the upstream/downstream traffic (wrapped in a FlowBean, as the value) from each line; the Reduce phase sums the traffic for each number.
  3. Write a program:
    1. Write the FlowBean class that holds the traffic statistics:
      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;
      
      import org.apache.hadoop.io.Writable;
      
      //1. Implement Writable interface
      public class FlowBean implements Writable {
      
      	private long upFlow;		// Upstream Flow
      	private long downFlow;		// Downstream Flow
      	private long sumFlow;		// Total Flow
      
      	//2. A no-argument constructor is required; deserialization invokes it via reflection
      	public FlowBean() {
      		super();
      	}
      
      	//3. Implement serialization methods
      	@Override
      	public void write(DataOutput out) throws IOException {
      		out.writeLong(upFlow);
      		out.writeLong(downFlow);
      		out.writeLong(sumFlow);
      	}
      
      	//4. Deserialization methods
      	//5. Deserialization methods must read in the same order as writing serialization methods
      	@Override
      	public void readFields(DataInput in) throws IOException {
      		this.upFlow = in.readLong();
      		this.downFlow = in.readLong();
      		this.sumFlow = in.readLong();
      	}
      
      	//6. To output the serialized object later, implement its toString() method
      	@Override
      	public String toString() {
      		return upFlow + "\t" + downFlow + "\t" + sumFlow;
      	}
      	
      	public void set(long upFlow, long downFlow, long sumFlow) {
      		this.upFlow = upFlow;
      		this.downFlow = downFlow;
      		this.sumFlow = sumFlow;
      	}
      
      	//7. Getters used by the Reducer when summing the per-phone traffic
      	public long getUpFlow() {
      		return upFlow;
      	}
      
      	public long getDownFlow() {
      		return downFlow;
      	}
      
      	public long getSumFlow() {
      		return sumFlow;
      	}
      }
      
    2. Write the Mapper class:
      import java.io.IOException;
      
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      
      import cn.maoritian.bean.FlowBean;
      
      public class FlowCountMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
      	//Four type parameters of Mapper object KEYIN, VALUEIN, KEYOUT, VALUEOUT
      	//	KEYIN: Type of input key: LongWritable
      	//	VALUEIN: Type of input value: Text (line content)
      	//	KEYOUT: Type of output key: Text (mobile number)
      	//	VALUEOUT: Type of output value: FlowBean (traffic statistics)
      
      	Text k = new Text();
      	FlowBean v = new FlowBean();
      
      	@Override
      	protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      		// 1. Get a row
      		String line = value.toString();
      
      		// 2. Cut to get fields
      		String[] fields = line.split("\t");
      		String phoneNum = fields[1]; 		// Phone number
      		long upFlow = Long.parseLong(fields[fields.length - 3]);	// Upstream Flow
      		long downFlow = Long.parseLong(fields[fields.length - 2]);	// Downstream Flow
      
      		// 3. Output
      		k.set(phoneNum);
      		v.set(upFlow, downFlow, upFlow + downFlow);
      		context.write(k, v);
      	}
      }
      
    3. Write the Reducer class:
      import java.io.IOException;
      
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;
      
      import cn.maoritian.bean.FlowBean;
      
      public class FlowCountReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
      	// Four type parameters of Reducer object KEYIN, VALUEIN, KEYOUT, VALUEOUT
      	// KEYIN: Type of input key: Text (mobile number)
      	// VALUEIN: Type of input value: FlowBean (Traffic Statistics)
      	// KEYOUT: Type of output key: Text (mobile number)
      	// VALUEOUT: Type of output value: FlowBean (traffic statistics)
      
      	long sum_upFlow = 0;
      	long sum_downFlow = 0;
      	FlowBean resultBean = new FlowBean();
      
      	@Override
      	protected void reduce(Text key, Iterable<FlowBean> values, Context context)throws IOException, InterruptedException {
      
      		//1. Traversal summation (reset the accumulators for each key first)
      		sum_upFlow = 0;
      		sum_downFlow = 0;
      		for (FlowBean flowBean : values) {
      			sum_upFlow += flowBean.getUpFlow();
      			sum_downFlow += flowBean.getDownFlow();
      		}
      
      		//2. Output
      		resultBean.set(sum_upFlow, sum_downFlow, sum_upFlow+sum_downFlow);
      		context.write(key, resultBean);
      	}
      }
      
    4. Write the Driver class:
      import java.io.IOException;
      
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      
      import cn.maoritian.bean.FlowBean;
      
      public class FlowCountDriver {
      
      	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
      
      		// 1. Get configuration information and encapsulate tasks
      		Configuration configuration = new Configuration();
      		Job job = Job.getInstance(configuration);
      
      		// 2. Set the jar load path to be the path of the driver class
      		job.setJarByClass(FlowCountDriver.class);
      
      		// 3. Set up map and reduce classes
      		job.setMapperClass(FlowCountMapper.class);
      		job.setReducerClass(FlowCountReducer.class);
      
      		// 4. Set map output
      		job.setMapOutputKeyClass(Text.class);
      		job.setMapOutputValueClass(FlowBean.class);
      
      		// 5. Set the final output kv type
      		job.setOutputKeyClass(Text.class);
      		job.setOutputValueClass(FlowBean.class);
      
      		// 6. Set input and output paths
      		FileInputFormat.setInputPaths(job, new Path(args[0]));
      		FileOutputFormat.setOutputPath(job, new Path(args[1]));
      
      		// 7. Submit and Exit
      		boolean result = job.waitForCompletion(true);
      		System.exit(result ? 0 : 1);
      	}
      }
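
  4. The FlowCount program is packaged and submitted the same way as WordCount. Assuming the classes are in the cn.maoritian.mapreduce package as in the WordCount example and the jar is renamed flow.jar (the jar name and HDFS paths below are placeholders), a run looks like:

    hadoop jar flow.jar cn.maoritian.mapreduce.FlowCountDriver /flowinput /flowoutput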
      
