The road to learning Hadoop: completing WordCount with a MapReduce program

Keywords: Big Data, Spark, Hadoop, Apache, Java

Test text data used by the program:

Dear River
Dear River Bear Spark 
Car Dear Car Bear Car
Dear Car River Car 
Spark Spark Dear Spark 

1 Main classes

(1) Mapper class

First, the code of the custom Mapper class:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // value holds one line of text, e.g. "Dear River Bear Spark";
        // the sample data is space-separated, so split on spaces
        String[] words = value.toString().split(" ");
        for (String word : words) {
            // Emit (word, 1) once per word as an intermediate result
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

This Mapper class is a generic type with four type parameters, which specify the input key, input value, output key and output value types of the map() function: LongWritable is the input key type, Text the input value type, Text the output key type and IntWritable the output value type.
String[] words = value.toString().split(" "); splits a line of input into words; the sample data is space-separated, so for the line "Dear River Bear Spark" the words array contains Dear, River, Bear and Spark.
The input key is a long integer offset, used to locate the current line and the next line in the file. The input value is one line of text, e.g. "Dear River Bear Spark"; the output key is a single word, e.g. "Bear", and the output value is the integer 1.
Rather than using Java's built-in types directly, Hadoop provides its own set of basic types, optimized for network serialization, in the org.apache.hadoop.io package. Here we use LongWritable (comparable to Java's Long), Text (comparable to Java's String) and IntWritable (comparable to Java's Integer).
The parameters of the map() method are the input key and the input value. In this program the input key (LongWritable key) is the byte offset of the line, and the input value (Text value) is the line itself, e.g. "Dear Car Bear Car". We first convert the Text value containing one line of input into a Java String, and then split it into words. The map() method also receives a Context instance used to write the output.
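
To make the correspondence between the Writable wrappers and plain Java types concrete, here is a minimal standalone sketch (not part of the job itself; the class name WritableDemo is just for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        // Each Writable wraps a plain Java value and adds Hadoop's serialization
        LongWritable offset = new LongWritable(0L); // plays the role of a Long
        Text word = new Text("Dear");               // plays the role of a String
        IntWritable one = new IntWritable(1);       // plays the role of an Integer

        System.out.println(offset.get());    // 0
        System.out.println(word.toString()); // Dear
        System.out.println(one.get());       // 1
    }
}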

(2) Reducer class

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    /*
        Intermediate pairs fetched from the map side, e.g.:
        (River, 1)
        (River, 1)
        (River, 1)
        (Spark, 1)
        (Spark, 1)
        (Spark, 1)
        (Spark, 1)

        After grouping by key, reduce() is called once per key:
        key: River
        value: List(1, 1, 1)
        key: Spark
        value: List(1, 1, 1, 1)
    */
    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum)); // Emit the final count for this key
    }
}

The Reduce task first fetches its data from the Map side according to the partition number, e.g.:
(River, 1)
(River, 1)
(River, 1)
(Spark, 1)
(Spark, 1)
(Spark, 1)
(Spark, 1)
After shuffling and grouping by key, the data looks like this:
key: River, value: List(1, 1, 1)
key: Spark, value: List(1, 1, 1, 1)
So the Iterable<IntWritable> values parameter of the reduce() function receives the value lists List(1, 1, 1) and List(1, 1, 1, 1).
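
The summation that reduce() performs can be illustrated outside the framework with a plain-Java sketch; the hard-coded list below simply stands in for the grouped values that Hadoop would pass in for the key "Spark":

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.IntWritable;

public class ReduceSumDemo {
    public static void main(String[] args) {
        // Stand-in for the grouped values of key "Spark": List(1, 1, 1, 1)
        List<IntWritable> values = Arrays.asList(
                new IntWritable(1), new IntWritable(1),
                new IntWritable(1), new IntWritable(1));

        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();   // same accumulation as in WordCountReduce
        }
        System.out.println("Spark\t" + sum); // prints: Spark	4
    }
}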

(3) Main function

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class WordCountMain {
    //If the MR program is executed locally in IDEA, you need to change the mapreduce.framework.name value in mapred-site.xml to local
    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        if (args == null || args.length != 2) {
            System.out.println("Please specify the input and output paths!");
            System.exit(0);
        }
        //System.setProperty("HADOOP_USER_NAME","hadoop2.7");
        Configuration configuration = new Configuration();
        //configuration.set("mapreduce.job.jar","/home/bruce/project/kkbhdp01/target/com.kaikeba.hadoop-1.0-SNAPSHOT.jar");
        //Call getInstance method to generate job instance
        Job job = Job.getInstance(configuration, WordCountMain.class.getSimpleName());
        // Locate the jar to ship with the job via the driver class
        job.setJarByClass(WordCountMain.class);

        // Set the input / output format through the job
        // The default input format of MR is TextInputFormat, so the next two lines can remain commented out
        // job.setInputFormatClass(TextInputFormat.class);
        // job.setOutputFormatClass(TextOutputFormat.class);
        // Set input / output path
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Set the class to process the Map/Reduce phase
        job.setMapperClass(WordCountMap.class);
        // Combining on the map side reduces the amount of data shuffled over the network
        job.setCombinerClass(WordCountReduce.class);
        job.setReducerClass(WordCountReduce.class);

        // If the map and reduce output key/value types are the same, setting the reduce output types is enough;
        // otherwise the map output key/value types must be set separately, e.g.:
        // job.setMapOutputKeyClass(Text.class);
        // job.setMapOutputValueClass(IntWritable.class);

        // Set the type of key/value of the final output of reduce task
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Submit the job and wait for it to finish
        job.waitForCompletion(true);

    }
}

2 Running locally

First, edit the mapred-site.xml configuration file and set the value of mapreduce.framework.name to local.
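The relevant property in mapred-site.xml would then look roughly like this (a minimal sketch showing only this one property):

<property>
    <name>mapreduce.framework.name</name>
    <value>local</value>
</property>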

Then run WordCountMain locally (for example from IDEA, passing the input and output paths as the two program arguments):

View results:
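
Assuming the sample text above is used as input, the part-r-00000 file in the output directory should contain the following counts (word and count separated by a tab, keys in sorted order):

Bear	2
Car	5
Dear	5
River	3
Spark	4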

3 Running on the cluster

Method 1:

First, package the project.
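
For a standard Maven project this is typically done with (assuming Maven is installed):

mvn clean package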

Change mapreduce.framework.name in the configuration file back to yarn mode.
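
That is, the same property in mapred-site.xml switched back to yarn (a sketch):

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>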

Set the location of the local jar package:

 Configuration configuration = new Configuration();
 configuration.set("mapreduce.job.jar","C:\\Users\\tanglei1\\IdeaProjects\\Hadooptang\\target");


Enable cross-platform remote submission:

configuration.set("mapreduce.app-submission.cross-platform","true");


Modify input parameters:

Result of the run:

Method 2:

Package the Maven project and run the MR program on the server with the following command:

hadoop jar com.kaikeba.hadoop-1.0-SNAPSHOT.jar com.kaikeba.hadoop.wordcount.WordCountMain /tttt.txt /wordcount11
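
The output can then be inspected on HDFS, for example (using the output path from the command above):

hdfs dfs -cat /wordcount11/part-r-00000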
