The previous post covered the programming model of MapReduce. In this post the blogger walks through the principle of the wordcount program and the details of its implementation and execution, so that we get a general picture of how a MapReduce program works. In fact, the Mapper and Reducer are only two of the components involved in a Hadoop job; the other components (such as the input and output formats) can also be replaced with custom implementations, and when nothing is specified the default components are used, as the sketch below illustrates.
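As a purely illustrative fragment (not part of the wordcount code below, which keeps all the defaults), these Job methods are where such replacements would be plugged in inside a driver:

// Illustrative fragment only: inside a driver, after Job job = Job.getInstance(conf);
// each call replaces one default component of the job.
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);      // how input files are split and read (this is already the default)
job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);   // how results are written (also the default)
job.setPartitionerClass(org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.class);  // how map output keys are distributed to reduce tasks (also the default)
job.setCombinerClass(WordcountReducer.class);  // optional map-side pre-aggregation; not enabled in the wordcount driver below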
1. Implementation of the wordcount program:
WordcountMapper (business logic of the map task)
package com.empire.hadoop.mr.wcdemo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * KEYIN: by default, the starting byte offset of the line of text read by the MR framework; logically a Long,
 *        but Hadoop has its own more compact serialization types, so LongWritable is used instead of Long.
 * VALUEIN: by default, the content of the line of text read by the MR framework; logically a String, so Text is used.
 * KEYOUT: the key of the output data produced by the user-defined logic; here it is the word, so Text.
 * VALUEOUT: the value of the output data produced by the user-defined logic; here it is the word count, so IntWritable.
 *
 * @author
 */
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    /**
     * The business logic of the map phase is written in this custom map() method.
     * The map task calls map() once for each line of input data.
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Convert the Text passed in by the map task to a String first
        String line = value.toString();
        // Split the line into words on spaces
        String[] words = line.split(" ");
        // Emit each word as <word, 1>
        for (String word : words) {
            // Use the word as the key and 1 as the value, so that during shuffle
            // all occurrences of the same word are sent to the same reduce task
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
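For example, given the single input line "hello hello banana", this map() call emits <hello,1>, <hello,1> and <banana,1>; the actual counting is left to the reduce phase.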
WordcountReducer (business logic of the reduce task)
package com.empire.hadoop.mr.wcdemo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * KEYIN, VALUEIN correspond to the KEYOUT, VALUEOUT types of the mapper's output.
 * KEYOUT, VALUEOUT are the output types of the custom reduce logic:
 * KEYOUT is the word, VALUEOUT is the total count.
 *
 * @author
 */
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    /**
     * Example of the grouped input delivered by the framework:
     * <angelababy,1><angelababy,1><angelababy,1><angelababy,1><angelababy,1>
     * <hello,1><hello,1><hello,1><hello,1><hello,1><hello,1>
     * <banana,1><banana,1><banana,1><banana,1><banana,1><banana,1>
     * reduce() is called once per key, i.e. once per group of kv pairs sharing the same word.
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int count = 0;
        /*
         * Iterator<IntWritable> iterator = values.iterator();
         * while (iterator.hasNext()) {
         *     count += iterator.next().get();
         * }
         */
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}
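Continuing the example, for the key "hello" with the values (1, 1, 1), reduce() sums them up and writes <hello, 3>.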
WordcountDriver (the program that submits the job to YARN)
package com.empire.hadoop.mr.wcdemo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * The client of the YARN cluster. It encapsulates the runtime parameters of our MR program,
 * specifies the jar package, and submits the job to YARN.
 *
 * @author
 */
public class WordcountDriver {

    public static void main(String[] args) throws Exception {
        if (args == null || args.length == 0) {
            args = new String[2];
            args[0] = "hdfs://master:9000/wordcount/input/wordcount.txt";
            args[1] = "hdfs://master:9000/wordcount/output8";
        }

        Configuration conf = new Configuration();
        // These settings have no effect here (see the note below)
        // conf.set("HADOOP_USER_NAME", "hadoop");
        // conf.set("dfs.permissions.enabled", "false");
        /*
         * conf.set("mapreduce.framework.name", "yarn");
         * conf.set("yarn.resoucemanager.hostname", "mini1");
         */
        Job job = Job.getInstance(conf);

        /* job.setJar("/home/hadoop/wc.jar"); */
        // Locate the jar of this program by its driver class
        job.setJarByClass(WordcountDriver.class);

        // Specify the Mapper/Reducer classes used by this job
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);

        // Specify the kv types of the mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Specify the kv types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Specify the directory of the job's input files
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Specify the directory for the job's output
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job configuration and the jar containing the classes used by the job to YARN
        /* job.submit(); */
        boolean res = job.waitForCompletion(true);
        // In a linux shell script, an exit code of 0 means success; anything else means failure
        System.exit(res ? 0 : 1);
    }
}
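A note on the settings marked as having no effect: HADOOP_USER_NAME is not a Configuration property, so conf.set("HADOOP_USER_NAME", "hadoop") is simply ignored. The Hadoop client picks the user up from the environment variable or from a JVM system property of the same name. As a minimal sketch (assuming the default simple authentication mode), it could instead be set before the Configuration and Job are created:

// Sketch only: make the Hadoop client act as the "hadoop" user.
// HADOOP_USER_NAME is read from the environment variable or the JVM system
// property of the same name, not from Configuration.
System.setProperty("HADOOP_USER_NAME", "hadoop");
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);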
2. Running the MapReduce program:
(1) Package the program as a jar
(2) Upload the jar to the hadoop cluster and run it
#Upload the jar
Alt+p
lcd d:/
put wordcount_aaron.jar

#Prepare the data files for hadoop to process
cd /home/hadoop/apps/hadoop-2.9.1
hadoop fs -mkdir -p /wordcount/input
hadoop fs -put LICENSE.txt NOTICE.txt /wordcount/input

#Run the wordcount program
hadoop jar wordcount_aaron.jar com.empire.hadoop.mr.wcdemo.WordcountDriver /wordcount/input /wordcount/outputs
Run log:
[hadoop@centos-aaron-h1 ~]$ hadoop jar wordcount_aaron.jar com.empire.hadoop.mr.wcdemo.WordcountDriver /wordcount/input /wordcount/output
18/11/19 22:48:54 INFO client.RMProxy: Connecting to ResourceManager at centos-aaron-h1/192.168.29.144:8032
18/11/19 22:48:55 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/11/19 22:48:55 INFO input.FileInputFormat: Total input files to process : 2
18/11/19 22:48:55 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1280)
        at java.lang.Thread.join(Thread.java:1354)
        at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
        at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
18/11/19 22:48:55 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1280)
        at java.lang.Thread.join(Thread.java:1354)
        at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
        at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
18/11/19 22:48:55 INFO mapreduce.JobSubmitter: number of splits:2
18/11/19 22:48:55 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/11/19 22:48:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1542637441480_0002
18/11/19 22:48:56 INFO impl.YarnClientImpl: Submitted application application_1542637441480_0002
18/11/19 22:48:56 INFO mapreduce.Job: The url to track the job: http://centos-aaron-h1:8088/proxy/application_1542637441480_0002/
18/11/19 22:48:56 INFO mapreduce.Job: Running job: job_1542637441480_0002
18/11/19 22:49:03 INFO mapreduce.Job: Job job_1542637441480_0002 running in uber mode : false
18/11/19 22:49:03 INFO mapreduce.Job: map 0% reduce 0%
18/11/19 22:49:09 INFO mapreduce.Job: map 100% reduce 0%
18/11/19 22:49:14 INFO mapreduce.Job: map 100% reduce 100%
18/11/19 22:49:15 INFO mapreduce.Job: Job job_1542637441480_0002 completed successfully
18/11/19 22:49:15 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=241219
                FILE: Number of bytes written=1074952
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=122364
                HDFS: Number of bytes written=35348
                HDFS: Number of read operations=9
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=7588
                Total time spent by all reduces in occupied slots (ms)=3742
                Total time spent by all map tasks (ms)=7588
                Total time spent by all reduce tasks (ms)=3742
                Total vcore-milliseconds taken by all map tasks=7588
                Total vcore-milliseconds taken by all reduce tasks=3742
                Total megabyte-milliseconds taken by all map tasks=7770112
                Total megabyte-milliseconds taken by all reduce tasks=3831808
        Map-Reduce Framework
                Map input records=2430
                Map output records=19848
                Map output bytes=201516
                Map output materialized bytes=241225
                Input split bytes=239
                Combine input records=0
                Combine output records=0
                Reduce input groups=2794
                Reduce shuffle bytes=241225
                Reduce input records=19848
                Reduce output records=2794
                Spilled Records=39696
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=332
                CPU time spent (ms)=2830
                Physical memory (bytes) snapshot=557314048
                Virtual memory (bytes) snapshot=2538102784
                Total committed heap usage (bytes)=259411968
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=122125
        File Output Format Counters
                Bytes Written=35348
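The log above also warns: "Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this." The warning is harmless for this example, but for reference, a minimal sketch of a Tool-based driver is shown below; WordcountToolDriver is a hypothetical name and is not the class used in this post.

package com.empire.hadoop.mr.wcdemo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical sketch: a driver that implements Tool, so that generic options
// such as -D key=value are parsed by ToolRunner instead of being ignored.
public class WordcountToolDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setJarByClass(WordcountToolDriver.class);
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options before passing the remaining args to run()
        System.exit(ToolRunner.run(new Configuration(), new WordcountToolDriver(), args));
    }
}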
Run results:
#View the processing result file
hadoop fs -ls /wordcount/output
hadoop fs -cat /wordcount/output/part-r-00000 | more
Problem solving:
18/11/19 22:48:55 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1280)
        at java.lang.Thread.join(Thread.java:1354)
        at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
        at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
The warning above appeared because we had not created the HDFS user home directory recommended by the official documentation. It is not a big problem and did not affect normal use for the blogger.
Solution:
#Create the directory
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put NOTICE.txt LICENSE.txt /user/hadoop
That is all the content of this blog post. If you found it useful, please like it. If you are interested in the blogger's other big data articles or in the blogger himself, please follow the blog; you are welcome to get in touch and exchange ideas at any time.