Hadoop configuration and WordCount

Keywords: Java Hadoop JDK ssh

Most of the reference blogs are based on Hadoop 2.x and older Java versions. The configuration process looks simple to write up, and reading other people's posts it feels like the same few steps everywhere. In practice, though, plenty of problems came up: the DataNode won't start, the web page doesn't display properly, a DataNode dies inexplicably, the ResourceManager won't start, the NodeManager won't start, the MapReduce job can't connect to the slaves, and so on. It took a lot of time reading blogs and log files, so I'm recording the process here.

I installed four Linux systems as nodes in virtual machines. They all need the same environment, so I configured one first and then cloned it to the other three using the virtual machine software.

Environment:

  • macOS, Parallels Desktop
  • Ubuntu 16.04
  • JDK 1.8.0
  • Hadoop 3.2.0

Java environment configuration

Download the latest JDK archive from the Oracle official website, copy it to the target installation directory, and extract it:

sudo tar -zxvf jdk-12_linux-x64_bin.tar.gz
sudo rm jdk-12_linux-x64_bin.tar.gz

Then configure the environment variables. They can go in ~/.bashrc or /etc/profile: ~/.bashrc is in the user's home directory and only affects the current user, while /etc/profile sets environment variables for all users.

vim /etc/profile

Add the JDK environment variables at the end:

JAVA_HOME=/usr/lib/jdk-12
CLASSPATH=.:$JAVA_HOME/lib/tools.jar
PATH=$JAVA_HOME/bin:$PATH
export JAVA_HOME CLASSPATH PATH

Then run source /etc/profile to make the changes take effect, and check the configuration with java -version.
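
Concretely:

source /etc/profile
java -version   # prints the installed JDK version if everything is set up correctly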

(Later there was a problem starting the ResourceManager, so I switched to JDK 8; the installation process is the same.)

SSH password-free login

Next, install Hadoop (the process is described in the next section). After installation, clone three more virtual machines with the same environment. I use Parallels, which I find more stable and easier to use than the alternatives.

Then there's the distributed part. A fully peer-to-peer setup is hard to achieve, so Hadoop still uses a master to centrally manage the data nodes. The master does not store data itself; the data lives on the data nodes, which I named slave1, slave2 and slave3, and the virtual machines' network is bridged. The master therefore needs to be able to log in to the slaves without a password. First add the IP addresses of the nodes (a static IP can be configured so the addresses don't change; see the sketch after the ping test below):

vim /etc/hosts
192.168.31.26   master
192.168.31.136  slave1
192.168.31.47   slave2
192.168.31.122  slave3

vim /etc/hostname
master # on the slaves, set slave1, slave2, slave3 respectively

ping slave1 # test
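
If you do want a static IP, one way on Ubuntu 16.04 is to edit /etc/network/interfaces. This is only a sketch: the interface name, addresses and gateway below are placeholders for my bridged network and will differ on yours.

# /etc/network/interfaces
auto enp0s5
iface enp0s5 inet static
    address 192.168.31.26
    netmask 255.255.255.0
    gateway 192.168.31.1
    dns-nameservers 192.168.31.1

Restart networking (or simply reboot) for the change to take effect.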

Install ssh. It downloads very slowly from the official Ubuntu source; I tried switching to the Tsinghua and Aliyun mirrors in China, but neither had the package (perhaps a version mismatch or something similar). If you can't be bothered to sort that out, just wait patiently.

sudo apt-get install ssh

Then generate the public/private key pair:

ssh-keygen -t rsa

The default path is .ssh in the user's home directory; just press Enter all the way through to accept the defaults.

To allow each host to connect to itself without a password:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Then, so that the master can connect to the slaves without a password, append the master's public key to the authorized_keys file of each slave.
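
For example, assuming the user account is named hadoop on every node (an assumption about your setup), ssh-copy-id does the appending for you; run this on the master:

ssh-copy-id hadoop@slave1   # appends the master's public key to slave1's ~/.ssh/authorized_keys
ssh-copy-id hadoop@slave2
ssh-copy-id hadoop@slave3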


Then test if you can connect properly:

ssh slave1

Install and configure Hadoop

Download Hadoop 3.2 from the official website, extract it to /usr/lib/, and give the ordinary hadoop user ownership of the folder:

cd /usr/lib
sudo tar -xzvf hadoop-3.2.0.tar.gz
sudo chown -R hadoop:hadoop hadoop-3.2.0 # give the ordinary hadoop user ownership of the extracted folder
sudo rm -rf hadoop-3.2.0.tar.gz

Add the environment variables (again at the end of /etc/profile or ~/.bashrc):

HADOOP_HOME=/usr/lib/hadoop-3.2.0
PATH=$HADOOP_HOME/bin:$PATH
export HADOOP_HOME PATH
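
After sourcing the file again, check that the hadoop command is on the PATH:

source /etc/profile
hadoop version   # should report Hadoop 3.2.0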

Next comes the most important part: configuring Hadoop itself. The following files under $HADOOP_HOME/etc/hadoop/ need to be edited.

hadoop-env.sh

export JAVA_HOME=/usr/lib/jdk1.8.0_201

core-site.xml

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/lib/hadoop-3.2.0/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/usr/lib/hadoop-3.2.0/hdfs/name</value>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/usr/lib/hadoop-3.2.0/hdfs/data</value>
    </property>
</configuration>

yarn-site.xml

<configuration>
    <property>
      <name>yarn.resourcemanager.address</name>
      <value>master:8032</value>
    </property>
    <property>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>master:8030</value>
    </property>
    <property>
      <name>yarn.resourcemanager.resource-tracker.address</name>
      <value>master:8031</value>
    </property>
    <property>
      <name>yarn.resourcemanager.admin.address</name>
      <value>master:8033</value>
    </property>
    <property>
      <name>yarn.resourcemanager.webapp.address</name>
      <value>master:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>

mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapred.job.tracker</name>
        <value>master:49001</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/usr/lib/hadoop-3.2.0/var</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>
</configuration>

workers

slave1
slave2
slave3

That completes the configuration; the whole folder then has to be copied to the other three hosts.
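
One possible way to do the copy, as a sketch: it assumes the same hadoop user on every node, and that (since the machines were cloned with Hadoop already installed) copying just the etc/hadoop configuration directory is enough.

for host in slave1 slave2 slave3; do
    scp -r /usr/lib/hadoop-3.2.0/etc/hadoop/* "$host":/usr/lib/hadoop-3.2.0/etc/hadoop/
done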

Start-up

Format the NameNode:

hdfs namenode -format # assumes $HADOOP_HOME/bin has been added to the PATH

If the INFO output reports that the NameNode has been successfully formatted, this step succeeded. Then run the start script:

./sbin/start-all.sh # in old Hadoop 1.x releases the start scripts were under ./bin/

Check the Java processes with jps: the master should show NameNode, SecondaryNameNode and ResourceManager, and each slave should show DataNode and NodeManager. Common problems at this point include missing data nodes, permission errors, the ResourceManager failing to start, and so on. Some of the causes are described below; most of them are configuration problems, and you can find the reason by reading the log files.

You can view the status of the cluster in a browser at http://master:9870.

WordCount sample program

WordCount is the "hello world" of learning Hadoop. You can find the source code online or write it yourself; I use the sample program under $HADOOP_HOME/share/hadoop/mapreduce/.

First, upload the input files to HDFS. I wrote two txt files containing the words "hadoop", "hello" and "world", then ran the sample program:

hdfs dfs -mkdir /in
hdfs dfs -put ~/Desktop/file*.txt /in
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar wordcount /in /out

Here you can see that the job runs as a map phase and a reduce phase; more precisely, MapReduce consists of map, shuffle and reduce. First the large task is split across the nodes and computed separately; then the shuffle distributes the intermediate keys to nodes for aggregation according to certain rules; finally the reduce phase merges the results. View the result:

hdfs dfs -cat /out/part-r-00000

At this point the Hadoop cluster environment is installed and working correctly. The next step is to modify the WordCount code and experiment with it yourself.
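
If you do write your own version, the compile-and-run steps from the official MapReduce tutorial look roughly like this (assuming JDK 8 and a class named WordCount in the default package; adjust names and paths to your own code):

export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java   # compile against the Hadoop classpath
jar cf wc.jar WordCount*.class
hadoop jar wc.jar WordCount /in /out2            # the output directory must not already exist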

Some problems encountered

  • When copying the configured folder I accidentally copied the wrong one, a folder left over from an earlier failed attempt. As a result the DataNode failed to start, with no error message anywhere, and Googling didn't solve it for a long time. Eventually I read the DataNode's log file and found the problem: it was core-site.xml. After fixing it, reformatting and restarting succeeded.

    The moral of this sad story: when something goes wrong, look at the error location in the log file first. Everyone's mistake is different, and Google is not omnipotent.

  • No ResourceManager and no NodeManager: checking the log showed ClassNotFound errors for javax.XXXXXXX classes. It turns out that from Java 9 onward some of the javax APIs are disabled by default. Referring to other blogs, there are two ways to fix this:

    1. Add the following to yarn-env.sh (I tried this, but since I don't know Java well I gave up digging into it):

      export YARN_RESOURCEMANAGER_OPTS="--add-modules=ALL-SYSTEM"
      export YARN_NODEMANAGER_OPTS="--add-modules=ALL-SYSTEM"
    2. Switch to JDK 8.
  • The first time I ran the WordCount program, I passed the entire $HADOOP_HOME/etc/hadoop folder as input and it failed. According to the log, the cause was insufficient memory; each of my virtual machines has only 1 GB of RAM. Clearly a setup like this is only good for getting familiar with a distributed Hadoop environment, not for solving real problems.

Posted by dieselmachine on Sat, 18 May 2019 09:45:22 -0700