Fully Distributed Hadoop Cluster

Keywords: Java Hadoop ssh xml vim

Cluster environment:

  1. CentOS 6.8: hadoop102, hadoop103, hadoop104
  2. JDK version: jdk1.8.0_144
  3. Hadoop version: Hadoop 2.7.2

First, prepare three machines (hadoop102, hadoop103, hadoop104): disable the firewall on each, assign each a static IP, and configure the hostname-to-IP mapping.
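
A rough sketch of this preparation on CentOS 6 is shown below. The 192.168.1.x addresses are placeholders for illustration only; use your own static IPs, and add the mapping to /etc/hosts on all three machines.

```shell
# Turn the firewall off now and keep it off across reboots (CentOS 6 uses the iptables service)
sudo service iptables stop
sudo chkconfig iptables off

# Hostname-to-IP mapping, appended to /etc/hosts on every machine
# (the addresses below are examples; substitute your own static IPs)
cat <<'EOF' | sudo tee -a /etc/hosts
192.168.1.102 hadoop102
192.168.1.103 hadoop103
192.168.1.104 hadoop104
EOF
```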

Configuring the Cluster

Writing Cluster Distribution Scripts

  1. Create a remote synchronization script named xsync, place it in a new bin directory under the current user's home directory, and add that directory to PATH so the script can be run from any directory (a PATH sketch follows the commands below)
  2. Script implementation
[kocdaniel@hadoop102 ~]$ mkdir bin
[kocdaniel@hadoop102 ~]$ cd bin/
[kocdaniel@hadoop102 bin]$ vim xsync

Write the following script code in the file

#!/bin/bash
#1 Get the number of arguments; exit immediately if none were given
pcount=$#
if((pcount==0)); then
echo no args;
exit;
fi

#2 Get the file name
p1=$1
fname=`basename $p1`
echo fname=$fname

#3 Get the absolute path of the parent directory (-P resolves soft links to the actual physical path)
pdir=`cd -P $(dirname $p1); pwd`
echo pdir=$pdir

#4 Get the current user name
user=`whoami`

#5 Loop over the target hosts and rsync the file/directory to each
for((host=103; host<105; host++)); do
        echo ------------------- hadoop$host --------------
        rsync -rvl $pdir/$fname $user@hadoop$host:$pdir
done
  3. Give the xsync script execute permission, then call it to copy the script to the hadoop103 and hadoop104 nodes
[kocdaniel@hadoop102 bin]$ chmod 777 xsync
[kocdaniel@hadoop102 bin]$ xsync /home/kocdaniel/bin
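
Step 1 also calls for putting this bin directory on PATH. A minimal sketch, assuming bash is the login shell (some distributions already add ~/bin to PATH automatically):

```shell
# Append ~/bin to PATH for future shells and reload the current one
echo 'export PATH=$PATH:$HOME/bin' >> ~/.bashrc
source ~/.bashrc
which xsync   # should now resolve to ~/bin/xsync
```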

Cluster configuration

  1. Cluster deployment planning
|      | hadoop102          | hadoop103                    | hadoop104                   |
| ---- | ------------------ | ---------------------------- | --------------------------- |
| HDFS | NameNode, DataNode | DataNode                     | SecondaryNameNode, DataNode |
| YARN | NodeManager        | ResourceManager, NodeManager | NodeManager                 |

Due to limited computer resources, only three virtual machines are used here. In a real working environment, the cluster should be planned according to actual needs.

  2. Configure the cluster

Switch to the etc/hadoop/ directory under the Hadoop installation directory.

  • Configure core-site.xml
[kocdaniel@hadoop102 hadoop]$ vim core-site.xml
# Write the following in the file
<!-- Specify the address of the NameNode in HDFS -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop102:9000</value>
</property>

<!-- Specify the storage directory for files generated by Hadoop at runtime -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/module/hadoop-2.7.2/data/tmp</value>
</property>
  • HDFS configuration file

    • Configure hadoop-env.sh
    [kocdaniel@hadoop102 hadoop]$ vim hadoop-env.sh
    export JAVA_HOME=/opt/module/jdk1.8.0_144

Note: We have already configured JAVA_HOME in /etc/profile. Why does it need to be configured here as well?

Answer: Because Hadoop runs as a daemon (a daemon is a process that runs in the background and is not controlled by any terminal; from Baidu Encyclopedia). Since it runs in the background with no controlling terminal, it does not read the environment variables set in our login shell, so JAVA_HOME has to be configured separately here.
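
A quick way to observe this, as a rough sketch: the non-interactive ssh sessions that the start scripts use to launch daemons on remote nodes do not source /etc/profile, so JAVA_HOME may come back empty there even though it is set in a login shell:

```shell
# In an interactive login shell /etc/profile has been sourced, so this prints the JDK path
echo $JAVA_HOME

# A non-interactive ssh command (the way remote daemons are launched) does not
# source /etc/profile, so the variable may come back empty here
ssh hadoop103 'echo JAVA_HOME=$JAVA_HOME'
```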

    • Configure hdfs-site.xml
    [kocdaniel@hadoop102 hadoop]$ vim hdfs-site.xml
    # Write the following configuration
    <!-- The replication factor is 3; since the default is already 3, this property could be omitted -->
    <property>
            <name>dfs.replication</name>
            <value>3</value>
    </property>
    
    <!-- Specify the host for the SecondaryNameNode -->
    <property>
          <name>dfs.namenode.secondary.http-address</name>
          <value>hadoop104:50090</value>
    </property>
    • YARN configuration files

      • Configure yarn-env.sh
      [kocdaniel@hadoop102 hadoop]$ vim yarn-env.sh
      export JAVA_HOME=/opt/module/jdk1.8.0_144
      • Configure yarn-site.xml
      [kocdaniel@hadoop102 hadoop]$ vi yarn-site.xml
      # Add the following configuration
      <!-- How the Reducer obtains data -->
      <property>
              <name>yarn.nodemanager.aux-services</name>
              <value>mapreduce_shuffle</value>
      </property>
      
      <!-- Specify the address of the YARN ResourceManager -->
      <property>
              <name>yarn.resourcemanager.hostname</name>
              <value>hadoop103</value>
      </property>
    • MapReduce configuration file

      • Configure mapred-env.sh
      [kocdaniel@hadoop102 hadoop]$ vim mapred-env.sh
      export JAVA_HOME=/opt/module/jdk1.8.0_144
      
      • Configure mapred-site.xml
      # For the first-time configuration, copy mapred-site.xml.template to mapred-site.xml
      [kocdaniel@hadoop102 hadoop]$ cp mapred-site.xml.template mapred-site.xml
      [kocdaniel@hadoop102 hadoop]$ vim mapred-site.xml
      # Add the following configuration to the file
      <!-- Specify that MapReduce runs on YARN -->
      <property>
              <name>mapreduce.framework.name</name>
              <value>yarn</value>
      </property>
      
    3. Synchronize the configured files to the hadoop103 and hadoop104 nodes using the cluster distribution script
    [kocdaniel@hadoop102 hadoop]$ xsync /opt/module/hadoop-2.7.2/
    
    • After the synchronization completes, it is best to check the results on the other nodes to avoid errors, for example:
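
    One possible spot-check is to read a synced file back from the other nodes (a sketch; the paths follow this tutorial's layout):

    ```shell
    ssh hadoop103 "cat /opt/module/hadoop-2.7.2/etc/hadoop/core-site.xml"
    ssh hadoop104 "cat /opt/module/hadoop-2.7.2/etc/hadoop/yarn-site.xml"
    ```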

    Single-Point Startup

    1. If this is the first startup, the NameNode needs to be formatted; otherwise skip this step
    [kocdaniel@hadoop102 hadoop-2.7.2]$ hadoop namenode -format
    
    • Notes on formatting:

      • Only the first startup requires formatting; do not format again afterwards. Otherwise the cluster IDs of the NameNode and DataNodes become inconsistent, and the DataNodes will fail to start.
      • The correct way to reformat:

        • The first format creates a data folder in the Hadoop installation directory, which holds the NameNode's metadata.
        • After the NameNode and DataNodes have been started, a logs folder is also created in the same directory.
        • So before reformatting, delete these two folders first, then format, and finally start the NameNode and DataNodes (see the sketch below).
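
    A rough sketch of that reformatting sequence, assuming the installation directory used in this tutorial and that all daemons have been stopped first:

    ```shell
    cd /opt/module/hadoop-2.7.2
    rm -rf data/ logs/                 # remove old metadata and logs (also on hadoop103/104)
    hadoop namenode -format            # reformat the NameNode
    hadoop-daemon.sh start namenode    # then start the NameNode again
    hadoop-daemon.sh start datanode    # and the DataNode (repeat on every node)
    ```
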
    2. Start the NameNode on hadoop102
    [kocdaniel@hadoop102 hadoop-2.7.2]$ hadoop-daemon.sh start namenode
    [kocdaniel@hadoop102 hadoop-2.7.2]$ jps
    3461 NameNode
    
    3. Start the DataNode on hadoop102, hadoop103, and hadoop104, respectively
    [kocdaniel@hadoop102 hadoop-2.7.2]$ hadoop-daemon.sh start datanode
    [kocdaniel@hadoop102 hadoop-2.7.2]$ jps
    3461 NameNode
    3608 Jps
    3561 DataNode
    [kocdaniel@hadoop103 hadoop-2.7.2]$ hadoop-daemon.sh start datanode
    [kocdaniel@hadoop103 hadoop-2.7.2]$ jps
    3190 DataNode
    3279 Jps
    [kocdaniel@hadoop104 hadoop-2.7.2]$ hadoop-daemon.sh start datanode
    [kocdaniel@hadoop104 hadoop-2.7.2]$ jps
    3237 Jps
    3163 DataNode
    
    4. Visit hadoop102:50070 to view the results
    • However, single-point startup has a problem:

      • Every daemon has to be started on each node one by one; what happens if the cluster grows to 1,000 nodes?

    Configure SSH Passwordless Login

    1. Configure ssh

      • ssh <ip of another node> switches you to that machine, but you have to enter a password each time
    2. Passwordless ssh configuration

      • Principle of passwordless login: the public key generated on the client is appended to ~/.ssh/authorized_keys on the target machine, and at login the server checks that the client holds the matching private key, so no password is required.

    • Generate a private/public key pair on hadoop102, the host where the NameNode is configured

      • Switch to the /home/<username>/.ssh/ directory

        [kocdaniel@hadoop102 .ssh]$ ssh-keygen -t rsa
        
      • Then press Enter three times; two files will be generated: id_rsa (the private key) and id_rsa.pub (the public key)
      • Copy the public key to each target machine that should allow passwordless login
    
     ```shell
     [kocdaniel@hadoop102 .ssh]$ ssh-copy-id hadoop103
     [kocdaniel@hadoop102 .ssh]$ ssh-copy-id hadoop104
     # Note: ssh to hadoop102 itself also asks for a password, so we copy the public key to hadoop102 as well.
     [kocdaniel@hadoop102 .ssh]$ ssh-copy-id hadoop102
     
     ```
    
    • Similarly, perform the same steps on hadoop103, the host where the ResourceManager is configured, before starting the whole cluster (a sketch follows).
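
    A minimal sketch of those same steps on hadoop103:

    ```shell
    [kocdaniel@hadoop103 .ssh]$ ssh-keygen -t rsa
    [kocdaniel@hadoop103 .ssh]$ ssh-copy-id hadoop102
    [kocdaniel@hadoop103 .ssh]$ ssh-copy-id hadoop103
    [kocdaniel@hadoop103 .ssh]$ ssh-copy-id hadoop104
    # Quick check: this should print the remote hostname without prompting for a password
    [kocdaniel@hadoop103 .ssh]$ ssh hadoop104 hostname
    ```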

    Starting the Whole Cluster

    1. Configure slaves

      • Switch to the etc/hadoop/ directory under the Hadoop installation directory
      • Add the following to the slaves file in the directory
      [kocdaniel@hadoop102 hadoop]$ vim slaves
      # Note: there must be no trailing spaces at the end of lines and no blank lines in this file.
      hadoop102
      hadoop103
      hadoop104
      
      • Synchronize the configuration file to all nodes
      [kocdaniel@hadoop102 hadoop]$ xsync slaves
      
    2. Start the cluster

      • Again, if this is the first startup, the NameNode needs to be formatted first
      • Start HDFS
      [kocdaniel@hadoop102 hadoop-2.7.2]$ sbin/start-dfs.sh
      
      # Check that the running daemons match the cluster plan (and the configuration files)
      [kocdaniel@hadoop102 hadoop-2.7.2]$ jps
      4166 NameNode
      4482 Jps
      4263 DataNode
      
      [kocdaniel@hadoop103 hadoop-2.7.2]$ jps
      3218 DataNode
      3288 Jps
      
      [kocdaniel@hadoop104 hadoop-2.7.2]$ jps
      3221 DataNode
      3283 SecondaryNameNode
      3364 Jps
      
      • Start YARN
      # Note: If the NameNode and the ResourceManager are not on the same machine, YARN cannot be started on the NameNode host; start YARN on the machine where the ResourceManager is configured.
      [kocdaniel@hadoop103 hadoop-2.7.2]$ sbin/start-yarn.sh
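
      Per the cluster plan above, jps on hadoop103 should now additionally list ResourceManager and NodeManager, while hadoop102 and hadoop104 each gain a NodeManager; for example:

      ```shell
      [kocdaniel@hadoop103 hadoop-2.7.2]$ jps   # expect ResourceManager and NodeManager in the output
      [kocdaniel@hadoop102 hadoop-2.7.2]$ jps   # expect a NodeManager in addition to the HDFS daemons
      ```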
      
    3. View the cluster status in the web UI (the HDFS NameNode at hadoop102:50070; by default the YARN ResourceManager UI is served on hadoop103 at port 8088)
