Building a Hadoop Cluster

Keywords: Big Data, Hadoop, XML, vim, NodeManager


1. Basic information

  • Version: 2.7.3
  • Machines: three (test, test2, test3)
  • Account: hadoop
  • Source path: /opt/software/hadoop-2.7.3.tar.gz
  • Target path: /opt/hadoop -> /opt/hadoop-2.7.3
  • Dependency: ZooKeeper

2. Installation process

1. Switch to the hadoop account and extract Hadoop to the target installation directory with the tar -zxvf command:

[root@test opt]# su hadoop
[hadoop@test opt]$ cd /opt/software
[hadoop@test software]$  tar -zxvf hadoop-${version}.tar.gz  -C /opt
[hadoop@test software]$ cd /opt
[hadoop@test opt]$ ln -s /opt/hadoop-${version} /opt/hadoop
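
An optional sanity check that the symlink points at the unpacked release (not required for the installation):

[hadoop@test opt]$ ls -ld /opt/hadoop /opt/hadoop-${version}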

2. Create tmpdir directory:

[hadoop@test opt]$ cd  /opt/hadoop
[hadoop@test hadoop]$ mkdir -p tmpdir

3. Configure the hadoop-env.sh file:

[hadoop@test hadoop]$ cd /opt/hadoop/etc/hadoop/
[hadoop@test hadoop]$ mkdir -p /opt/hadoop/pids
[hadoop@test hadoop]$ vim hadoop-env.sh

Add the following configuration to the hadoop-env.sh file:

export JAVA_HOME=/opt/java
export HADOOP_PID_DIR=/opt/hadoop/pids
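
The JAVA_HOME above assumes a JDK is already installed at /opt/java (adjust the path to your environment); a quick check:

[hadoop@test hadoop]$ /opt/java/bin/java -version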

4. Configure the mapred-env.sh file:

[hadoop@test hadoop]$ cd /opt/hadoop/etc/hadoop/
[hadoop@test hadoop]$ vim mapred-env.sh

Add the following configuration to the mapred-env.sh file:

export JAVA_HOME=/opt/java

5. Configure the core-site.xml file:

[hadoop@test hadoop]$ cd /opt/hadoop/etc/hadoop/
[hadoop@test hadoop]$  vim core-site.xml

Add the following configuration to the core-site.xml file:

<configuration>
    <property>
        <!-- Temporary working directory used by the NameNode -->
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/tmpdir</value>
    </property>
    <property>
        <!-- The entry point of HDFS: the host and port on which the NameNode listens -->
        <name>fs.defaultFS</name>
        <value>hdfs://test:8020</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    <property>
        <name>fs.trash.interval</name>
        <value>1440</value>
    </property>
</configuration>
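
After saving the file, the value Hadoop will actually read can be confirmed with hdfs getconf (the cluster does not need to be running); the full path is used here because PATH is only configured in step 9:

[hadoop@test hadoop]$ /opt/hadoop/bin/hdfs getconf -confKey fs.defaultFS
hdfs://test:8020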

6. Configure the hdfs-site.xml file:

If Ranger has not been installed yet, the following block must be commented out (or simply omitted) in the file:

<property>
    <name>dfs.namenode.inode.attributes.provider.class</name>
    <value>org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer</value>
</property>
[hadoop@test hadoop]$ cd /opt/hadoop/etc/hadoop/
[hadoop@test hadoop]$ vim hdfs-site.xml

Add the following configuration to the hdfs-site.xml file:

<configuration>
    <property>
        <!-- Number of replicas; generally less than or equal to the number of DataNodes -->
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop/data/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/hadoop/data/datanode</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.secondary.http.address</name>
        <value>test:50090</value>
    </property>
</configuration>
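
Hadoop creates the name and data directories itself on the first format/start, but pre-creating them as the hadoop user makes permission problems easier to spot (optional; the same applies on test2 and test3 after the copy in step 11):

[hadoop@test hadoop]$ mkdir -p /opt/hadoop/data/namenode /opt/hadoop/data/datanode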

7. Configure the mapred-site.xml file:

[hadoop@test hadoop]$ cd /opt/hadoop/etc/hadoop/
[hadoop@test hadoop]$ vim mapred-site.xml
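
Note: a stock Apache 2.7.3 tarball usually ships only mapred-site.xml.template; if the file opened above is missing or empty, copy the template first and reopen it:

[hadoop@test hadoop]$ cp mapred-site.xml.template mapred-site.xml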

Add the following configuration to the mapred-site.xml file:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>test:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>test:19888</value>
    </property>
</configuration>

8. Configure the yarn-site.xml file:

[hadoop@test hadoop]$ cd /opt/hadoop/etc/hadoop/
[hadoop@test hadoop]$ vim yarn-site.xml

Add the following configuration to the yarn-site.xml file:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>test:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>test:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>test:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>test:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>test:8088</value>
    </property>
</configuration>
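
XML typos in these files are a common cause of startup failures. If xmllint happens to be installed on the machine (it is not part of Hadoop), the four configuration files can be checked quickly:

[hadoop@test hadoop]$ cd /opt/hadoop/etc/hadoop
[hadoop@test hadoop]$ xmllint --noout core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml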

9. Configure the environment variables needed to run Hadoop:

[hadoop@test hadoop]$ vim /etc/profile

Add the following lines to /etc/profile:

export HADOOP_HOME=/opt/hadoop
export PATH=$HADOOP_HOME/bin:$PATH

After saving the file, execute source /etc/profile to make the configuration take effect:

[hadoop@test hadoop]$ source /etc/profile
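
A quick check that the variables are picked up:

[hadoop@test hadoop]$ which hadoop
/opt/hadoop/bin/hadoop
[hadoop@test hadoop]$ hadoop version | head -1
Hadoop 2.7.3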

10. Modify the slaves file:

[hadoop@test hadoop]$ cd /opt/hadoop/etc/hadoop
[hadoop@test hadoop]$ vim slaves

Add the hostnames of the DataNode machines to the slaves file, one per line:

test2
test3

11. Copy hadoop-2.7.3 from test to the hadoop@test2 and hadoop@test3 machines, create the /opt/hadoop symlink on each of them, and repeat the environment-variable configuration from step 9 on both machines:

[hadoop@test hadoop]$ scp -r /opt/hadoop-${version} hadoop@test2:/opt/
[hadoop@test2 ~]$ ln -s /opt/hadoop-${version} /opt/hadoop
[hadoop@test hadoop]$ scp -r /opt/hadoop-${version} hadoop@test3:/opt/
[hadoop@test3 ~]$ ln -s /opt/hadoop-${version} /opt/hadoop
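
start-all.sh starts the remote daemons over SSH, so passwordless SSH from test to test2, test3 (and to test itself) must work for the hadoop user. If it is not configured yet, a typical sketch is:

[hadoop@test ~]$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
[hadoop@test ~]$ ssh-copy-id hadoop@test
[hadoop@test ~]$ ssh-copy-id hadoop@test2
[hadoop@test ~]$ ssh-copy-id hadoop@test3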

12. Format the NameNode (formatting is only needed before the first startup!), then start Hadoop and the job history service:

# Format namenode, only the first start needs to be formatted!!
[hadoop@test hadoop]$ hadoop namenode -format
# Start the cluster
[hadoop@test hadoop]$ ${HADOOP_HOME}/sbin/start-all.sh
[hadoop@test hadoop]$ ${HADOOP_HOME}/sbin/mr-jobhistory-daemon.sh start historyserver

start-all.sh covers the two modules, DFS and YARN: it simply calls start-dfs.sh and start-yarn.sh, so DFS and YARN can also be started separately.
Note: If a DataNode does not start, check whether there is stale data in tmpdir; delete that directory on this machine and on the other two machines, then format and start again.
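
The overall HDFS state (live DataNodes, capacity) can also be checked from the command line once the cluster is up:

[hadoop@test hadoop]$ hdfs dfsadmin -report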

13. Check the services on each machine by running jps on test, test2 and test3:

[hadoop@test ~]$ jps
24429 Jps
22898 ResourceManager
24383 JobHistoryServer
22722 SecondaryNameNode
22488 NameNode
[hadoop@test2 ~]$ jps
7650 DataNode
7788 NodeManager
8018 Jps
[hadoop@test3 ~]$ jps
28407 Jps
28038 DataNode
28178 NodeManager

If all three machines show the processes above, the Hadoop cluster services are running normally.

Access the Hadoop web UI by entering the following address in a browser: http://172.24.5.173:8088
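
If no browser is at hand, the web endpoints can be probed with curl; both should return HTTP 200 (8088 is the ResourceManager UI configured above, and 50070 is the default NameNode UI port in Hadoop 2.x):

[hadoop@test ~]$ curl -s -o /dev/null -w "%{http_code}\n" http://172.24.5.173:8088/cluster
[hadoop@test ~]$ curl -s -o /dev/null -w "%{http_code}\n" http://172.24.5.173:50070/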

Run a simple MapReduce job to verify that the cluster was installed successfully:

[hadoop@test ~]$ cd /opt/hadoop/share/hadoop/mapreduce
[hadoop@test mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.7.3.jar pi 2 4
Number of Maps  = 2
Samples per Map = 4
Wrote input for Map #0
Wrote input for Map #1
Starting Job
17/04/06 09:36:47 INFO client.RMProxy: Connecting to ResourceManager at test/172.24.5.173:8032
17/04/06 09:36:47 INFO input.FileInputFormat: Total input paths to process : 2
17/04/06 09:36:48 INFO mapreduce.JobSubmitter: number of splits:2
17/04/06 09:36:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1491470782060_0001
17/04/06 09:36:48 INFO impl.YarnClientImpl: Submitted application application_1491470782060_0001
17/04/06 09:36:48 INFO mapreduce.Job: The url to track the job: http://test:8088/proxy/application_1491470782060_0001/
17/04/06 09:36:48 INFO mapreduce.Job: Running job: job_1491470782060_0001
17/04/06 09:36:56 INFO mapreduce.Job: Job job_1491470782060_0001 running in uber mode : false
17/04/06 09:36:56 INFO mapreduce.Job:  map 0% reduce 0%
17/04/06 09:37:00 INFO mapreduce.Job:  map 50% reduce 0%
17/04/06 09:37:02 INFO mapreduce.Job:  map 100% reduce 0%
17/04/06 09:37:08 INFO mapreduce.Job:  map 100% reduce 100%
17/04/06 09:37:08 INFO mapreduce.Job: Job job_1491470782060_0001 completed successfully
17/04/06 09:37:08 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=50
        FILE: Number of bytes written=357588
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=554
        HDFS: Number of bytes written=215
        HDFS: Number of read operations=11
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=3
    Job Counters
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=6118
        Total time spent by all reduces in occupied slots (ms)=4004
        Total time spent by all map tasks (ms)=6118
        Total time spent by all reduce tasks (ms)=4004
        Total vcore-milliseconds taken by all map tasks=6118
        Total vcore-milliseconds taken by all reduce tasks=4004
        Total megabyte-milliseconds taken by all map tasks=6264832
        Total megabyte-milliseconds taken by all reduce tasks=4100096
    Map-Reduce Framework
        Map input records=2
        Map output records=4
        Map output bytes=36
        Map output materialized bytes=56
        Input split bytes=318
        Combine input records=0
        Combine output records=0
        Reduce input groups=2
        Reduce shuffle bytes=56
        Reduce input records=4
        Reduce output records=0
        Spilled Records=8
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=213
        CPU time spent (ms)=2340
        Physical memory (bytes) snapshot=713646080
        Virtual memory (bytes) snapshot=6332133376
        Total committed heap usage (bytes)=546308096
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=236
    File Output Format Counters
        Bytes Written=97
Job Finished in 20.744 seconds
Estimated value of Pi is 3.50000000000000000000

Q&A

Q: Why does stop-all.sh sometimes fail to stop the Hadoop cluster?
A: By default the Hadoop daemon PID files are written to /tmp, which is cleaned up periodically; once the PID files are gone, the stop scripts can no longer find the processes. Setting HADOOP_PID_DIR (as in step 3) avoids this.
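
If the PID files are already gone, the daemons can still be stopped by hand using the PIDs that jps reports (a rough sketch; check what jps shows before killing anything):

[hadoop@test ~]$ jps              # note the PID of each Hadoop daemon
[hadoop@test ~]$ kill 22488       # example: stop the NameNode whose PID jps reported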

Q: The NameNode fails to start.
A: The hostname in the fs.defaultFS value in core-site.xml must not contain underscores!

Core elements of Hadoop

  • NameNode
    • Stores the HDFS metadata
  • NodeManager

      1. Manages the computing resources on a single node
      2. Maintains communication with the ResourceManager and the ApplicationMaster
      3. Manages the life cycle of containers and monitors each container's resource usage (memory, CPU), tracks node health, and manages logs

Posted by mohamdally on Thu, 31 Jan 2019 23:15:16 -0800