Big data foundation

Keywords: Big Data, Hadoop

Big data concept

Big data refers to data sets that cannot be captured, managed and processed by conventional software tools within an acceptable time frame. It is a massive, fast-growing and diversified information asset that requires new processing modes to deliver stronger decision-making power, insight and discovery, and process optimization.
Big data mainly solves the problems of collecting, storing, analyzing and computing massive amounts of data.

By analyzing and mining massive data, we can find the underlying patterns in the data and so create value for enterprises or countries.

Big data features (4V)

Volume (large)

The hard disk of a typical personal computer holds data on the order of terabytes (TB), while the data volume of some large enterprises approaches the exabyte (EB) scale.


Velocity (high speed)

Example: on Double 11 (Singles' Day), the one-second trading volume reportedly exceeded 10 billion.


Variety (diversity)

This diversity of types also divides data into structured and unstructured data. Compared with the structured, database- or text-based data that was easy to store in the past, there is more and more unstructured data, including network logs, audio, video, pictures and geographic location information. These many types of data place higher demands on data processing capability.

Value (low value density)

The value density is inversely proportional to the total amount of data


What is Hadoop

  • Hadoop is a distributed system infrastructure developed by the Apache foundation.
  • It mainly solves the problems of massive data storage and massive data analysis and calculation.
  • In a broad sense, Hadoop usually refers to a broader concept: the Hadoop ecosystem.

Three major distributions of Hadoop

Hadoop has three major distributions: Apache, Cloudera and Hortonworks.

Apache (2006): the original (most basic) version, and the best one for getting started.
Cloudera (2008): integrates many big data frameworks internally; its product is CDH.
Hortonworks (2011): has good documentation; its product is HDP.
Hortonworks has since been acquired by Cloudera, which launched a new brand, CDP.

Hadoop advantages

  • High reliability: Hadoop maintains multiple copies of the data at the bottom layer, so the failure of a single computing element or storage unit does not cause data loss.
  • High scalability: task data is distributed across the cluster, which can easily be expanded to thousands of nodes.
  • Efficiency: following the MapReduce idea, Hadoop works in parallel to speed up task processing.
  • High fault tolerance: failed tasks are automatically reassigned.

Hadoop composition

In the Hadoop 1.x era, MapReduce handled both business logic computation and resource scheduling, so the two were tightly coupled.
In the Hadoop 2.x era, YARN was added. YARN is responsible only for resource scheduling, and MapReduce only for computation.
Hadoop 3.x has no change in composition.


Hadoop Distributed File System (HDFS for short) is a distributed file system.

HDFS Architecture Overview


NameNode (nn)

Stores file metadata, such as the file name, the directory structure, file attributes (creation time, number of replicas, permissions), the block list of each file, and the DataNodes where each block is located.


DataNode (dn)

Stores the file block data and the checksums of the block data in the local file system.

Secondary NameNode(2nn)

Backs up the NameNode metadata at regular intervals.


Data block: to store large data sets across multiple nodes, HDFS splits large files into data blocks. In Hadoop 2.x, the default block size is 128 MB.
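To make the block splitting concrete, the sketch below computes how many blocks a file occupies using ceiling division. The function name hdfs_block_count is a hypothetical helper written for this illustration, not part of Hadoop; the 128 MB default comes from the text above.

```shell
#!/bin/sh
# Number of HDFS blocks needed for a file (ceiling division).
# $1: file size in MB
# $2: block size in MB (optional; defaults to 128, the Hadoop 2.x default)
hdfs_block_count() {
    size_mb=$1
    block_mb=${2:-128}
    echo $(( (size_mb + block_mb - 1) / block_mb ))
}

# A 300 MB file is stored as 128 MB + 128 MB + 44 MB, i.e. 3 blocks.
hdfs_block_count 300
```

The last block only takes up the space it actually needs (44 MB here), so the count, not the padded size, is what matters for planning.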

Overview of YARN architecture

Yet Another Resource Negotiator (YARN for short) is the resource manager of Hadoop.

  • ResourceManager (RM): the leader of all cluster resources (memory, CPU, etc.)
  • ApplicationMaster (AM): the leader of a single running job
  • NodeManager (NM): the resource manager of a single node server
  • Container: equivalent to an independent server; it encapsulates the resources required to run a task, such as memory, CPU, disk and network

There can be multiple clients.
Multiple ApplicationMasters can run on a cluster.
Each NodeManager can host multiple containers.

MapReduce Architecture Overview

  • MapReduce divides the calculation process into two stages: Map and Reduce
  • The Map phase processes the input data in parallel
  • In the Reduce phase, the Map results are summarized
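The two phases can be imitated on a single machine with a shell pipeline. This is only an analogy under stated assumptions, not how Hadoop executes jobs: the function names map_phase, shuffle_phase and reduce_phase are hypothetical, and the sort step stands in for the grouping of identical keys that happens between Map and Reduce.

```shell
#!/bin/sh
# Map: emit one "word<TAB>1" pair for every word of the input.
map_phase() {
    awk '{ for (i = 1; i <= NF; i++) print $i "\t" 1 }'
}

# Shuffle: bring identical keys next to each other.
shuffle_phase() {
    LC_ALL=C sort
}

# Reduce: sum the counts of each run of identical keys.
reduce_phase() {
    awk -F '\t' '
        $1 != key { if (NR > 1) print key "\t" sum; key = $1; sum = 0 }
        { sum += $2 }
        END { if (NR > 0) print key "\t" sum }'
}

word_count() {
    map_phase | shuffle_phase | reduce_phase
}

# Prints one "word<TAB>count" line per distinct word.
printf 'hadoop yarn\nhadoop mapreduce\n' | word_count
```

In a real cluster the map and reduce steps run in parallel on many nodes; the pipeline only shows how the data flows from one phase to the next.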

Construction of Hadoop running environment

Create three virtual machines

1. Virtual network card IP settings


2. Modify the static IP of the virtual machine

Log in to the first machine as bigdata

su root
vi /etc/sysconfig/network-scripts/ifcfg-ens33

BOOTPROTO=static    ### modify
ONBOOT=yes          ### modify
IPADDR=             ### add
NETMASK=            ### add
GATEWAY=            ### add
DNS1=               ### add


Restart the network card

service network restart

View IP

ip addr


3. Modify host name

su root
vi /etc/hostname

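Configure the host name mappings for the cloned Linux machines by editing the hosts file /etc/hosts

vi /etc/hosts

Add one line for each of hadoop102, hadoop103 and hadoop104, mapping the machine's IP address to its host name.

Restart the machine

reboot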

The other two hosts are configured as follows

Log in to the second machine as bigdata
su root
vi /etc/sysconfig/network-scripts/ifcfg-ens33
BOOTPROTO=static    ### modify
ONBOOT=yes          ### modify
IPADDR=             ### add
NETMASK=            ### add
GATEWAY=            ### add

Restart the network card
service network restart

View IP
ip addr

Modify host name
su root
vi /etc/hostname

vi /etc/hosts
Add the mapping lines for hadoop102, hadoop103 and hadoop104.

Log in to the third machine as bigdata
su root
vi /etc/sysconfig/network-scripts/ifcfg-ens33
BOOTPROTO=static    ### modify
ONBOOT=yes          ### modify
IPADDR=             ### add
NETMASK=            ### add
GATEWAY=            ### add

Restart the network card
service network restart

View IP
ip addr

Modify host name
su root
vi /etc/hostname

vi /etc/hosts
Add the mapping lines for hadoop102, hadoop103 and hadoop104.

4. Turn off the firewall

su root
# Stop the firewall
systemctl stop firewalld
# Disable the firewall from starting at boot
systemctl disable firewalld

# View firewall status
systemctl status firewalld

5. Configure passwordless SSH login

Log in to hadoop102 as bigdata

su bigdata
cd ~                  # go to the home directory
cd .ssh               # if an error is reported, perform the following steps
ssh hadoop102         # connect to the machine itself; the purpose is to generate the .ssh directory
cd .ssh
ssh-keygen -t rsa     # press Enter three times

# Execute the following three commands in turn
ssh-copy-id hadoop102
ssh-copy-id hadoop103
ssh-copy-id hadoop104
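The three ssh-copy-id calls can also be written as a loop over the host names. A minimal sketch: distribute_key and the DRY_RUN switch are hypothetical names introduced for this illustration. By default the script only prints the commands it would run, so it can be tried safely on any machine.

```shell
#!/bin/sh
# Copy the current user's public key to every host given as an argument.
# DRY_RUN=1 (the default here) only prints the commands;
# set DRY_RUN=0 to actually invoke ssh-copy-id.
distribute_key() {
    for host in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh-copy-id $host"
        else
            ssh-copy-id "$host"
        fi
    done
}

distribute_key hadoop102 hadoop103 hadoop104
```

With DRY_RUN=0 each call still prompts for the target user's password once, exactly as the individual commands do.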

6. Create directory and modify permissions

su root
mkdir /opt/software
mkdir /opt/module

chmod 777 /opt/software
chmod 777 /opt/module

7. Install JDK

Uninstall existing JDK

To query whether Java software is installed:

rpm -qa | grep java

If the installed version is lower than 1.7, uninstall that JDK

sudo rpm -e <package-name>

View JDK installation path:

which java

Use the XShell file transfer tool to upload the JDK into the /opt/software directory

Check that the package was uploaded successfully to /opt/software on the Linux system

ls /opt/software/

Unzip the JDK to the /opt/module directory

cd /opt/software
tar -zxvf jdk-8u251-linux-x64.tar.gz -C /opt/module

Configure JDK environment variables

su root

vi /etc/profile

export JAVA_HOME=/opt/module/jdk1.8.0_251
export PATH=$PATH:$JAVA_HOME/bin


source /etc/profile
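If this step is repeated, editing the profile by hand can leave duplicate export lines. The sketch below shows one way to append a line only when it is not already present. append_once is a hypothetical helper, and the demonstration writes to a scratch file rather than the real /etc/profile.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already there.
# $1: the line to add; $2: the target file
append_once() {
    line=$1
    file=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstration against a scratch file; point it at /etc/profile (as root) for real use.
profile=$(mktemp)
append_once 'export JAVA_HOME=/opt/module/jdk1.8.0_251' "$profile"
append_once 'export PATH=$PATH:$JAVA_HOME/bin' "$profile"
append_once 'export JAVA_HOME=/opt/module/jdk1.8.0_251' "$profile"   # already present, nothing added
cat "$profile"
```

The single quotes keep $PATH and $JAVA_HOME unexpanded, so the variables are resolved when the profile is sourced, not when the line is written.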

Test whether the JDK is installed successfully

java -version

Configuration of the other two machines

su bigdata
cd /opt
scp -r /opt/module  hadoop103:/opt
scp -r /opt/module  hadoop104:/opt

On hadoop103 and hadoop104 respectively, modify /etc/profile
su root
vi /etc/profile
export JAVA_HOME=/opt/module/jdk1.8.0_251
export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile
su bigdata

Installing hadoop

Upload hadoop

Upload hadoop to /opt/software

Unzip to /opt/module

cd /opt/software
tar -zxvf hadoop-2.7.2.tar.gz  -C /opt/module

Configure hadoop


First: hadoop-env.sh

cd /opt/module/hadoop-2.7.2/etc/hadoop
vi hadoop-env.sh

# Around line 27
export JAVA_HOME=/opt/module/jdk1.8.0_251

Second: core-site.xml

vi core-site.xml

Add the properties inside the <configuration> tag. In this cluster the NameNode runs on hadoop102.

<!-- Configure the default file system name of HDFS. Note: the name is an HDFS URL -->
    <value>hdfs://(NameNode hostname):9000</value>

<!-- Specify the schema (URI) of the file system used by Hadoop, i.e. the address of the HDFS master (NameNode) -->
<!-- Specify the storage directory for files that Hadoop generates at run time -->

Create the storage directory

mkdir -p /opt/module/hadoop-2.7.2/data/tmp
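The two settings described in the comments might look like the file generated below. This is a sketch under assumptions: fs.defaultFS and hadoop.tmp.dir are the standard Hadoop 2.x property names, but they do not appear in the text above, and write_core_site is a hypothetical helper. The file is written to a scratch directory so the real etc/hadoop is not touched.

```shell
#!/bin/sh
# Write a minimal core-site.xml into a directory.
# $1: target directory
# $2: NameNode host name
# $3: directory for files Hadoop generates at run time
write_core_site() {
    cat > "$1/core-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- Default file system: the address of the HDFS NameNode (assumed property name) -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://$2:9000</value>
    </property>
    <!-- Storage directory for run-time files (assumed property name) -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>$3</value>
    </property>
</configuration>
EOF
}

conf_dir=$(mktemp -d)
write_core_site "$conf_dir" hadoop102 /opt/module/hadoop-2.7.2/data/tmp
cat "$conf_dir/core-site.xml"
```

After checking the output, the two property elements can be copied into the real core-site.xml.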

Third: hdfs-site.xml

vi hdfs-site.xml
<!-- Specify the number of HDFS replicas -->

Fourth: mapred-site.xml

mv mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
<!-- Specify that MapReduce runs on YARN -->

Fifth: yarn-site.xml

vi yarn-site.xml
<!-- Specify the address of the YARN master (ResourceManager) -->
<!-- How reducers obtain data -->
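The settings named in the comments for hdfs-site.xml, mapred-site.xml and yarn-site.xml could be generated as sketched below. Everything specific here is an assumption, since the original property elements are not shown: the property names are the standard Hadoop 2.x ones, the replica count 3 and the ResourceManager host hadoop102 are example values, and property and write_site_file are hypothetical helpers. Files go to a scratch directory.

```shell
#!/bin/sh
# Print one <property> element. $1: name, $2: value
property() {
    printf '    <property>\n        <name>%s</name>\n        <value>%s</value>\n    </property>\n' "$1" "$2"
}

# Wrap the property elements read from stdin in a <configuration> root
# and write the result to the file given as $1.
write_site_file() {
    {
        echo '<configuration>'
        cat
        echo '</configuration>'
    } > "$1"
}

conf_dir=$(mktemp -d)

# hdfs-site.xml: number of HDFS replicas (3 is an assumed value)
property dfs.replication 3 | write_site_file "$conf_dir/hdfs-site.xml"

# mapred-site.xml: run MapReduce on YARN
property mapreduce.framework.name yarn | write_site_file "$conf_dir/mapred-site.xml"

# yarn-site.xml: ResourceManager host (assumed) and how reducers obtain data
{
    property yarn.resourcemanager.hostname hadoop102
    property yarn.nodemanager.aux-services mapreduce_shuffle
} | write_site_file "$conf_dir/yarn-site.xml"

cat "$conf_dir/yarn-site.xml"
```

Adjust the replica count and the ResourceManager host to the actual cluster before copying the elements into the real files.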

Sixth: slaves

vi slaves

Add the host names of the worker (DataNode) machines, one per line.


Copy to two other machines

 cd /opt/module
scp -r /opt/module/hadoop-2.7.2  hadoop103:/opt/module
scp -r /opt/module/hadoop-2.7.2  hadoop104:/opt/module

Add environment variable

Add hadoop to the environment variables (on all three machines)

su root
vi /etc/profile
export HADOOP_HOME=/opt/module/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source /etc/profile
su bigdata

Format hdfs

Format the NameNode on hadoop102 (formatting is required only before the first use)

hdfs namenode -format

Note: formatting here means initializing a file system that Hadoop recognizes. It is comparable to buying a hard disk: the disk must first be formatted with a file system recognized by Windows, macOS or Linux before it can be used.

Start Hadoop (on hadoop102)

# Start HDFS first
$HADOOP_HOME/sbin/start-dfs.sh
# Then start YARN
$HADOOP_HOME/sbin/start-yarn.sh

# Turn off safe mode
hdfs dfsadmin -safemode leave

Verify successful startup

Use the jps command to check whether Hadoop started successfully. The presence of the following process names indicates a successful startup:
    5876 SecondaryNameNode
    5702 DataNode
    5995 Jps
    5612 NameNode
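Checking the process list by eye can be replaced with a small script. A minimal sketch: check_daemons is a hypothetical helper that reads jps-style output ("pid name" per line) on standard input and reports which expected names are missing. The demonstration feeds it the sample listing above; on a real node the input would come from jps itself.

```shell
#!/bin/sh
# Read jps-style output on stdin and check that every process name
# passed as an argument is present. Returns non-zero if any is missing.
check_daemons() {
    running=$(awk '{ print $2 }')
    status=0
    for name in "$@"; do
        # -x matches whole lines, so NameNode does not match SecondaryNameNode
        if printf '%s\n' "$running" | grep -qxF -- "$name"; then
            echo "OK      $name"
        else
            echo "MISSING $name"
            status=1
        fi
    done
    return $status
}

# Sample input copied from the listing above.
# Real use: jps | check_daemons NameNode DataNode SecondaryNameNode
printf '5876 SecondaryNameNode\n5702 DataNode\n5995 Jps\n5612 NameNode\n' |
    check_daemons NameNode DataNode SecondaryNameNode
```

The exit status makes the check usable in scripts, for example to stop a deployment when a daemon did not come up.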

Visual management interface

http://hadoop102:50070 (HDFS management interface)
http://hadoop102:8088 (MR management interface)


Hadoop directory structure

  1. bin directory: stores the scripts for operating Hadoop-related services (hdfs, yarn, mapred)
  2. etc directory: stores the Hadoop configuration files
  3. lib directory: stores Hadoop's native libraries (used for compressing and decompressing data)
  4. sbin directory: stores the scripts for starting or stopping Hadoop-related services
  5. share directory: stores Hadoop's dependent jar packages, documentation and official examples

Hadoop running mode

Hadoop operation modes include: local mode, pseudo distributed mode and fully distributed mode.

  • Local mode: runs on a single machine, only to demonstrate the official examples. Not used in production environments.
  • Pseudo-distributed mode: also runs on a single machine, but has all the functions of a Hadoop cluster; one server simulates a distributed environment. Some companies on a tight budget use it for testing. Not used in production environments.
  • Fully distributed mode: multiple servers form a distributed environment. Used in production environments.

Word count

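This example uses hadoop-3.1.3 paths; with the hadoop-2.7.2 installation described above, substitute the version in the directory and jar names.

  1. Create a wcinput folder under the hadoop-3.1.3 directory
mkdir wcinput
  2. Create a word.txt file in the wcinput folder
cd wcinput
  3. Edit the word.txt file
vim word.txt

Enter the following in the file

hadoop yarn
hadoop mapreduce
dyk dyk

Save and exit: :wq

  4. Go back to the Hadoop directory /opt/module/hadoop-3.1.3 and run the word count example
cd /opt/module/hadoop-3.1.3
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount wcinput wcoutput
  5. View the results
cat wcoutput/part-r-00000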

