The history of big data
The 4Vs: volume, velocity, variety (structured and unstructured data), value (low value density)
Technical challenges brought by big data
- Ever-increasing storage requirements
- Difficulty in extracting valuable information: search, advertising, recommendation
- Large data volumes, many data types, and demanding processing efficiency make it very difficult to obtain valuable information from the data
Overview of hadoop theory
A brief history of hadoop
- The Apache Nutch project is an open source web search engine
- Google published GFS, the predecessor of HDFS
- Google published MapReduce, a distributed programming model
- Nutch produced an open source implementation of MapReduce
Introduction to hadoop
- An open source distributed computing platform under the Apache Software Foundation
- Written in Java, cross-platform
- Provides the ability to process massive amounts of data in a distributed environment
- Almost all vendors provide development tools around hadoop
hadoop core
- Distributed file system HDFS
- MapReduce for distributed computing
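To make the MapReduce programming model concrete, the following is a minimal WordCount sketch written against the Hadoop 2.x Java API (this example is not part of the original notes; the class names and the /input and /output arguments are illustrative). The map phase emits a (word, 1) pair for every word in the input, the reduce phase sums the counts for each word, and HDFS supplies the input splits and stores the final output.

// WordCount.java - illustrative sketch of the MapReduce programming model
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map: for each input line, emit a (word, 1) pair per word
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // reduce: sum the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);    // pre-aggregate counts on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));      // e.g. /input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));    // e.g. /output, must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, a job like this would typically be submitted with hadoop jar wordcount.jar WordCount /input /output; reusing the reducer as a combiner works here because summing counts is associative.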
hadoop features
- High reliability
- High efficiency
- High scalability
- High fault tolerance
- Low cost
- Runs on Linux
- Support for multiple programming languages
hadoop ecosystem
- HDFS: distributed file system
- mapreduce: a distributed parallel programming model
- yarn: resource management and scheduler
- tez: the next-generation query processing framework for hadoop, running on yarn. It analyzes and optimizes multiple MapReduce tasks and combines them into a directed acyclic graph (DAG) to maximize efficiency
- hive: data warehouse on hadoop
- hbase: non relational distributed database
- pig: a large-scale data analysis platform based on hadoop
- sqoop: used for data transfer between hadoop and traditional database
- oozie: workflow management system
- zookeeper: provides distributed coordination and consistency services
- storm: stream computing framework
- flume: a distributed system for massive log collection, aggregation and transmission
- ambari: a tool for rapid deployment, management and monitoring of hadoop clusters
- kafka: a distributed publish-subscribe messaging system that can handle all the activity stream data of a consumer-scale website
- spark: a general parallel framework similar to hadoop mapreduce
hadoop pseudo-distributed mode installation
Main process
- Create users and user groups
sudo useradd -d /home/zhangyu -m zhangyu
sudo passwd zhangyu
sudo usermod -G sudo zhangyu
su zhangyu
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost
- Create apps and data directories and modify permissions
sudo mkdir /apps
sudo mkdir /data
sudo chown -R zhangyu:zhangyu /apps
sudo chown -R zhangyu:zhangyu /data
- Download hadoop and java
mkdir -p /data/hadoop1
cd /data/hadoop1
wget java    //Download the JDK package (URL omitted in the original notes)
wget hadoop    //Download the hadoop package (URL omitted in the original notes)
tar -xzvf jdk.tar.gz -C /apps
tar -xzvf hadoop.tar.gz -C /apps
cd /apps
mv jdk java    //Rename the extracted directories to java and hadoop (actual names depend on the downloaded versions)
mv hadoop hadoop
- Add the above two to environment variables
vim ~/.bashrc
export JAVA_HOME=/apps/java
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/apps/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
source ~/.bashrc
java -version    //Verify that both commands are now on the PATH
hadoop version
- Modify hadoop configuration file
cd /apps/hadoop/etc/hadoop
vim hadoop-env.sh
export JAVA_HOME=/apps/java

vim core-site.xml    //Add the following properties
<property>
    <name>hadoop.tmp.dir</name>    //Temporary file storage location
    <value>/data/tmp/hadoop/tmp</value>
</property>
<property>
    <name>fs.defaultFS</name>    //Address of the hdfs file system
    <value>hdfs://localhost:9000</value>
</property>
mkdir -p /data/tmp/hadoop/tmp

vim hdfs-site.xml    //Add the following properties
<property>
    <name>dfs.namenode.name.dir</name>    //Where namenode metadata is stored
    <value>/data/tmp/hadoop/hdfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>    //Where datanode block data is stored
    <value>/data/tmp/hadoop/hdfs/data</value>
</property>
<property>
    <name>dfs.replication</name>    //Number of replicas of each data block, set according to the number of nodes
    <value>1</value>
</property>
<property>
    <name>dfs.permissions.enabled</name>    //Whether hdfs permission checking is enabled
    <value>false</value>
</property>
- Add the hostnames of the slave nodes in the cluster to the slaves file
vim slaves    //Currently there is only one node, so the slaves file contains only localhost
localhost
- Format hdfs file system
hadoop namenode -format
- Start hdfs and run jps to check whether the related processes have started
cd /apps/hadoop/sbin/
./start-dfs.sh
jps
hadoop fs -mkdir /myhadoop1    //Test hdfs by creating a directory and listing it
hadoop fs -ls -R /
- Configure mapreduce
cd /apps/hadoop/etc/hadoop/
mv mapred-site.xml.template mapred-site.xml
vim mapred-site.xml
<property>
    <name>mapreduce.framework.name</name>    //The framework used to run mapreduce jobs
    <value>yarn</value>
</property>
- Configure yarn and test
vim yarn-site.xml
<property>
    <name>yarn.nodemanager.aux-services</name>    //Auxiliary service used by the nodemanager for the mapreduce shuffle
    <value>mapreduce_shuffle</value>
</property>
./start-yarn.sh
- Perform tests
cd /apps/hadoop/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.4.5.jar pi 3 3
hadoop development plug-in installation
mkdir -p /data/hadoop3
cd /data/hadoop3
wget http://192.168.1.100:60000/allfiles/hadoop3/hadoop-eclipse-plugin-2.6.0.jar
cp /data/hadoop3/hadoop-eclipse-plugin-2.6.0.jar /apps/eclipse/plugins/
- In the eclipse graphical interface
Window -> Open Perspective -> Other, then select Map/Reduce
Click the blue elephant icon in the upper right corner of the console to add the relevant configuration
- Terminal command line
cd /apps/hadoop/sbin
./start-all.sh
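With the pseudo-distributed cluster running and the plugin configured, a small Java program such as the following can be run from eclipse to confirm that the development environment can reach HDFS. This is only a sketch, not part of the original notes: the class name HdfsTest and the path /myhadoop1/hello.txt are made up, and the fs.defaultFS value must match the one configured in core-site.xml above.

// HdfsTest.java - minimal check that the development environment can talk to HDFS
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");    // must match core-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write a small file to HDFS (the path is just an example)
        Path file = new Path("/myhadoop1/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back and print its contents
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}

Because dfs.permissions.enabled was set to false earlier, the program can write to HDFS regardless of which local user runs it from eclipse.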
hadoop common commands
Start and stop hadoop
cd /apps/hadoop/sbin
./start-all.sh
cd /apps/hadoop/sbin
./stop-all.sh
Command format
hadoop fs -[command] [target]    //General format
hadoop fs -ls /user    //Example
View version
hdfs version
hdfs dfsadmin -report    //View system status
Directory operation
hadoop fs -ls -R /
hadoop fs -mkdir /input
hadoop fs -mkdir -p /test/test1/test2
hadoop fs -rm -r /input    //Recursively delete a directory
File operation
hadoop fs -touchz test.txt    //Create an empty file
hadoop fs -put test.txt /input    //Upload the local file to the /input directory in hdfs
hadoop fs -get /input/test.txt /data    //Download the file from the hadoop cluster to the local /data directory
hadoop fs -cat /input/test.txt    //Print the file contents
hadoop fs -tail data.txt    //Show the last part of the file
hadoop fs -du -s /data.txt    //View the file size
hadoop fs -text /test1/data.txt    //Output the file in text format
hadoop fs -stat data.txt    //Return statistics for the specified path
hadoop fs -chown root /data.txt    //Change the file owner
hadoop fs -chmod 777 data.txt    //Set the file permissions to 777
hadoop fs -expunge    //Empty the trash
Safe mode switching
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave