Configuring Hadoop pseudo-distributed mode and some common commands

Keywords: Hadoop sudo vim Java

The history of big data

4V: volume, velocity, variety (structured and unstructured data), value (low value density)

Technical challenges brought by big data

  • Ever-growing storage requirements
  • Difficulty in extracting valuable information: search, advertising, recommendation
  • Processing scenarios that combine large volume, many data types and high efficiency requirements make it very hard to extract valuable information from the data

Overview of Hadoop theory

A brief history of Hadoop

  • Hadoop grew out of Apache Nutch, an open-source web search engine
  • Google published the GFS paper; GFS is the design predecessor of HDFS
  • Google published the MapReduce distributed programming model
  • Nutch implemented MapReduce as open source

Introduction to Hadoop

  • An open-source distributed computing platform under the Apache Software Foundation
  • Written in Java, cross-platform
  • Provides the ability to process massive data in a distributed environment
  • Almost all vendors provide development tools around Hadoop

Hadoop core

  • HDFS, the distributed file system
  • MapReduce, for distributed computing

Hadoop features

  • High reliability
  • High efficiency
  • High scalability
  • High fault tolerance
  • Low cost
  • Runs on Linux
  • Support for multiple programming languages

Hadoop ecosystem

  • HDFS: distributed file system
  • MapReduce: a distributed parallel programming model
  • YARN: resource manager and scheduler
  • Tez: the next-generation Hadoop query processing framework, running on YARN. It analyzes and optimizes multiple MapReduce tasks and builds them into a directed acyclic graph (DAG) to achieve the highest efficiency
  • Hive: data warehouse on Hadoop
  • HBase: non-relational distributed database
  • Pig: a large-scale data analysis platform based on Hadoop
  • Sqoop: transfers data between Hadoop and traditional databases
  • Oozie: workflow management system
  • ZooKeeper: provides distributed coordination and consistency services
  • Storm: stream computing framework
  • Flume: a distributed system for collecting, aggregating and transmitting massive logs
  • Ambari: a rapid deployment tool
  • Kafka: distributed publish-subscribe messaging system that can handle all activity stream data of a consumer-scale website
  • Spark: a general-purpose parallel framework similar to Hadoop MapReduce

Hadoop pseudo-distributed mode installation

Main process

  • Create users and user groups
sudo useradd -d /home/zhangyu -m zhangyu  
sudo passwd zhangyu
sudo usermod -G sudo zhangyu
su zhangyu
ssh-keygen -t rsa
cat ~/.ssh/ >>~/.ssh/authorized_keys
ssh localhost
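
The key-authorization step above can be sketched as a small function. This is only an illustration: the function name authorize_key is hypothetical, and the usage line assumes the default file name that ssh-keygen -t rsa produces (~/.ssh/id_rsa.pub), which the commands above do not spell out. It appends the key only once and tightens the permissions that sshd usually expects.

```shell
#!/bin/sh
# Hypothetical helper: append a public key to an authorized_keys file once.
# $1 = public key file, $2 = authorized_keys file
authorize_key() {
    pub=$1
    auth=$2
    [ -f "$pub" ] || { echo "missing public key: $pub" >&2; return 1; }
    mkdir -p "$(dirname "$auth")"
    touch "$auth"
    # -x: whole-line match, -F: fixed string, so the key is not read as a regex
    if ! grep -qxF -e "$(cat "$pub")" "$auth"; then
        cat "$pub" >> "$auth"
    fi
    chmod 700 "$(dirname "$auth")"
    chmod 600 "$auth"
}

# Assumed usage with the default RSA key path:
#   authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```

Running it twice leaves a single copy of the key in the file, so it is safe to repeat before trying ssh localhost again.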
  • Create apps and data directories and modify permissions
sudo mkdir /apps
sudo mkdir /data
sudo chown -R zhangyu:zhangyu /apps
sudo chown -R zhangyu:zhangyu /data
  • Download hadoop and java
mkdir -p /data/hadoop1
cd /data/hadoop1
wget java
wget hadoop
tar -xzvf jdk.tar.gz -C /apps
tar -xzvf hadoop.tar.gz -C /apps
cd /apps
mv jdk java
mv hadoop hadoop
  • Add the two above to the environment variables
vim ~/.bashrc
export JAVA_HOME=/apps/java
export HADOOP_HOME=/apps/hadoop
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
source ~/.bashrc
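
After reloading ~/.bashrc it can help to confirm that both variables point at real installations. The following is a minimal sketch; check_home is a hypothetical helper, and the bin/java and bin/hadoop locations are the usual layout of the unpacked archives rather than something stated above.

```shell
#!/bin/sh
# Hypothetical check: confirm an install directory contains the expected executable.
# $1 = label, $2 = home directory, $3 = relative path of the executable
check_home() {
    if [ -x "$2/$3" ]; then
        echo "$1 OK: $2/$3"
    else
        echo "$1 problem: $2/$3 not found or not executable" >&2
        return 1
    fi
}

# Assumed usage after `source ~/.bashrc`:
#   check_home JAVA_HOME "$JAVA_HOME" bin/java
#   check_home HADOOP_HOME "$HADOOP_HOME" bin/hadoop
```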
  • Modify the Hadoop configuration files
cd /apps/hadoop/etc/hadoop
vim hadoop-env.sh
export JAVA_HOME=/apps/java

vim core-site.xml
    <name>hadoop.tmp.dir</name>  //Temporary file storage location
    <name>fs.defaultFS</name>  //Address of the hdfs file system
mkdir -p /data/tmp/hadoop/tmp  
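
For reference, the two core-site.xml properties named above could be written out as follows. This is a sketch, not the author's file: the function name write_core_site is hypothetical, and hdfs://localhost:9000 is an assumed address, since the notes do not give a value for fs.defaultFS. The temporary directory is the one created by the mkdir command above.

```shell
#!/bin/sh
# Sketch: write a minimal core-site.xml into the directory given as $1.
# hdfs://localhost:9000 is an assumed address; the source does not state one.
write_core_site() {
    cat > "$1/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/tmp/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
}

# Assumed usage:
#   write_core_site /apps/hadoop/etc/hadoop
```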

vim hdfs-site.xml
    <name></name>  //Configure the metadata storage location
    <name></name>  //Configure the storage location of the actual data
    <name>dfs.replication</name>  //Configure the number of replicas of each data block according to the number of nodes
    <name>dfs.permissions.enabled</name>  //Configure whether HDFS enables permission checking
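
A single-node hdfs-site.xml along these lines might look like the sketch below. Several parts are assumptions: the notes leave the two storage property names blank, so dfs.namenode.name.dir and dfs.datanode.data.dir (the Hadoop 2 names for metadata and block storage) are filled in here as a guess at what was meant; the name/ and data/ directory layout, the replication value of 1 and the disabled permission check are likewise illustrative. The function name write_hdfs_site is hypothetical.

```shell
#!/bin/sh
# Sketch: write a single-node hdfs-site.xml.
# $1 = config directory, $2 = base directory for HDFS storage (assumed layout)
write_hdfs_site() {
    conf_dir=$1
    base=$2
    mkdir -p "$base/name" "$base/data"
    # Unquoted EOF so that $base is expanded inside the document
    cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>$base/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>$base/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
</configuration>
EOF
}

# Assumed usage:
#   write_hdfs_site /apps/hadoop/etc/hadoop /data/tmp/hadoop/hdfs
```

With only one node there is nowhere to place a second copy of a block, which is why the sketch sets the replica count to 1.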
  • Add the hostnames of the slave nodes in the cluster to the slaves file
vim slaves
//Currently there is only one node, so the slaves file contains only localhost
  • Format hdfs file system
hadoop namenode -format
  • Start HDFS and enter jps to check whether the HDFS-related processes have started
cd /apps/hadoop/sbin/
./start-dfs.sh
jps
hadoop fs -mkdir /myhadoop1
hadoop fs -ls -R /
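
The jps check can also be scripted. The sketch below reads jps output on standard input and reports which of the three HDFS daemons of a pseudo-distributed node are present; the function name check_hdfs_daemons is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: read `jps` output on stdin and report missing HDFS daemons.
check_hdfs_daemons() {
    # jps prints "<pid> <class name>"; keep only the class names
    running=$(awk '{print $2}')
    missing=0
    for d in NameNode DataNode SecondaryNameNode; do
        # -x matches the whole line, so NameNode does not match SecondaryNameNode
        if printf '%s\n' "$running" | grep -qx "$d"; then
            echo "$d running"
        else
            echo "$d MISSING"
            missing=1
        fi
    done
    return $missing
}

# Assumed usage:
#   jps | check_hdfs_daemons
```

The function returns a non-zero status when any daemon is missing, so it can gate later steps such as creating directories in HDFS.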
  • Configure mapreduce
cd /apps/hadoop/etc/hadoop/
mv mapred-site.xml.template mapred-site.xml
vim mapred-site.xml
    <name></name>  //Configure the framework used by the mapreduce task
  • Configure YARN and test
vim yarn-site.xml
    <name>yarn.nodemanager.aux-services</name>  //Specify the auxiliary service used by the NodeManager
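
As a rough illustration of the MapReduce and YARN settings, the sketch below generates both files from small helper functions. The helper names are hypothetical. The notes leave the MapReduce property name blank, so mapreduce.framework.name is an assumption here, and the values yarn and mapreduce_shuffle are the commonly used ones rather than values given in the notes.

```shell
#!/bin/sh
# Hypothetical helpers for emitting Hadoop *-site.xml files.

# Print one <property> block. $1 = name, $2 = value
property() {
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Read <property> blocks on stdin and wrap them in <configuration>.
configuration() {
    printf '<?xml version="1.0"?>\n<configuration>\n'
    cat
    printf '</configuration>\n'
}

# Write mapred-site.xml and yarn-site.xml into the directory given as $1.
write_yarn_configs() {
    # Assumed name and value: run MapReduce jobs on YARN
    property mapreduce.framework.name yarn | configuration > "$1/mapred-site.xml"
    # Assumed value: the shuffle service MapReduce needs on each NodeManager
    property yarn.nodemanager.aux-services mapreduce_shuffle | configuration > "$1/yarn-site.xml"
}

# Assumed usage:
#   write_yarn_configs /apps/hadoop/etc/hadoop
```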
  • Perform tests
cd /apps/hadoop/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.4.5.jar pi 3 3 

Hadoop development plug-in installation

mkdir -p /data/hadoop3
cd /data/hadoop3  
cp /data/hadoop3/hadoop-eclipse-plugin-2.6.0.jar /apps/eclipse/plugins/  
  • In the Eclipse graphical interface
Window -> Open Perspective -> Other
Select Map/Reduce
Click the blue elephant icon in the upper right corner of the console to add the relevant configuration
  • Terminal command line
cd /apps/hadoop/sbin

Hadoop common commands

Turning Hadoop on and off

cd /apps/hadoop/sbin
./start-all.sh  //Start all Hadoop daemons
./stop-all.sh  //Stop all Hadoop daemons

Command format

hadoop fs -command target
hadoop fs -ls /user

View version

hdfs version
hdfs dfsadmin -report  //View system status

Directory operation

hadoop fs -ls -R /  
hadoop fs -mkdir /input
hadoop fs -mkdir -p /test/test1/test2
hadoop fs -rm -r -f /input  //hadoop fs does not accept the combined -rf form

File operation

hadoop fs -touchz test.txt  //Create an empty file in HDFS
hadoop fs -put test.txt /input  //Upload the local file test.txt to /input
hadoop fs -get /input/test.txt /data  //Download test.txt from HDFS to the local /data directory
hadoop fs -cat /input/test.txt  //Print the file contents
hadoop fs -tail data.txt  //Print the last kilobyte of the file
hadoop fs -du -s /data.txt  //View file size
hadoop fs -text /test1/data.txt  //Output the source file in text format
hadoop fs -stat data.txt  //Return statistics for the specified path
hadoop fs -chown root /data.txt  //Change file owner
hadoop fs -chmod 777 data.txt  //Give the file 777 permissions
hadoop fs -expunge  //Empty the trash
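
Several of the file commands above can be combined into an upload-and-verify routine. The sketch below is one possible arrangement, not a standard tool: put_and_verify and the HADOOP_CMD override are hypothetical names, and the checksum comparison is an added safeguard that the notes do not mention.

```shell
#!/bin/sh
# Sketch: upload a local file to HDFS and confirm the stored copy matches.
# HADOOP_CMD is a hypothetical override so the script can be pointed at a
# specific binary; it defaults to `hadoop` on PATH.
HADOOP_CMD=${HADOOP_CMD:-hadoop}

# $1 = local file, $2 = HDFS directory
put_and_verify() {
    local_file=$1
    hdfs_dir=$2
    name=$(basename "$local_file")
    "$HADOOP_CMD" fs -mkdir -p "$hdfs_dir" || return 1
    # -f overwrites an existing file of the same name
    "$HADOOP_CMD" fs -put -f "$local_file" "$hdfs_dir/" || return 1
    # Compare checksums of the local file and the HDFS copy
    local_sum=$(cksum < "$local_file")
    remote_sum=$("$HADOOP_CMD" fs -cat "$hdfs_dir/$name" | cksum)
    [ "$local_sum" = "$remote_sum" ]
}

# Assumed usage:
#   put_and_verify test.txt /input && echo "upload verified"
```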

Safe mode switching

hdfs dfsadmin -safemode enter  //Enter safe mode (HDFS becomes read-only)
hdfs dfsadmin -safemode leave  //Leave safe mode
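
Before switching, the current state can be read with hdfs dfsadmin -safemode get, which prints a line such as "Safe mode is OFF". The sketch below reduces that line to ON or OFF for use in scripts; the function name safemode_state is hypothetical.

```shell
#!/bin/sh
# Sketch: turn the line printed by `hdfs dfsadmin -safemode get` into ON/OFF.
# $1 = the text printed by the command
safemode_state() {
    case $1 in
        *"Safe mode is ON"*)  echo ON ;;
        *"Safe mode is OFF"*) echo OFF ;;
        *) echo UNKNOWN; return 1 ;;
    esac
}

# Assumed usage:
#   safemode_state "$(hdfs dfsadmin -safemode get)"
```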

Posted by jmansa on Sat, 20 Jun 2020 00:46:03 -0700