The history of big data
The 4Vs: volume, velocity, variety (structured and unstructured data), value (low value density)
Technical challenges brought by big data
- Ever-increasing storage requirements
- Difficulty in extracting valuable information: search, advertising, recommendation
- Large data volumes, many data types, and demanding processing efficiency make it very difficult to obtain valuable information from the data
Overview of hadoop theory
A brief history of hadoop
- The Apache Nutch project is an open source web search engine
- Google published GFS, the predecessor of HDFS
- Google published MapReduce, a distributed programming model
- Nutch produced an open source implementation of MapReduce
Introduction to hadoop
- An open source distributed computing platform under the Apache Software Foundation
- Written in Java, cross-platform
- Provides the ability to process massive amounts of data in a distributed environment
- Almost all vendors provide development tools around hadoop
hadoop core
- Distributed file system HDFS
- MapReduce for distributed computing
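To make the MapReduce programming model concrete, the following is a minimal WordCount sketch written against the Hadoop 2.x Java API (this example is not part of the original notes; the class names and the /input and /output arguments are illustrative). The map phase emits a (word, 1) pair for every word in the input, the reduce phase sums the counts for each word, and HDFS supplies the input splits and stores the final output.

// WordCount.java - illustrative sketch of the MapReduce programming model
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map: for each input line, emit a (word, 1) pair per word
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // reduce: sum the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);    // pre-aggregate counts on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));      // e.g. /input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));    // e.g. /output, must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, a job like this would typically be submitted with hadoop jar wordcount.jar WordCount /input /output; reusing the reducer as a combiner works here because summing counts is associative.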
hadoop features
- High reliability
- High efficiency
- High scalability
- High fault tolerance
- Low cost
- Runs on Linux
- Support for multiple programming languages
hadoop ecosystem
- HDFS: distributed file system
- mapreduce: a distributed parallel programming model
- yarn: resource management and scheduler
- tez: the next-generation query processing framework for hadoop, running on yarn. It analyzes and optimizes multiple MapReduce tasks and combines them into a directed acyclic graph (DAG) to maximize efficiency
- hive: data warehouse on hadoop
- hbase: non relational distributed database
- pig: a large-scale data analysis platform based on hadoop
- sqoop: used for data transfer between hadoop and traditional database
- oozie: workflow management system
- zookeeper: provides distributed coordination and consistency services
- storm: stream computing framework
- flume: a distributed system for massive log collection, aggregation and transmission
- ambari: a tool for rapid deployment, management and monitoring of hadoop clusters
- kafka: a distributed publish-subscribe messaging system that can handle all the activity stream data of a consumer-scale website
- spark: a general parallel framework similar to hadoop mapreduce
hadoop pseudo-distributed mode installation
Main process
- Create users and user groups
sudo useradd -d /home/zhangyu -m zhangyu
sudo passwd zhangyu
sudo usermod -G sudo zhangyu
su zhangyu
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost
- Create apps and data directories and modify permissions
sudo mkdir /apps
sudo mkdir /data
sudo chown -R zhangyu:zhangyu /apps
sudo chown -R zhangyu:zhangyu /data
- Download hadoop and java
mkdir -p /data/hadoop1
cd /data/hadoop1
wget java    //Download the JDK package (URL omitted in the original notes)
wget hadoop    //Download the hadoop package (URL omitted in the original notes)
tar -xzvf jdk.tar.gz -C /apps
tar -xzvf hadoop.tar.gz -C /apps
cd /apps
mv jdk java    //Rename the extracted directories to java and hadoop (actual names depend on the downloaded versions)
mv hadoop hadoop
- Add the above two to environment variables
vim ~/.bashrc
export JAVA_HOME=/apps/java
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/apps/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
source ~/.bashrc
java -version    //Verify that both commands are now on the PATH
hadoop version
- Modify hadoop configuration file
cd /apps/hadoop/etc/hadoop
vim hadoop-env.sh
export JAVA_HOME=/apps/java

vim core-site.xml    //Add the following properties
<property>
    <name>hadoop.tmp.dir</name>    //Temporary file storage location
    <value>/data/tmp/hadoop/tmp</value>
</property>
<property>
    <name>fs.defaultFS</name>    //Address of the hdfs file system
    <value>hdfs://localhost:9000</value>
</property>
mkdir -p /data/tmp/hadoop/tmp

vim hdfs-site.xml    //Add the following properties
<property>
    <name>dfs.namenode.name.dir</name>    //Where namenode metadata is stored
    <value>/data/tmp/hadoop/hdfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>    //Where datanode block data is stored
    <value>/data/tmp/hadoop/hdfs/data</value>
</property>
<property>
    <name>dfs.replication</name>    //Number of replicas of each data block, set according to the number of nodes
    <value>1</value>
</property>
<property>
    <name>dfs.permissions.enabled</name>    //Whether hdfs permission checking is enabled
    <value>false</value>
</property>
- Add the hostnames of the slave nodes in the cluster to the slaves file
vim slaves    //Currently there is only one node, so the slaves file contains only localhost
localhost
- Format hdfs file system
hadoop namenode -format
- Start hdfs and run jps to check whether the related processes have started
cd /apps/hadoop/sbin/
./start-dfs.sh
jps
hadoop fs -mkdir /myhadoop1    //Test hdfs by creating a directory and listing it
hadoop fs -ls -R /
- Configure mapreduce
cd /apps/hadoop/etc/hadoop/
mv mapred-site.xml.template mapred-site.xml
vim mapred-site.xml
<property>
    <name>mapreduce.framework.name</name>    //The framework used to run mapreduce jobs
    <value>yarn</value>
</property>
- Configure yarn and test
vim yarn-site.xml
<property>
    <name>yarn.nodemanager.aux-services</name>    //Auxiliary service used by the nodemanager for the mapreduce shuffle
    <value>mapreduce_shuffle</value>
</property>
./start-yarn.sh
- Perform tests
cd /apps/hadoop/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.4.5.jar pi 3 3
hadoop development plug-in installation
mkdir -p /data/hadoop3
cd /data/hadoop3
wget http://192.168.1.100:60000/allfiles/hadoop3/hadoop-eclipse-plugin-2.6.0.jar
cp /data/hadoop3/hadoop-eclipse-plugin-2.6.0.jar /apps/eclipse/plugins/
- In the eclipse graphical interface
Window -> Open Perspective -> Other, then select Map/Reduce
Click the blue elephant icon in the upper right corner of the console to add the relevant configuration
- Terminal command line
cd /apps/hadoop/sbin
./start-all.sh
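With the pseudo-distributed cluster running and the plugin configured, a small Java program such as the following can be run from eclipse to confirm that the development environment can reach HDFS. This is only a sketch, not part of the original notes: the class name HdfsTest and the path /myhadoop1/hello.txt are made up, and the fs.defaultFS value must match the one configured in core-site.xml above.

// HdfsTest.java - minimal check that the development environment can talk to HDFS
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");    // must match core-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write a small file to HDFS (the path is just an example)
        Path file = new Path("/myhadoop1/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back and print its contents
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}

Because dfs.permissions.enabled was set to false earlier, the program can write to HDFS regardless of which local user runs it from eclipse.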
hadoop common commands
Start and stop hadoop
cd /apps/hadoop/sbin
./start-all.sh
cd /apps/hadoop/sbin
./stop-all.sh
Command format
hadoop fs -[command] [target]    //General format
hadoop fs -ls /user    //Example
View version
hdfs version
hdfs dfsadmin -report    //View system status
Directory operation
hadoop fs -ls -R /
hadoop fs -mkdir /input
hadoop fs -mkdir -p /test/test1/test2
hadoop fs -rm -r /input    //Recursively delete a directory
File operation
hadoop fs -touchz test.txt    //Create an empty file
hadoop fs -put test.txt /input    //Upload the local file to the /input directory in hdfs
hadoop fs -get /input/test.txt /data    //Download the file from the hadoop cluster to the local /data directory
hadoop fs -cat /input/test.txt    //Print the file contents
hadoop fs -tail data.txt    //Show the last part of the file
hadoop fs -du -s /data.txt    //View the file size
hadoop fs -text /test1/data.txt    //Output the file in text format
hadoop fs -stat data.txt    //Return statistics for the specified path
hadoop fs -chown root /data.txt    //Change the file owner
hadoop fs -chmod 777 data.txt    //Set the file permissions to 777
hadoop fs -expunge    //Empty the trash
Safe mode switching
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave