1. Introduction to Spark
- In 2009, Spark was born in the AMPLab at UC Berkeley. It started as an experimental project with very little code and was a lightweight framework.
- In 2010, UC Berkeley officially open-sourced the Spark project.
- In June 2013, Spark became a project under the Apache Foundation and entered a period of rapid development, with third-party developers contributing a lot of code and a very active community.
- In February 2014, Spark became an Apache top-level project at a record pace, and the big data company Cloudera announced that it would increase its investment in Spark to replace MapReduce.
- In April 2014, the big data company MapR joined the Spark camp, and Apache Mahout abandoned MapReduce in favor of Spark as its computing engine.
- In May 2014, Spark 1.0.0 was released.
- In 2015, Spark became more and more popular in China's IT industry, and more and more companies began to deploy or use Spark to replace MR2, Hive, Storm, and other traditional big data parallel computing frameworks.
2. What is Spark?
- Apache Spark™ is a unified analytics engine for large-scale data processing.
- Spark is a general-purpose, memory-based parallel computing framework designed to make data analysis faster.
- Spark bundles the computing frameworks commonly needed in the big data field:
  - Spark Core (offline computing)
  - Spark SQL (interactive queries)
  - Spark Streaming (real-time computing)
  - Spark MLlib (machine learning)
  - Spark GraphX (graph computing)
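To make "unified" concrete, here is a minimal sketch (the app name, master URL, and sample data are illustrative placeholders, not from the original) showing Spark Core and Spark SQL driven by the same SparkSession:

```scala
import org.apache.spark.sql.SparkSession

object UnifiedDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("unified-demo")
      .master("local[*]")          // local mode, just for illustration
      .getOrCreate()

    // Spark Core: the low-level RDD API
    val totalLength = spark.sparkContext
      .parallelize(Seq("spark", "sql", "streaming"))
      .map(_.length)
      .sum()
    println(totalLength)

    // Spark SQL: DataFrames and SQL on the same engine
    spark.range(0, 10).createOrReplaceTempView("numbers")
    spark.sql("SELECT count(*) AS n FROM numbers").show()

    spark.stop()
  }
}
```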
3. Can Spark replace Hadoop?
Not entirely.
Spark Core can only replace MapReduce for offline computing; data storage still depends on HDFS.
Spark + Hadoop is the most popular and promising combination in the big data field.
4. Spark's characteristics
- Speed
  - In-memory computation is more than 100x faster than MapReduce
  - On-disk computation is more than 10x faster than MapReduce
- Ease of use
  - Provides API interfaces for Java, Scala, Python, and R
- One-stop solution
  - Spark Core (offline computing)
  - Spark SQL (interactive queries)
  - Spark Streaming (real-time computing)
  - ...
- Runs on multiple platforms
  - YARN
  - Mesos
  - Standalone
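In practice "runs on multiple platforms" boils down to the master URL. A minimal sketch (the standalone host name is the one used later in this document; everything else is illustrative) of how the same code targets different cluster managers:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PlatformDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("platform-demo")
      // .setMaster("spark://uplooking01:7077")  // standalone cluster
      // .setMaster("yarn")                      // YARN (requires HADOOP_CONF_DIR)
      // .setMaster("mesos://host:5050")         // Mesos
      .setMaster("local[*]")                     // local mode for quick tests
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())      // 5050.0
    sc.stop()
  }
}
```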
5. Disadvantages of Spark
- JVM memory overhead is high: 1 GB of data can easily consume about 5 GB of memory (Project Tungsten is trying to solve this problem)
- There is no effective shared-memory mechanism between different Spark applications (Project Tachyon is trying to introduce distributed memory management so that different Spark applications can share cached data)
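As a side note on the memory overhead (this mitigation is an addition, not from the original): caching an RDD with a serialized storage level keeps the in-memory footprint much closer to the raw data size, trading some CPU for memory. The HDFS path reuses the sample path from the word-count example later in this document.

```scala
import org.apache.spark.storage.StorageLevel

// Inside spark-shell, where `sc` already exists.
val data = sc.textFile("hdfs://ns1/sparktest/")
data.persist(StorageLevel.MEMORY_ONLY_SER)   // store serialized bytes instead of Java objects
println(data.count())                        // the first action materializes the cache
```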
6. Spark vs MR
6.1 Limitations of MR
- The abstraction level is low; everything has to be coded by hand, which makes it hard to use.
- Only two operations, Map and Reduce, are provided, which limits expressiveness.
- A Job has only Map and Reduce phases; complex computations require many Jobs, and the dependencies between Jobs have to be managed by the developers themselves.
- Intermediate results (the output of each reduce) are also written to the HDFS file system.
- High latency: only suitable for batch processing, with insufficient support for interactive and real-time data processing.
- Poor performance for iterative data processing.
6.2 Which of MR's problems does Spark solve?
- The abstraction level is low, and everything has to be coded by hand, which makes it hard to use.
  - Spark raises the abstraction level with RDDs (Resilient Distributed Datasets).
- Only two operations, Map and Reduce, are provided, which limits expressiveness.
  - Spark provides a rich set of operators (transformations and actions).
- A Job has only Map and Reduce phases.
  - A Spark job can have multiple stages.
- Intermediate results are also written to the HDFS file system (slow).
  - Spark keeps intermediate results in memory; when they must be spilled, they are written to local disks rather than HDFS.
- High latency: only suitable for batch processing, with insufficient support for interactive and real-time data processing.
  - Spark SQL and Spark Streaming address interactive and real-time processing.
- Poor performance for iterative data processing.
  - Spark speeds up iterative computation by caching data in memory.
Therefore, it is the trend of technology development that Hadoop MapReduce will be replaced by a new generation of big data processing platforms, and among them Spark is currently the most widely recognized and supported.
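To illustrate the last two points, here is a minimal sketch (the input path and the iteration itself are made-up placeholders): a single chain of operators replaces what would be several MapReduce jobs, and cache() keeps the working set in memory across iterations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iter-demo").setMaster("local[*]"))

    // Parse once, cache in memory, then reuse the RDD on every iteration.
    val values = sc.textFile("hdfs://ns1/sparktest/points.txt")
      .flatMap(_.split(","))
      .map(_.toDouble)
      .cache()

    var center = 0.0
    for (_ <- 1 to 10) {                                   // a toy iterative computation
      center = values.map(v => math.abs(v - center)).mean()
    }
    println(center)
    sc.stop()
  }
}
```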
7. Spark versions
- Spark 1.6.3: uses Scala 2.10.5
- Spark 2.2.0: uses Scala 2.11.8 (Spark 2.x is recommended for new projects)
- Hadoop 2.7.5
8. Installing the single-node version of Spark
- Prepare the installation package spark-2.2.0-bin-hadoop2.7.tgz
  tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz -C /opt/
  mv /opt/spark-2.2.0-bin-hadoop2.7 /opt/spark
- Modify spark-env.sh
  export JAVA_HOME=/opt/jdk
  export SPARK_MASTER_IP=uplooking01
  export SPARK_MASTER_PORT=7077
  export SPARK_WORKER_CORES=4
  export SPARK_WORKER_INSTANCES=1
  export SPARK_WORKER_MEMORY=2g
  export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
- Configure environment variables
  # Configure Spark's environment variables
  export SPARK_HOME=/opt/spark
  export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
- Start the single-node Spark instance
  start-all-spark.sh
- Check that it started
  http://uplooking01:8080
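A quick sanity check (not part of the original steps) is to open a shell against the new master and run a trivial job; if it returns, the worker's executor is accepting tasks:

```scala
// started with: spark-shell --master spark://uplooking01:7077
// `sc` is created by the shell automatically
val even = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()
println(even)   // expect 500
```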
9. Installing a distributed Spark cluster
- Configure spark-env.sh
  [root@uplooking01 /opt/spark/conf]
  export JAVA_HOME=/opt/jdk
  # Host of the master
  export SPARK_MASTER_IP=uplooking01
  # Port the master listens on
  export SPARK_MASTER_PORT=7077
  # Number of CPU cores Spark may use on each worker
  export SPARK_WORKER_CORES=4
  # One worker instance per host
  export SPARK_WORKER_INSTANCES=1
  # Memory available to each worker: 2 GB
  export SPARK_WORKER_MEMORY=2g
  # Directory containing the Hadoop configuration files
  export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
- Configure slaves
  [root@uplooking01 /opt/spark/conf]
  uplooking03
  uplooking04
  uplooking05
- Distribute Spark to the other nodes
  [root@uplooking01 /opt/spark/conf]
  scp -r /opt/spark uplooking02:/opt/
  scp -r /opt/spark uplooking03:/opt/
  scp -r /opt/spark uplooking04:/opt/
  scp -r /opt/spark uplooking05:/opt/
- Distribute the environment variables configured on uplooking01
  [root@uplooking01 /]
  scp -r /etc/profile uplooking02:/etc/
  scp -r /etc/profile uplooking03:/etc/
  scp -r /etc/profile uplooking04:/etc/
  scp -r /etc/profile uplooking05:/etc/
- Start Spark
  [root@uplooking01 /] start-all-spark.sh
10. Spark High Availability Cluster
Stop the running Spark cluster first.
- Modify spark-env.sh
  # Comment out the following two lines
  #export SPARK_MASTER_IP=uplooking01
  #export SPARK_MASTER_PORT=7077
- Add the following line
  export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=uplooking03:2181,uplooking04:2181,uplooking05:2181 -Dspark.deploy.zookeeper.dir=/spark"
- Distribute the modified configuration
  scp /opt/spark/conf/spark-env.sh uplooking02:/opt/spark/conf
  scp /opt/spark/conf/spark-env.sh uplooking03:/opt/spark/conf
  scp /opt/spark/conf/spark-env.sh uplooking04:/opt/spark/conf
  scp /opt/spark/conf/spark-env.sh uplooking05:/opt/spark/conf
- Start the cluster
[root@uplooking01 /] start-all-spark.sh
[root@uplooking02 /] start-master.sh
11. The first Spark-Shell program
spark-shell --master spark://uplooking01:7077
# When starting spark-shell you can specify the resources the application uses (total number of cores and memory per worker):
spark-shell --master spark://uplooking01:7077 --total-executor-cores 6 --executor-memory 1g
# If not specified, by default all cores on each worker and 1 GB of memory per worker are used.
sc.textFile("hdfs://ns1/sparktest/").flatMap(_.split(",")).map((_,1)).reduceByKey(_+_).collect
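The same word count written out step by step, for readability (the output path in the last line is a hypothetical example):

```scala
val lines  = sc.textFile("hdfs://ns1/sparktest/")          // one record per line
val words  = lines.flatMap(_.split(","))                   // split each line on commas
val pairs  = words.map(word => (word, 1))                  // pair each word with a count of 1
val counts = pairs.reduceByKey(_ + _)                      // sum the counts per word
counts.collect().foreach(println)                          // bring the results to the driver
// counts.saveAsTextFile("hdfs://ns1/sparktest-wordcount") // or write them back to HDFS
```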
12. Roles in Spark
- Master
  - Receives requests for submitted applications
  - Schedules resources (starts CoarseGrainedExecutorBackend processes on the workers)
- Worker
  - The executors on each worker are responsible for running tasks
- Spark submitter ===> Driver
  - Submits the Spark application to the master
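A minimal application sketch (class name and numbers are placeholders) that maps onto these roles: the driver below builds the SparkContext and submits to the master, the master starts executors on the workers, and the executors run the tasks. It would be packaged into a jar and handed to the cluster with spark-submit.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RolesDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("roles-demo")
      .setMaster("spark://uplooking01:7077")  // the master that receives the submission
    val sc = new SparkContext(conf)           // this process is the driver

    // The action below is split into tasks that run inside executors on the workers.
    val total = sc.parallelize(1 to 10000).map(_ * 2).reduce(_ + _)
    println(total)                            // 100010000

    sc.stop()
  }
}
```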