Big Data Series: Learning Notes on Getting Started with Spark

Keywords: Spark, Hadoop, Apache, shell

1. Introduction to Spark

  • In 2009, Spark was born at the AMPLab at UC Berkeley. It started as an experimental project with very little code, a lightweight framework.
  • In 2010, UC Berkeley officially open-sourced the Spark project.
  • In June 2013, Spark became a project under the Apache Foundation and entered a period of rapid development, with many very active third-party contributors.
  • In February 2014, Spark became an Apache top-level project at a record pace; at the same time, the big data company Cloudera announced that it would increase its investment in Spark to replace MapReduce.
  • In April 2014, the big data company MapR joined the Spark camp, and Apache Mahout abandoned MapReduce in favor of Spark as its computing engine.
  • In May 2014, Spark 1.0.0 was released.
  • In 2015, Spark became more and more popular in China's IT industry. More and more companies began to deploy or use Spark to replace MR2, Hive, Storm and other traditional big data parallel computing frameworks.

2. What is Spark?

  • Apache Spark™ is a unified analytics engine for large-scale data processing.

  • Spark is a general-purpose, memory-based parallel computing framework designed to make data analysis faster.

  • Spark bundles several computing frameworks commonly used in big data (a minimal usage sketch follows this list):

    • Spark Core (offline computing)
    • Spark SQL (interactive query)
    • Spark Streaming (real-time computing)
    • Spark MLlib (machine learning)
    • Spark GraphX (graph computing)
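As a rough illustration of how these components share one engine, here is a minimal Scala sketch, assuming Spark 2.x and a hypothetical input file /tmp/people.json; it drives the RDD API of Spark Core and the DataFrame/SQL API of Spark SQL from the same SparkSession:

    import org.apache.spark.sql.SparkSession

    object ComponentsDemo {
      def main(args: Array[String]): Unit = {
        // local[*] is used here only for illustration
        val spark = SparkSession.builder()
          .appName("ComponentsDemo")
          .master("local[*]")
          .getOrCreate()

        // Spark Core: the low-level RDD API
        val rdd = spark.sparkContext.parallelize(1 to 100)
        println(rdd.sum())

        // Spark SQL: the DataFrame / SQL API on the same session
        val people = spark.read.json("/tmp/people.json")   // hypothetical input file
        people.createOrReplaceTempView("people")
        spark.sql("SELECT COUNT(*) FROM people").show()

        spark.stop()
      }
    }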

3. Can Spark replace Hadoop?

Not exactly.

Spark Core can only replace MapReduce for offline computing; data storage still depends on HDFS.

Spark + Hadoop is the most popular and promising combination in the big data field.

4. Spark's characteristics

  • Speed

    • In-memory computation can be up to 100x faster than MapReduce
    • On-disk computation can be up to 10x faster than MapReduce
  • Ease of use (see the sketch after this list)

    • Provides APIs in Java, Scala, Python and R
  • One-stop solution

    • Spark Core (offline computing)
    • Spark SQL (interactive query)
    • Spark Streaming (real-time computing)
    • ...
  • Runs on a variety of cluster managers

    • YARN
    • Mesos
    • standalone
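To make the "ease of use" point concrete, here is a small Scala sketch, assuming an existing SparkContext `sc` (as provided by spark-shell), that chains several operators in a couple of lines:

    // sum of the doubled even numbers from 1 to 1000, expressed as a short operator chain
    val nums = sc.parallelize(1 to 1000)
    val result = nums.filter(_ % 2 == 0).map(_ * 2).reduce(_ + _)
    println(result)   // 501000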

5. Disadvantages of Spark

  • JVM memory overhead is high: 1 GB of data typically consumes about 5 GB of memory (Project Tungsten is trying to solve this problem)
  • There is no effective shared-memory mechanism between different Spark applications (Project Tachyon, now Alluxio, is trying to introduce distributed memory management so that different Spark applications can share cached data)

6. Spark vs. MapReduce

6.1 Limitations of MapReduce

  • The abstraction level is low; everything has to be coded by hand, which makes it hard to use
  • Only two operations, Map and Reduce, are provided, so expressiveness is limited
  • A job has only a Map phase and a Reduce phase; complex computations require many jobs, and the dependencies between jobs must be managed by the developers themselves
  • Intermediate results (the output of each reduce) are also written to the HDFS file system
  • High latency: only suitable for batch processing; support for interactive and real-time data processing is insufficient
  • Poor performance for iterative data processing

6.2 Which of MapReduce's problems does Spark solve?

  • The abstraction level is low and everything has to be coded by hand, which makes it hard to use
    • Spark abstracts computation through the RDD (Resilient Distributed Dataset)
  • Only two operations, Map and Reduce, are provided, so expressiveness is limited
    • Spark provides a rich set of operators (see the sketch after this list)
  • A job has only Map and Reduce phases
    • A Spark job can have many stages
  • Intermediate results are also written to the HDFS file system (slow)
    • Intermediate results are kept in memory; when they must be spilled, they are written to local disk rather than HDFS
  • High latency, only suitable for batch processing; support for interactive and real-time data processing is insufficient
    • Spark SQL and Spark Streaming address interactive and real-time processing
  • Poor performance for iterative data processing
    • Caching data in memory improves the performance of iterative computation
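The operator and caching points can be sketched in a few lines of Scala, assuming an existing SparkContext `sc` and a hypothetical log file under the HDFS path used later in these notes; caching keeps the filtered RDD in memory, so repeated actions do not reread the input from HDFS:

    // build an RDD with a chain of operators, then cache it for reuse
    val logs = sc.textFile("hdfs://ns1/sparktest/access.log")   // hypothetical input file
    val errors = logs.filter(_.contains("ERROR")).cache()

    // both actions below reuse the cached, in-memory data instead of rescanning HDFS
    println(errors.count())
    println(errors.filter(_.contains("timeout")).count())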

It is therefore a clear technology trend that Hadoop MapReduce will be replaced by a new generation of big data processing platforms, among which Spark is currently the most widely recognized and supported.

7. Spark version

  • Spark 1.6.3: Scala 2.10.5
  • Spark 2.2.0: Scala 2.11.8 (Spark 2.x is recommended for new projects)
  • Hadoop 2.7.5

8. Installation of Spark stand-alone version

  • Prepare the installation package spark-2.2.0-bin-hadoop2.7.tgz

    tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz -C /opt/
    mv /opt/spark-2.2.0-bin-hadoop2.7/ /opt/spark
    
  • Modify spark-env.sh (copy conf/spark-env.sh.template to conf/spark-env.sh first)

    export JAVA_HOME=/opt/jdk
    export SPARK_MASTER_IP=uplooking01
    export SPARK_MASTER_PORT=7077
    export SPARK_WORKER_CORES=4
    export SPARK_WORKER_INSTANCES=1
    export SPARK_WORKER_MEMORY=2g
    export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
    
  • Configure environment variables (in /etc/profile)

    #Configuring Spark's environment variables
    export SPARK_HOME=/opt/spark
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
    
  • Start the stand-alone version of Spark (the script below is presumably Spark's sbin/start-all.sh renamed, to avoid clashing with Hadoop's start-all.sh)

    start-all-spark.sh
    
  • Verify startup via the master web UI (a quick sanity check follows)

    http://uplooking01:8080
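As a further sanity check (a sketch, assuming the configuration above), launch a shell against the new master with spark-shell --master spark://uplooking01:7077 and run:

    // should print 100 if the master and its worker are up and executing tasks
    println(sc.parallelize(1 to 100).count())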
    

9. Installation of Spark Distributed Cluster

  • Configure spark-env.sh

    [root@uplooking01 /opt/spark/conf]
            export JAVA_HOME=/opt/jdk
            #Host of the master
            export SPARK_MASTER_IP=uplooking01
            #Port the master listens on
            export SPARK_MASTER_PORT=7077
            #Number of CPU cores Spark may use on each worker
            export SPARK_WORKER_CORES=4
            #One worker instance per host
            export SPARK_WORKER_INSTANCES=1
            #Memory available to each worker: 2 GB
            export SPARK_WORKER_MEMORY=2g
            #Directory containing the Hadoop configuration files
            export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
    
  • Configure slaves

    [root@uplooking01 /opt/spark/conf]
            uplooking03
            uplooking04
            uplooking05
    
  • Distribute Spark to the other nodes

    [root@uplooking01 /opt/spark/conf]	
            scp -r /opt/spark  uplooking02:/opt/
            scp -r /opt/spark  uplooking03:/opt/
            scp -r /opt/spark  uplooking04:/opt/
            scp -r /opt/spark  uplooking05:/opt/
    
  • Distribute the environment variables configured on uplooking01

    [root@uplooking01 /]	
            scp -r /etc/profile  uplooking02:/etc/
            scp -r /etc/profile  uplooking03:/etc/
            scp -r /etc/profile  uplooking04:/etc/
            scp -r /etc/profile  uplooking05:/etc/
    
  • Start Spark

    [root@uplooking01 /]	
    	start-all-spark.sh
    

10. Spark High Availability Cluster

Stop the running Spark cluster first.

  • Modify spark-env.sh

    #Comment out the following two lines
    #export SPARK_MASTER_IP=uplooking01
    #export SPARK_MASTER_PORT=7077
    
  • Add the following line

    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=uplooking03:2181,uplooking04:2181,uplooking05:2181 -Dspark.deploy.zookeeper.dir=/spark"
    
  • Distribute the modified configuration

    scp /opt/spark/conf/spark-env.sh uplooking02:/opt/spark/conf
    scp /opt/spark/conf/spark-env.sh uplooking03:/opt/spark/conf
    scp /opt/spark/conf/spark-env.sh uplooking04:/opt/spark/conf
    scp /opt/spark/conf/spark-env.sh uplooking05:/opt/spark/conf
    
  • Start cluster

    [root@uplooking01 /]
    	start-all-spark.sh
    
    [root@uplooking02 /]
    	start-master.sh
    

11. The first Spark-Shell program

spark-shell --master spark://uplooking01:7077
#spark-shell can specify the resources the application uses (total number of cores, memory per executor) at startup:
spark-shell --master spark://uplooking01:7077 --total-executor-cores 6 --executor-memory 1g

#If not specified, the default is to use all the cores on every worker and 1 GB of memory per executor

#Inside spark-shell: a word count over the files under hdfs://ns1/sparktest/, splitting lines on commas
sc.textFile("hdfs://ns1/sparktest/").flatMap(_.split(",")).map((_,1)).reduceByKey(_+_).collect
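A small variation, as a sketch with a hypothetical output directory: instead of collecting the results to the driver, write them back to HDFS:

    sc.textFile("hdfs://ns1/sparktest/")
      .flatMap(_.split(","))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs://ns1/sparktest-wordcount-out")   // hypothetical output directory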

12. Roles in Spark

  • Master

    • Receives requests for submitted applications

    • Schedules cluster resources (starts CoarseGrainedExecutorBackend processes on the workers)

  • Worker

    • The executors on each worker are responsible for executing tasks
  • Spark-Submitter ===> Driver

    • Submits the Spark application to the master (a driver sketch follows this list)
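To make the Driver role concrete, here is a minimal sketch of a standalone application, assuming the cluster configured above (master at spark://uplooking01:7077) and the same HDFS input as in section 11; the driver builds the SparkContext and asks the Master for resources, the Master starts executors on the Workers, and those executors run the tasks:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // the driver describes the application and connects to the Master
        val conf = new SparkConf()
          .setAppName("WordCount")
          .setMaster("spark://uplooking01:7077")
        val sc = new SparkContext(conf)

        // these transformations run as tasks inside the executors on the Workers
        sc.textFile("hdfs://ns1/sparktest/")
          .flatMap(_.split(","))
          .map((_, 1))
          .reduceByKey(_ + _)
          .collect()
          .foreach(println)

        sc.stop()
      }
    }

Packaged as a jar, such a program would typically be submitted to the cluster with spark-submit --master spark://uplooking01:7077 --class WordCount <app.jar>, where <app.jar> stands for the application jar.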

13. General flow of Spark job submission
