Big Data Series: Learning Notes on Getting Started with Spark

Keywords: Spark, Hadoop, Apache, shell

1. Introduction to Spark

  • In 2009, Spark was born at the AMPLab at UC Berkeley. It started as an experimental project with very little code, a lightweight framework.
  • In 2010, UC Berkeley officially open-sourced the Spark project.
  • In June 2013, Spark became a project under the Apache Foundation and entered a period of rapid development, with many very active third-party contributors.
  • In February 2014, Spark became an Apache top-level project at a record pace; at the same time, the big data company Cloudera announced that it would increase its investment in Spark to replace MapReduce.
  • In April 2014, the big data company MapR joined the Spark camp, and Apache Mahout abandoned MapReduce in favor of Spark as its computing engine.
  • In May 2014, Spark 1.0.0 was released.
  • In 2015, Spark became more and more popular in China's IT industry. More and more companies began to deploy or use Spark to replace MR2, Hive, Storm and other traditional big data parallel computing frameworks.

2. What is Spark?

  • Apache Spark™ is a unified analytics engine for large-scale data processing.

  • Spark is a general-purpose, memory-based parallel computing framework designed to make data analysis faster.

  • Spark bundles several computing frameworks commonly used in big data (a minimal usage sketch follows this list):

    • Spark Core (offline computing)
    • Spark SQL (interactive query)
    • Spark Streaming (real-time computing)
    • Spark MLlib (machine learning)
    • Spark GraphX (graph computing)
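As a rough illustration of how these components share one engine, here is a minimal Scala sketch, assuming Spark 2.x and a hypothetical input file /tmp/people.json; it drives the RDD API of Spark Core and the DataFrame/SQL API of Spark SQL from the same SparkSession:

    import org.apache.spark.sql.SparkSession

    object ComponentsDemo {
      def main(args: Array[String]): Unit = {
        // local[*] is used here only for illustration
        val spark = SparkSession.builder()
          .appName("ComponentsDemo")
          .master("local[*]")
          .getOrCreate()

        // Spark Core: the low-level RDD API
        val rdd = spark.sparkContext.parallelize(1 to 100)
        println(rdd.sum())

        // Spark SQL: the DataFrame / SQL API on the same session
        val people = spark.read.json("/tmp/people.json")   // hypothetical input file
        people.createOrReplaceTempView("people")
        spark.sql("SELECT COUNT(*) FROM people").show()

        spark.stop()
      }
    }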

3. Can Spark replace Hadoop?

Not exactly.

Spark Core can only replace MapReduce for offline computing; data storage still depends on HDFS.

Spark + Hadoop is the most popular and promising combination in the big data field.

4. Spark's characteristics

  • Speed

    • In-memory computation can be up to 100x faster than MapReduce
    • On-disk computation can be up to 10x faster than MapReduce
  • Ease of use (see the sketch after this list)

    • Provides APIs in Java, Scala, Python and R
  • One-stop solution

    • Spark Core (offline computing)
    • Spark SQL (interactive query)
    • Spark Streaming (real-time computing)
    • ...
  • Runs on a variety of cluster managers

    • YARN
    • Mesos
    • standalone
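To make the "ease of use" point concrete, here is a small Scala sketch, assuming an existing SparkContext `sc` (as provided by spark-shell), that chains several operators in a couple of lines:

    // sum of the doubled even numbers from 1 to 1000, expressed as a short operator chain
    val nums = sc.parallelize(1 to 1000)
    val result = nums.filter(_ % 2 == 0).map(_ * 2).reduce(_ + _)
    println(result)   // 501000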

5. Disadvantages of Spark

  • JVM memory overhead is high: 1 GB of data typically consumes about 5 GB of memory (Project Tungsten is trying to solve this problem)
  • There is no effective shared-memory mechanism between different Spark applications (Project Tachyon, now Alluxio, is trying to introduce distributed memory management so that different Spark applications can share cached data)

6. Spark vs. MapReduce

6.1 Limitations of MapReduce

  • The abstraction level is low; everything has to be coded by hand, which makes it hard to use
  • Only two operations, Map and Reduce, are provided, so expressiveness is limited
  • A job has only a Map phase and a Reduce phase; complex computations require many jobs, and the dependencies between jobs must be managed by the developers themselves
  • Intermediate results (the output of each reduce) are also written to the HDFS file system
  • High latency: only suitable for batch processing; support for interactive and real-time data processing is insufficient
  • Poor performance for iterative data processing

6.2 Which of MapReduce's problems does Spark solve?

  • The abstraction level is low and everything has to be coded by hand, which makes it hard to use
    • Spark abstracts computation through the RDD (Resilient Distributed Dataset)
  • Only two operations, Map and Reduce, are provided, so expressiveness is limited
    • Spark provides a rich set of operators (see the sketch after this list)
  • A job has only Map and Reduce phases
    • A Spark job can have many stages
  • Intermediate results are also written to the HDFS file system (slow)
    • Intermediate results are kept in memory; when they must be spilled, they are written to local disk rather than HDFS
  • High latency, only suitable for batch processing; support for interactive and real-time data processing is insufficient
    • Spark SQL and Spark Streaming address interactive and real-time processing
  • Poor performance for iterative data processing
    • Caching data in memory improves the performance of iterative computation
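The operator and caching points can be sketched in a few lines of Scala, assuming an existing SparkContext `sc` and a hypothetical log file under the HDFS path used later in these notes; caching keeps the filtered RDD in memory, so repeated actions do not reread the input from HDFS:

    // build an RDD with a chain of operators, then cache it for reuse
    val logs = sc.textFile("hdfs://ns1/sparktest/access.log")   // hypothetical input file
    val errors = logs.filter(_.contains("ERROR")).cache()

    // both actions below reuse the cached, in-memory data instead of rescanning HDFS
    println(errors.count())
    println(errors.filter(_.contains("timeout")).count())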

It is therefore a clear technology trend that Hadoop MapReduce will be replaced by a new generation of big data processing platforms, among which Spark is currently the most widely recognized and supported.

7. Spark version

  • Spark 1.6.3: Scala 2.10.5
  • Spark 2.2.0: Scala 2.11.8 (Spark 2.x is recommended for new projects)
  • Hadoop 2.7.5

8. Installation of Spark stand-alone version

  • Prepare the installation package spark-2.2.0-bin-hadoop2.7.tgz

    tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz -C /opt/
    mv /opt/spark-2.2.0-bin-hadoop2.7/ /opt/spark
    
  • Modify spark-env.sh (copy conf/spark-env.sh.template to conf/spark-env.sh first)

    export JAVA_HOME=/opt/jdk
    export SPARK_MASTER_IP=uplooking01
    export SPARK_MASTER_PORT=7077
    export SPARK_WORKER_CORES=4
    export SPARK_WORKER_INSTANCES=1
    export SPARK_WORKER_MEMORY=2g
    export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
    
  • Configure environment variables (in /etc/profile)

    #Configuring Spark's environment variables
    export SPARK_HOME=/opt/spark
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
    
  • Start the stand-alone version of Spark (the script below is presumably Spark's sbin/start-all.sh renamed, to avoid clashing with Hadoop's start-all.sh)

    start-all-spark.sh
    
  • Verify startup via the master web UI (a quick sanity check follows)

    http://uplooking01:8080
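As a further sanity check (a sketch, assuming the configuration above), launch a shell against the new master with spark-shell --master spark://uplooking01:7077 and run:

    // should print 100 if the master and its worker are up and executing tasks
    println(sc.parallelize(1 to 100).count())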
    

9. Installation of Spark Distributed Cluster

  • Configure spark-env.sh

    [root@uplooking01 /opt/spark/conf]
            export JAVA_HOME=/opt/jdk
            #Host of the master
            export SPARK_MASTER_IP=uplooking01
            #Port the master listens on
            export SPARK_MASTER_PORT=7077
            #Number of CPU cores Spark may use on each worker
            export SPARK_WORKER_CORES=4
            #One worker instance per host
            export SPARK_WORKER_INSTANCES=1
            #Memory available to each worker: 2 GB
            export SPARK_WORKER_MEMORY=2g
            #Directory containing the Hadoop configuration files
            export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
    
  • Configure slaves

    [root@uplooking01 /opt/spark/conf]
            uplooking03
            uplooking04
            uplooking05
    
  • Distribute Spark to the other nodes

    [root@uplooking01 /opt/spark/conf]	
            scp -r /opt/spark  uplooking02:/opt/
            scp -r /opt/spark  uplooking03:/opt/
            scp -r /opt/spark  uplooking04:/opt/
            scp -r /opt/spark  uplooking05:/opt/
    
  • Distribute the environment variables configured on uplooking01

    [root@uplooking01 /]	
            scp -r /etc/profile  uplooking02:/etc/
            scp -r /etc/profile  uplooking03:/etc/
            scp -r /etc/profile  uplooking04:/etc/
            scp -r /etc/profile  uplooking05:/etc/
    
  • Start Spark

    [root@uplooking01 /]	
    	start-all-spark.sh
    

10. Spark High Availability Cluster

Stop the running Spark cluster first.

  • Modify spark-env.sh

    #Comment out the following two lines
    #export SPARK_MASTER_IP=uplooking01
    #export SPARK_MASTER_PORT=7077
    
  • Add the following line

    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=uplooking03:2181,uplooking04:2181,uplooking05:2181 -Dspark.deploy.zookeeper.dir=/spark"
    
  • Distribute the modified configuration

    scp /opt/spark/conf/spark-env.sh uplooking02:/opt/spark/conf
    scp /opt/spark/conf/spark-env.sh uplooking03:/opt/spark/conf
    scp /opt/spark/conf/spark-env.sh uplooking04:/opt/spark/conf
    scp /opt/spark/conf/spark-env.sh uplooking05:/opt/spark/conf
    
  • Start cluster

    [root@uplooking01 /]
    	start-all-spark.sh
    
    [root@uplooking02 /]
    	start-master.sh
    

11. The first Spark-Shell program

spark-shell --master spark://uplooking01:7077
#spark-shell can specify the resources the application uses (total number of cores, memory per executor) at startup:
spark-shell --master spark://uplooking01:7077 --total-executor-cores 6 --executor-memory 1g

#If not specified, the default is to use all the cores on every worker and 1 GB of memory per executor

#Inside spark-shell: a word count over the files under hdfs://ns1/sparktest/, splitting lines on commas
sc.textFile("hdfs://ns1/sparktest/").flatMap(_.split(",")).map((_,1)).reduceByKey(_+_).collect
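A small variation, as a sketch with a hypothetical output directory: instead of collecting the results to the driver, write them back to HDFS:

    sc.textFile("hdfs://ns1/sparktest/")
      .flatMap(_.split(","))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs://ns1/sparktest-wordcount-out")   // hypothetical output directory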

12. Roles in Spark

  • Master

    • Receives requests for submitted applications

    • Schedules cluster resources (starts CoarseGrainedExecutorBackend processes on the workers)

  • Worker

    • The executors on each worker are responsible for executing tasks
  • Spark-Submitter ===> Driver

    • Submits the Spark application to the master (a driver sketch follows this list)
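To make the Driver role concrete, here is a minimal sketch of a standalone application, assuming the cluster configured above (master at spark://uplooking01:7077) and the same HDFS input as in section 11; the driver builds the SparkContext and asks the Master for resources, the Master starts executors on the Workers, and those executors run the tasks:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // the driver describes the application and connects to the Master
        val conf = new SparkConf()
          .setAppName("WordCount")
          .setMaster("spark://uplooking01:7077")
        val sc = new SparkContext(conf)

        // these transformations run as tasks inside the executors on the Workers
        sc.textFile("hdfs://ns1/sparktest/")
          .flatMap(_.split(","))
          .map((_, 1))
          .reduceByKey(_ + _)
          .collect()
          .foreach(println)

        sc.stop()
      }
    }

Packaged as a jar, such a program would typically be submitted to the cluster with spark-submit --master spark://uplooking01:7077 --class WordCount <app.jar>, where <app.jar> stands for the application jar.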

13. General flow of Spark job submission
