1. Local mode
- Local mode is an environment in which Spark code runs on a single machine without any additional node resources
- It is commonly used for teaching, debugging, demonstrations, and so on. The environment in which the code was previously run in IDEA is the development environment, which is a different thing.
1.1 Installation Configuration in Local Mode
Decompress the archive
Upload the spark-3.0.0-bin-hadoop3.2.tgz file to Linux and extract it to the chosen location. The path must not contain Chinese characters or spaces. Decompression will not be emphasized again in later parts of the courseware.
tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module
cd /opt/module
mv spark-3.0.0-bin-hadoop3.2 spark-local
1.2 Use of local mode
1. Start Local Environment
Enter the extracted directory and launch the Local environment by executing the following command
bin/spark-shell
2. Access the Web UI monitoring page
http://<virtual machine address>:4040
This page is only accessible while the Spark application is running; once the application finishes, the page can no longer be reached
3. Command-line tool
Create a word.txt file in the data directory under the extracted folder, then execute the following code in the command-line tool (this is the simplified version of the code written in IDEA)
sc.textFile("data/word.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
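For reference, the fuller version of this word count as it might be written in IDEA could look like the following sketch. It is a minimal example, assuming Spark 3.0.0 on the classpath and a data/word.txt file in the working directory; the object name WordCount is only illustrative.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in Local mode using all available cores
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    sc.textFile("data/word.txt")   // read the file line by line
      .flatMap(_.split(" "))       // split each line into words
      .map((_, 1))                 // pair each word with the count 1
      .reduceByKey(_ + _)          // sum the counts per word
      .collect()                   // bring the results back to the driver
      .foreach(println)

    sc.stop()
  }
}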
4. Exit Local mode
Press Ctrl+C or enter the Scala command :quit
5. Submit an application
In practice, applications are developed in IDEA, packaged into jar files, and then the jar is submitted to run.
Submit the packaged program:
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[2] \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10
- --class represents the main class of the program to be executed; replace it with the main class of the application we wrote ourselves
- --master local[2] is the deployment mode, which defaults to local mode; the number represents the number of allocated virtual CPU cores (the number of Tasks that can run at the same time)
- spark-examples_2.12-3.0.0.jar is the jar package containing the application class to run; when submitting our own program, set it to the jar we packaged ourselves
- The number 10 is the program's entry parameter, which sets the number of tasks for the current application
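To make these parameters concrete, the sketch below shows roughly what a main class like SparkPi does with the trailing argument. It is a simplified illustration under the assumption that args(0) controls the number of tasks; it is not the actual example source shipped with Spark, and the object name MySparkPi is hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

object MySparkPi {
  def main(args: Array[String]): Unit = {
    // No setMaster here: the master is supplied by spark-submit (e.g. --master local[2])
    val sc = new SparkContext(new SparkConf().setAppName("MySparkPi"))

    val slices = if (args.length > 0) args(0).toInt else 2  // the trailing "10" arrives here
    val n = 100000 * slices                                 // total sample points

    // Estimate pi by sampling random points in the unit square across `slices` tasks
    val count = sc.parallelize(1 to n, slices).map { _ =>
      val x = math.random * 2 - 1
      val y = math.random * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)

    println(s"Pi is roughly ${4.0 * count / n}")
    sc.stop()
  }
}

Packaged into a jar, a class like this would replace org.apache.spark.examples.SparkPi after --class, and the jar path would replace the examples jar.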
2. Standalone mode
Local mode runs on only a single machine, but in real work the application still has to be submitted to a cluster for execution
What is Standalone mode?
- Spark can form a cluster on its own, without depending on an external resource scheduling framework such as YARN; this is what we call Standalone mode
- Spark's Standalone mode is also a classic master-slave architecture
2.1 standalone mode configuration
1. Cluster planning:
2. Unzip files
Upload the spark-3.0.0-bin-hadoop3.2.tgz file to Linux and unzip it in the specified location
tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module
cd /opt/module
mv spark-3.0.0-bin-hadoop3.2 spark-standalone
3. Modify the configuration file
1. slaves configuration
- Enter the conf directory under the extracted path and rename the slaves.template file to slaves
Removing the .template suffix makes the file take effect
mv slaves.template slaves
- Modify the slaves file to add the worker nodes
hadoop102
hadoop103
hadoop104
2. master configuration
- Rename the spark-env.sh.template file to spark-env.sh to configure the master
mv spark-env.sh.template spark-env.sh
- Modify the spark-env.sh file to add the JAVA_HOME environment variable and the master node of the cluster
export JAVA_HOME=/opt/module/jdk1.8.0_144
SPARK_MASTER_HOST=hadoop102
SPARK_MASTER_PORT=7077
Note: port 7077 is comparable to port 8020 used for internal communication in Hadoop 3; confirm the port against your own Hadoop configuration
3. Distribute the Spark home directory
Because Standalone is a cluster mode, distribute the spark-standalone directory to the other nodes
xsync spark-standalone
2.2 Using standalone clusters
- Execute the script sbin/start-all.sh to start the cluster
After starting, you can see that the Master starts first, followed by the three Workers
- View the running processes on the three servers
================linux1================
3330 Jps
3238 Worker
3163 Master
================linux2================
2966 Jps
2908 Worker
================linux3================
2978 Worker
3036 Jps
- View the Master Resource Monitoring Web UI interface: http://hadoop102:8080
4. Submit an application
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10
- --class represents the main class of the program to be executed
- --master spark://hadoop102:7077 is the Standalone deployment mode, connecting to the Spark cluster
- spark-examples_2.12-3.0.0.jar is the jar package where the running application class resides
- The number 10 is the program's entry parameter, which sets the number of tasks for the current application
While the job executes, multiple Java processes are created; they are used to run the Task threads
3. yarn mode
Standalone mode provides computing resources through Spark itself, without needing other frameworks. This approach reduces coupling with third-party resource frameworks and gives strong independence. Keep in mind, however, that Spark is primarily a computing framework rather than a resource scheduling framework, so the resource scheduling it provides is not its strength; it is more reliable to integrate with a professional resource scheduling framework. So let's move on to learn how Spark works in the powerful YARN environment (YARN is widely used in domestic production work).
3.1 yarn mode configuration
1. Unzip the file
Upload the spark-3.0.0-bin-hadoop3.2.tgz file to Linux and extract it to the specified location
tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module
cd /opt/module
mv spark-3.0.0-bin-hadoop3.2 spark-yarn
2. Modify the configuration file
- Modify and distribute the Hadoop configuration file /opt/module/hadoop/etc/hadoop/yarn-site.xml
<!-- Whether to start a thread to check the amount of physical memory each task is using and kill the task if it exceeds the allocated value; default is true -->
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<!-- Whether to start a thread to check the amount of virtual memory each task is using and kill the task if it exceeds the allocated value; default is true -->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
YARN is a resource scheduling platform: the ResourceManager allocates a container on a node with a fixed amount of memory (for example, 1 GB), and a task is not allowed to use more than that amount. If it does, it encroaches on the resources of other containers, so YARN has to shut it down.
- Modify conf/spark-env.sh to add the JAVA_HOME and YARN_CONF_DIR configuration
mv spark-env.sh.template spark-env.sh
vim spark-env.sh
export JAVA_HOME=/opt/module/jdk1.8.0_212
YARN_CONF_DIR=/opt/module/hadoop/etc/hadoop  # the directory where YARN's configuration files are located, so that the yarn-site.xml file can be found
3.2 Using spark on yarn
- In this mode, the HDFS and YARN clusters must be started first
sbin/start-dfs.sh
sbin/start-yarn.sh
- Submit an application
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
./examples/jars/spark-examples_2.12-3.0.0.jar \
10
--master yarn means the running environment is YARN
--deploy-mode cluster means cluster deployment mode
4. Spark task submission parameter description
When submitting an application, several submission parameters are usually passed at the same time
bin/spark-submit \
--class <main-class> \
--master <master-url> \
... # other options
<application-jar> \
[application-arguments]
--class
The class containing the main function of the Spark program
--master
The mode (environment) in which the Spark program runs
- local[*] indicates running in Local mode
- spark://linux1:7077 indicates running in Standalone mode
- yarn indicates running in the YARN environment
--executor-memory 1G
Specifies 1 GB of available memory for each executor
This should match the cluster's memory configuration and be analyzed case by case
--total-executor-cores 2
Specifies that all executors together use two CPU cores in total
--executor-cores
Specifies the number of CPU cores used by each executor
application-jar
- Packaged application jar with dependencies.
- This URL must be globally visible in the cluster.
For example, an hdfs:// path on a shared storage system
If it is a file:// path, then every node must have the same jar at that path
application-arguments
Parameters passed to the main() method
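As a small illustration of how application-arguments reach the program, the following sketch assumes a hypothetical driver class in which args(0) is an input path and args(1) is an output path; the class name and paths are only examples.

import org.apache.spark.{SparkConf, SparkContext}

object SubmitArgsDemo {
  def main(args: Array[String]): Unit = {
    // Everything after <application-jar> on the spark-submit command line lands in args,
    // e.g. bin/spark-submit --class SubmitArgsDemo --master yarn demo.jar in.txt out
    val Array(inputPath, outputPath) = args.take(2)

    val sc = new SparkContext(new SparkConf().setAppName("SubmitArgsDemo"))
    sc.textFile(inputPath)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(outputPath)   // write the word counts to the output path
    sc.stop()
  }
}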