How to quickly build a Spark distributed architecture for big data

Keywords: Hadoop Spark Zookeeper xml

Build our Spark platform from scratch

1. Preparing the CentOS environment

To build a real cluster environment with a highly available architecture, we need at least three machines as cluster nodes. I bought three Alibaba Cloud servers to serve as our cluster nodes.

 

As the names imply, the master node coordinates the cluster and the slave nodes do the work assigned to them. In practice the distinction is not so sharp in our cluster, because every node ends up "working hard", but it is still worth keeping the concept clear while building the cluster. The rest of this guide refers to the nodes by the hostnames master, slave1 and slave2.
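Because the nodes are addressed by hostname throughout, every machine must be able to resolve master, slave1 and slave2. A minimal sketch for appending the mappings to /etc/hosts; the private IP addresses below are placeholders (assumptions), not real addresses:

# run on every node; replace the IPs with your own private addresses
cat >> /etc/hosts <<'EOF'
192.168.0.10 master
192.168.0.11 slave1
192.168.0.12 slave2
EOF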

2. Download jdk

  • 1. Download jdk1.8 tar.gz package
wget https://download.oracle.com/otn-pub/java/jdk/8u201-b09/42970487e3af4f5aa5bca3f542482c60/jdk-8u201-linux-x64.tar.gz
  • 2. Decompression
tar -zxvf jdk-8u201-linux-x64.tar.gz

After decompression, you will get the jdk1.8.0_201 directory

 

  • 3. Configure environment variables

Modify profile

vi /etc/profile

Add as follows

export JAVA_HOME=/usr/local/java1.8/jdk1.8.0_201
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

 

Source the file to make it take effect

source /etc/profile

Check whether it is effective

java -version

 

If the Java version information shown in the figure appears, the installation succeeded.

The steps above must be performed identically on all three virtual machines; a sketch for replicating them from master follows.
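To avoid repeating the installation by hand, the JDK tarball and the /etc/profile changes can be pushed from master to the other nodes. A minimal sketch, assuming passwordless SSH as root and that the tarball sits in the current directory (both assumptions); the Hadoop and Spark start scripts used later also rely on passwordless SSH:

# set up passwordless SSH once, from master to every node
ssh-keygen -t rsa
for h in master slave1 slave2; do ssh-copy-id root@$h; done

# copy the JDK and the profile changes to the slaves, then unpack remotely
for h in slave1 slave2; do
  ssh root@$h "mkdir -p /usr/local/java1.8"
  scp jdk-8u201-linux-x64.tar.gz root@$h:/usr/local/java1.8/
  scp /etc/profile root@$h:/etc/profile
  ssh root@$h "cd /usr/local/java1.8 && tar -zxf jdk-8u201-linux-x64.tar.gz"
done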

3. Install zookeeper

  • Download zookeeper package
wget https://mirrors.tuna.tsinghua.edu.cn/apache/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz

 

  • decompression
tar -zxvf zookeeper-3.4.13.tar.gz

 

  • Enter the zookeeper configuration directory
cd zookeeper-3.4.13/conf
  • Copy profile template
cp zoo_sample.cfg zoo.cfg

 

  • Modify the content of zoo.cfg after copying
dataDir=/home/hadoop/data/zkdata
dataLogDir=/home/hadoop/log/zklog
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888

 

 

  • Configure environment variables
export ZOOKEEPER_HOME=/usr/local/zookeeper/zookeeper-3.4.13
export PATH=$PATH:$ZOOKEEPER_HOME/bin

 

  • Make environment variables effective
source /etc/profile
  • Note this line in the configuration file above; it is the data directory we need to create
dataDir=/home/hadoop/data/zkdata
  • We create the directory manually and enter it
mkdir -p /home/hadoop/data/zkdata
cd /home/hadoop/data/zkdata/
echo 3 > myid

 

  • You need to pay special attention to this
echo 1 > myid
  • The value must match the server.N entries in this configuration: on master we echo 1, on slave1 echo 2, and on slave2 echo 3 (a per-host sketch follows after the configuration below)
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888
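A minimal sketch for writing the matching myid on every host from master, assuming passwordless SSH and the same zkdata path on every node:

# server.N in zoo.cfg must match the myid on that host: master=1, slave1=2, slave2=3
i=1
for h in master slave1 slave2; do
  ssh root@$h "mkdir -p /home/hadoop/data/zkdata && echo $i > /home/hadoop/data/zkdata/myid"
  i=$((i+1))
done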

 

 

  • Start test after configuration
zkServer.sh start

 

 

  • Check whether the startup is successful after startup
zkServer.sh status

 

 

All three virtual machines must perform the steps above; the only difference between them is the value written by echo.
  • View status after starting in master

 

  • View status after starting in slave1

 

The Mode field differs between the nodes (one leader, the rest followers); this is ZooKeeper's election mechanism at work, which will be explained separately later. At this point the ZooKeeper cluster is up.
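A quick way to confirm the leader/follower roles from a single terminal, assuming passwordless SSH and that the ZooKeeper environment variables are set on every node:

for h in master slave1 slave2; do
  echo "== $h =="
  ssh root@$h "source /etc/profile && zkServer.sh status"
done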

4. Install hadoop

  • 1. Download hadoop-2.7.7.tar.gz through wget
wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
  • 2. Decompress after downloading

Extracting it produces a hadoop-2.7.7 directory

tar -zxvf hadoop-2.7.7.tar.gz

 

 

 

  • 3. Configure hadoop environment variables

Modify profile

vi /etc/profile
  • Add hadoop environment variable
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

 

  • Make environment variables effective
source /etc/profile
  • After configuration, check whether it is effective
 hadoop version

 

 

  • Enter hadoop-2.7.7/etc/hadoop
  • Edit core-site.xml
vi core-site.xml 
  • Add configuration
<configuration>
    <!-- Set the nameservice of HDFS to myha01 -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://myha01/</value>
    </property>
    <!-- Specify the Hadoop temporary directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/data/hadoopdata/</value>
    </property>
    <!-- Specify the ZooKeeper addresses -->
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>master:2181,slave1:2181,slave2:2181</value>
    </property>
    <!-- Timeout for Hadoop connections to ZooKeeper -->
    <property>
        <name>ha.zookeeper.session-timeout.ms</name>
        <value>1000</value>
        <description>ms</description>
    </property>
</configuration>

 

  • Copy mapred-site.xml.template
cp mapred-site.xml.template mapred-site.xml

 

 

  • Edit mapred-site.xml
vi mapred-site.xml
  • Add the following
<configuration>
    <!-- Specify that the MapReduce framework runs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- Specify the MapReduce JobHistory address -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
    </property>
    <!-- The JobHistory server's web address -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
    </property>
</configuration>
  • Edit hdfs-site.xml
vi hdfs-site.xml 
  • Add the following
<configuration>
    <!-- Specify the number of replicas -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <!-- Configure the data storage directories of the NameNode and DataNode -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/data/hadoopdata/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/data/hadoopdata/dfs/data</value>
    </property>
    <!-- Enable webhdfs -->
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <!-- Set the nameservice of HDFS to myha01; it must match core-site.xml.
         dfs.ha.namenodes.[nameservice id] sets a unique identifier for each NameNode in the nameservice.
         Configure a comma-separated list of NameNode IDs so that DataNodes can identify all NameNodes.
         For example, use "myha01" as the nameservice ID and "nn1" and "nn2" as the NameNode identifiers. -->
    <property>
        <name>dfs.nameservices</name>
        <value>myha01</value>
    </property>
    <!-- myha01 has two NameNodes: nn1 and nn2 -->
    <property>
        <name>dfs.ha.namenodes.myha01</name>
        <value>nn1,nn2</value>
    </property>
    <!-- RPC address of nn1 -->
    <property>
        <name>dfs.namenode.rpc-address.myha01.nn1</name>
        <value>master:9000</value>
    </property>

 

 

  • Edit yarn-site.xml
vi yarn-site.xml 
  • Add the following
<configuration>
    <!-- Enable ResourceManager high availability -->
    <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
    </property>
    <!-- Specify the RM cluster id -->
    <property>
        <name>yarn.resourcemanager.cluster-id</name>
        <value>yrc</value>
    </property>
    <!-- Specify the RM names -->
    <property>
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm1,rm2</value>
    </property>
    <!-- Specify the address of each RM -->
    <property>
        <name>yarn.resourcemanager.hostname.rm1</name>
        <value>slave1</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname.rm2</name>
        <value>slave2</value>
    </property>
    <!-- Specify the ZooKeeper cluster address -->
    <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>master:2181,slave1:2181,slave2:2181</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>86400</value>
    </property>
    <!-- Enable automatic recovery -->
    <property>
        <name>yarn.resourcemanager.recovery.enabled</name>
        <value>true</value>
    </property>
    <!-- Store the ResourceManager state information in the ZooKeeper cluster -->
    <property>
        <name>yarn.resourcemanager.store.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
    </property>
</configuration>

 

 

  • Finally, edit the slaves file
master
slave1
slave2

 

 

The steps above must be performed identically on all three virtual machines; a sketch for distributing the configuration follows.
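Rather than editing the XML files three times, the configuration directory can be pushed from master to the slaves. A minimal sketch, assuming passwordless SSH and the same /usr/local/hadoop/hadoop-2.7.7 layout on every node:

for h in slave1 slave2; do
  scp -r /usr/local/hadoop/hadoop-2.7.7/etc/hadoop/ root@$h:/usr/local/hadoop/hadoop-2.7.7/etc/
  scp /etc/profile root@$h:/etc/profile
done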
  • Then you can start hadoop
  • First, start the journal node on three nodes. Remember to operate on all three nodes
hadoop-daemon.sh start journalnode
  • After the operation completes, use the jps command to check; you should see the following processes

 

 

 

  • QuorumPeerMain is the ZooKeeper process, and JournalNode is the process we just started
  • Then format the namenode of the master node
hadoop namenode -format

 

 

  • Pay attention to the red box
  • After formatting, check the contents of the /home/hadoop/data/hadoopdata directory

 

 

  • Copy the contents of this directory to slave1. slave1 is our standby node; we need it for the high-availability setup, so that when the master goes down, slave1 can take over its work.
cd ..
ssh root@slave1 "mkdir -p /home/hadoop/data"
scp -r hadoopdata/ root@slave1:/home/hadoop/data/

 

 

  • This keeps the metadata of the primary and standby NameNodes consistent
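As an alternative to copying the directory by hand (a hedged sketch, not what was done above), the standby NameNode can also pull the metadata itself once the active NameNode is up:

# run on slave1 after the NameNode on master has been formatted and started
hdfs namenode -bootstrapStandby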

Then you can start hadoop

  • Start HDFS at the master node first
start-dfs.sh 

 

 

 

  • Then start YARN with start-yarn.sh. Note that start-yarn.sh needs to be run on slave2, one of the ResourceManager nodes (see the note after the command)
start-yarn.sh 
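Note that start-yarn.sh only starts the ResourceManager on the node where it is run, plus the NodeManagers. In this HA setup the other ResourceManager (rm1 on slave1, per yarn-site.xml) may need to be brought up manually; a minimal sketch:

# run on slave1
yarn-daemon.sh start resourcemanager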

 

 

  • View three hosts with jps

master

 

 

slave1

 

 

slave2

 

 

  • Note that both master and slave1 run a NameNode process. In fact, only one of them is in the active state and the other is in standby. How can we confirm this? We can enter master:50070 in the browser to access it

 

 

  • Enter slave1:50070 in the browser to access it

 

  • Another way is to view the two nodes we have configured
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
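The same kind of check works for the two ResourceManagers (rm1 and rm2) configured in yarn-site.xml:

yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2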

 

5. spark installation

  • Download spark
wget http://mirrors.shu.edu.cn/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
  • decompression
tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz

 

 

  • Enter the configuration directory of spark
cd spark-2.4.0-bin-hadoop2.7/conf
  • Copy the configuration file spark-env.sh.template
cp spark-env.sh.template spark-env.sh

 

 

Edit spark-env.sh

vi spark-env.sh
  • Add the following content
export JAVA_HOME=/usr/local/java1.8/jdk1.8.0_201
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.7.7/etc/hadoop
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=master:2181,slave1:2181,slave2:2181 -Dspark.deploy.zookeeper.dir=/spark"
export SPARK_WORKER_MEMORY=300m
export SPARK_WORKER_CORES=1

 

 

Use the same JAVA_HOME and HADOOP_HOME values as in your system environment variables. SPARK_WORKER_MEMORY is the amount of memory each Spark worker may use, and SPARK_WORKER_CORES is the number of CPU cores it may use.

The steps above must be performed identically on all three virtual machines.
  • Configure system environment variables
 vi /etc/profile
  • Add the following content
export SPARK_HOME=/usr/local/spark/spark-2.4.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin

 

  • Copy the slaves.template file
cp slaves.template slaves
  • Make environment variables effective
source  /etc/profile
  • Edit slaves
vi slaves
  • Add the following content
master
slave1
slave2

 

 

  • Finally, we start Spark. Note that even though Spark's environment variables are configured, Spark's start-all.sh has the same name as Hadoop's start-all.sh, so we must run it from Spark's sbin directory.
  • Enter the startup directory
cd spark-2.4.0-bin-hadoop2.7/sbin
  • Execution start
./start-all.sh 

 

  • After execution, use jps to view the status of three nodes
  • master:

 

  • slave1:

 

 

  • slave2:

 

 

Note that all three nodes run a Spark Worker process, while the Master process runs only on master.
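Because spark-env.sh set spark.deploy.recoveryMode=ZOOKEEPER, a standby Master could additionally be started on another node so the cluster survives the loss of master. A minimal sketch, not part of the original setup:

# run on slave1; ZooKeeper will treat it as a standby Master
/usr/local/spark/spark-2.4.0-bin-hadoop2.7/sbin/start-master.sh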

Visit master:8080

 

 

Now we have a formal spark environment.

6. Try using

Since we have configured the environment variables, you can type spark-shell to start it directly.

 spark-shell 

 

Now we are in the spark-shell.

Then enter the following code:

val lise = List(1,2,3,4,5)
val data = sc.parallelize(lise)
data.foreach(println)

 


 

Or we can use PySpark, Spark's Python shell

pyspark

 

 

Here you can view the SparkContext.
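Besides the interactive shells, jobs can be submitted to the cluster with spark-submit. A minimal sketch using the SparkPi example bundled with the distribution; the exact jar file name below is an assumption for Spark 2.4.0 built against Scala 2.11:

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://master:7077 \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 100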


Posted by knelson on Tue, 04 Feb 2020 23:47:57 -0800