How to quickly build a Spark distributed architecture for big data

Keywords: Hadoop Spark Zookeeper xml

Build our Spark platform from scratch

1. Preparing the CentOS environment

To build a real cluster environment with a highly available architecture, we need at least three machines as cluster nodes. I bought three Alibaba Cloud servers to serve as our cluster nodes.

 

As the names imply, the master node coordinates the cluster and the slave nodes do the work assigned to them. In practice the distinction is not so sharp in our cluster, because every node ends up "working hard", but it is still worth keeping the concept clear while building the cluster. The rest of this guide refers to the nodes by the hostnames master, slave1 and slave2.
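Because the nodes are addressed by hostname throughout, every machine must be able to resolve master, slave1 and slave2. A minimal sketch for appending the mappings to /etc/hosts; the private IP addresses below are placeholders (assumptions), not real addresses:

# run on every node; replace the IPs with your own private addresses
cat >> /etc/hosts <<'EOF'
192.168.0.10 master
192.168.0.11 slave1
192.168.0.12 slave2
EOF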

2. Download jdk

  • 1. Download jdk1.8 tar.gz package
wget https://download.oracle.com/otn-pub/java/jdk/8u201-b09/42970487e3af4f5aa5bca3f542482c60/jdk-8u201-linux-x64.tar.gz
  • 2. Decompression
tar -zxvf jdk-8u201-linux-x64.tar.gz

After decompression, you will get the jdk1.8.0_201 directory

 

  • 3. Configure environment variables

Modify profile

vi /etc/profile

Add as follows

export JAVA_HOME=/usr/local/java1.8/jdk1.8.0_201
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

 

Source the file to make it take effect

source /etc/profile

Check whether it is effective

java -version

 

If the Java version information shown in the figure appears, the installation succeeded.

The steps above must be performed identically on all three virtual machines; a sketch for replicating them from master follows.
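To avoid repeating the installation by hand, the JDK tarball and the /etc/profile changes can be pushed from master to the other nodes. A minimal sketch, assuming passwordless SSH as root and that the tarball sits in the current directory (both assumptions); the Hadoop and Spark start scripts used later also rely on passwordless SSH:

# set up passwordless SSH once, from master to every node
ssh-keygen -t rsa
for h in master slave1 slave2; do ssh-copy-id root@$h; done

# copy the JDK and the profile changes to the slaves, then unpack remotely
for h in slave1 slave2; do
  ssh root@$h "mkdir -p /usr/local/java1.8"
  scp jdk-8u201-linux-x64.tar.gz root@$h:/usr/local/java1.8/
  scp /etc/profile root@$h:/etc/profile
  ssh root@$h "cd /usr/local/java1.8 && tar -zxf jdk-8u201-linux-x64.tar.gz"
done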

3. Install zookeeper

  • Download zookeeper package
wget https://mirrors.tuna.tsinghua.edu.cn/apache/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz

 

  • decompression
tar -zxvf zookeeper-3.4.13.tar.gz

 

  • Enter the zookeeper configuration directory
cd zookeeper-3.4.13/conf
  • Copy profile template
cp zoo_sample.cfg zoo.cfg

 

  • Modify the content of zoo.cfg after copying
dataDir=/home/hadoop/data/zkdata
dataLogDir=/home/hadoop/log/zklog
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888

 

 

  • Configure environment variables
export ZOOKEEPER_HOME=/usr/local/zookeeper/zookeeper-3.4.13
export PATH=$PATH:$ZOOKEEPER_HOME/bin

 

  • Make environment variables effective
source /etc/profile
  • Note this line in the configuration file above; it is the data directory we need to create
dataDir=/home/hadoop/data/zkdata
  • We create the directory manually and enter it
mkdir -p /home/hadoop/data/zkdata
cd /home/hadoop/data/zkdata/
echo 3 > myid

 

  • You need to pay special attention to this
echo 1 > myid
  • The value must match the server.N entries in this configuration: on master we echo 1, on slave1 echo 2, and on slave2 echo 3 (a per-host sketch follows after the configuration below)
server.1=master:2888:3888
server.2=slave1:2888:3888
server.3=slave2:2888:3888
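A minimal sketch for writing the matching myid on every host from master, assuming passwordless SSH and the same zkdata path on every node:

# server.N in zoo.cfg must match the myid on that host: master=1, slave1=2, slave2=3
i=1
for h in master slave1 slave2; do
  ssh root@$h "mkdir -p /home/hadoop/data/zkdata && echo $i > /home/hadoop/data/zkdata/myid"
  i=$((i+1))
done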

 

 

  • Start test after configuration
zkServer.sh start

 

 

  • Check whether the startup is successful after startup
zkServer.sh status

 

 

All three virtual machines must perform the steps above; the only difference between them is the value written by echo.
  • View status after starting in master

 

  • View status after starting in slave1

 

The Mode field differs between the nodes (one leader, the rest followers); this is ZooKeeper's election mechanism at work, which will be explained separately later. At this point the ZooKeeper cluster is up.
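A quick way to confirm the leader/follower roles from a single terminal, assuming passwordless SSH and that the ZooKeeper environment variables are set on every node:

for h in master slave1 slave2; do
  echo "== $h =="
  ssh root@$h "source /etc/profile && zkServer.sh status"
done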

4. Install hadoop

  • 1. Download hadoop-2.7.7.tar.gz through wget
wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
  • 2. Decompress after downloading

Extracting it produces a hadoop-2.7.7 directory

tar -zxvf hadoop-2.7.7.tar.gz

 

 

 

  • 3. Configure hadoop environment variables

Modify profile

vi /etc/profile
  • Add hadoop environment variable
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

 

  • Make environment variables effective
source /etc/profile
  • After configuration, check whether it is effective
 hadoop version

 

 

  • Enter hadoop-2.7.7/etc/hadoop
  • Edit core-site.xml
vi core-site.xml 
  • Add configuration
<configuration>
    <!-- Set the nameservice of HDFS to myha01 -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://myha01/</value>
    </property>
    <!-- Specify the Hadoop temporary directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/data/hadoopdata/</value>
    </property>
    <!-- Specify the ZooKeeper addresses -->
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>master:2181,slave1:2181,slave2:2181</value>
    </property>
    <!-- Timeout for Hadoop connections to ZooKeeper -->
    <property>
        <name>ha.zookeeper.session-timeout.ms</name>
        <value>1000</value>
        <description>ms</description>
    </property>
</configuration>

 

  • Copy mapred-site.xml.template
cp mapred-site.xml.template mapred-site.xml

 

 

  • Edit mapred-site.xml
vi mapred-site.xml
  • Add the following
<configuration>
    <!-- Specify that the MapReduce framework runs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- Specify the MapReduce JobHistory address -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
    </property>
    <!-- The JobHistory server's web address -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
    </property>
</configuration>
  • Edit hdfs-site.xml
vi hdfs-site.xml 
  • Add the following
<configuration>
    <!-- Specify the number of replicas -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <!-- Configure the data storage directories of the NameNode and DataNode -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/data/hadoopdata/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/data/hadoopdata/dfs/data</value>
    </property>
    <!-- Enable webhdfs -->
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <!-- Set the nameservice of HDFS to myha01; it must match core-site.xml.
         dfs.ha.namenodes.[nameservice id] sets a unique identifier for each NameNode in the nameservice.
         Configure a comma-separated list of NameNode IDs so that DataNodes can identify all NameNodes.
         For example, use "myha01" as the nameservice ID and "nn1" and "nn2" as the NameNode identifiers. -->
    <property>
        <name>dfs.nameservices</name>
        <value>myha01</value>
    </property>
    <!-- myha01 has two NameNodes: nn1 and nn2 -->
    <property>
        <name>dfs.ha.namenodes.myha01</name>
        <value>nn1,nn2</value>
    </property>
    <!-- RPC address of nn1 -->
    <property>
        <name>dfs.namenode.rpc-address.myha01.nn1</name>
        <value>master:9000</value>
    </property>

 

 

  • Edit yarn-site.xml
vi yarn-site.xml 
  • Add the following
<configuration>
    <!-- Enable ResourceManager high availability -->
    <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
    </property>
    <!-- Specify the RM cluster id -->
    <property>
        <name>yarn.resourcemanager.cluster-id</name>
        <value>yrc</value>
    </property>
    <!-- Specify the RM names -->
    <property>
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm1,rm2</value>
    </property>
    <!-- Specify the address of each RM -->
    <property>
        <name>yarn.resourcemanager.hostname.rm1</name>
        <value>slave1</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname.rm2</name>
        <value>slave2</value>
    </property>
    <!-- Specify the ZooKeeper cluster address -->
    <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>master:2181,slave1:2181,slave2:2181</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>86400</value>
    </property>
    <!-- Enable automatic recovery -->
    <property>
        <name>yarn.resourcemanager.recovery.enabled</name>
        <value>true</value>
    </property>
    <!-- Store the ResourceManager state information in the ZooKeeper cluster -->
    <property>
        <name>yarn.resourcemanager.store.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
    </property>
</configuration>

 

 

  • Finally, edit the slaves file
master
slave1
slave2

 

 

The steps above must be performed identically on all three virtual machines; a sketch for distributing the configuration follows.
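Rather than editing the XML files three times, the configuration directory can be pushed from master to the slaves. A minimal sketch, assuming passwordless SSH and the same /usr/local/hadoop/hadoop-2.7.7 layout on every node:

for h in slave1 slave2; do
  scp -r /usr/local/hadoop/hadoop-2.7.7/etc/hadoop/ root@$h:/usr/local/hadoop/hadoop-2.7.7/etc/
  scp /etc/profile root@$h:/etc/profile
done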
  • Then you can start hadoop
  • First, start the journal node on three nodes. Remember to operate on all three nodes
hadoop-daemon.sh start journalnode
  • After the operation completes, use the jps command to check; you should see the following processes

 

 

 

  • QuorumPeerMain is the ZooKeeper process, and JournalNode is the process we just started
  • Then format the namenode of the master node
hadoop namenode -format

 

 

  • Pay attention to the red box
  • After formatting, check the contents of the /home/hadoop/data/hadoopdata directory

 

 

  • Copy the contents of this directory to slave1. slave1 is our standby node; we need it for the high-availability setup, so that when the master goes down, slave1 can take over its work.
cd ..
ssh root@slave1 "mkdir -p /home/hadoop/data"
scp -r hadoopdata/ root@slave1:/home/hadoop/data/

 

 

  • This keeps the metadata of the primary and standby NameNodes consistent
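As an alternative to copying the directory by hand (a hedged sketch, not what was done above), the standby NameNode can also pull the metadata itself once the active NameNode is up:

# run on slave1 after the NameNode on master has been formatted and started
hdfs namenode -bootstrapStandby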

Then you can start hadoop

  • Start HDFS at the master node first
start-dfs.sh 

 

 

 

  • Then start YARN with start-yarn.sh. Note that start-yarn.sh needs to be run on slave2, one of the ResourceManager nodes (see the note after the command)
start-yarn.sh 
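Note that start-yarn.sh only starts the ResourceManager on the node where it is run, plus the NodeManagers. In this HA setup the other ResourceManager (rm1 on slave1, per yarn-site.xml) may need to be brought up manually; a minimal sketch:

# run on slave1
yarn-daemon.sh start resourcemanager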

 

 

  • View three hosts with jps

master

 

 

slave1

 

 

slave2

 

 

  • Note that both master and slave1 run a NameNode process. In fact, only one of them is in the active state and the other is in standby. How can we confirm this? We can enter master:50070 in the browser to access it

 

 

  • Enter slave1:50070 in the browser to access it

 

  • Another way is to view the two nodes we have configured
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
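The same kind of check works for the two ResourceManagers (rm1 and rm2) configured in yarn-site.xml:

yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2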

 

5. spark installation

  • Download spark
wget http://mirrors.shu.edu.cn/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
  • decompression
tar -zxvf spark-2.4.0-bin-hadoop2.7.tgz

 

 

  • Enter the configuration directory of spark
cd spark-2.4.0-bin-hadoop2.7/conf
  • Copy the configuration file spark-env.sh.template
cp spark-env.sh.template spark-env.sh

 

 

Edit spark-env.sh

vi spark-env.sh
  • Add the following content
export JAVA_HOME=/usr/local/java1.8/jdk1.8.0_201
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.7.7/etc/hadoop
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=master:2181,slave1:2181,slave2:2181 -Dspark.deploy.zookeeper.dir=/spark"
export SPARK_WORKER_MEMORY=300m
export SPARK_WORKER_CORES=1

 

 

Use the same JAVA_HOME and HADOOP_HOME values as in your system environment variables. SPARK_WORKER_MEMORY is the amount of memory each Spark worker may use, and SPARK_WORKER_CORES is the number of CPU cores it may use.

The steps above must be performed identically on all three virtual machines.
  • Configure system environment variables
 vi /etc/profile
  • Add the following content
export SPARK_HOME=/usr/local/spark/spark-2.4.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin

 

  • Copy the slaves.template file
cp slaves.template slaves
  • Make environment variables effective
source  /etc/profile
  • Edit slaves
vi slaves
  • Add the following content
master
slave1
slave2

 

 

  • Finally, we start Spark. Note that even though Spark's environment variables are configured, Spark's start-all.sh has the same name as Hadoop's start-all.sh, so we must run it from Spark's sbin directory.
  • Enter the startup directory
cd spark-2.4.0-bin-hadoop2.7/sbin
  • Execution start
./start-all.sh 

 

  • After execution, use jps to view the status of three nodes
  • master:

 

  • slave1:

 

 

  • slave2:

 

 

Note that all three nodes run a Spark Worker process, while the Master process runs only on master.
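Because spark-env.sh set spark.deploy.recoveryMode=ZOOKEEPER, a standby Master could additionally be started on another node so the cluster survives the loss of master. A minimal sketch, not part of the original setup:

# run on slave1; ZooKeeper will treat it as a standby Master
/usr/local/spark/spark-2.4.0-bin-hadoop2.7/sbin/start-master.sh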

Visit master:8080

 

 

Now we have a formal spark environment.

6. Try using

Since we have configured the environment variables, you can type spark-shell to start it directly.

 spark-shell 

 

Now we are in the spark-shell.

Then enter the following code:

val lise = List(1,2,3,4,5)
val data = sc.parallelize(lise)
data.foreach(println)

 


 

Or we can use PySpark, Spark's Python shell

pyspark

 

 

Here you can view the SparkContext.
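Besides the interactive shells, jobs can be submitted to the cluster with spark-submit. A minimal sketch using the SparkPi example bundled with the distribution; the exact jar file name below is an assumption for Spark 2.4.0 built against Scala 2.11:

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://master:7077 \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 100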


Posted by knelson on Tue, 04 Feb 2020 23:47:57 -0800