I. Kafka configuration file (server.properties)
# Unique ID of the broker; must not be duplicated across the cluster
broker.id=0
# Port used to listen for connections; producers and consumers connect here
port=9092
# Number of threads handling network requests
num.network.threads=3
# Number of threads handling disk I/O
num.io.threads=8
# Send buffer size of the socket
socket.send.buffer.bytes=102400
# Receive buffer size of the socket
socket.receive.buffer.bytes=102400
# Maximum size of a socket request
socket.request.max.bytes=104857600
# Storage path of the Kafka log (data) files
log.dirs=/export/servers/logs/kafka
# Default number of partitions per topic on the current broker
num.partitions=2
# Number of threads per data directory used for log recovery and cleanup
num.recovery.threads.per.data.dir=1
# Maximum time a segment file is retained; expired segments are deleted
log.retention.hours=168
# Maximum time before rolling over to a new segment file
log.roll.hours=168
# Size of each segment file in the log; defaults to 1G
log.segment.bytes=1073741824
# Interval at which retention is checked
log.retention.check.interval.ms=300000
# Whether log cleaning is enabled
log.cleaner.enable=true
# ZooKeeper ensemble the broker uses to store metadata
zookeeper.connect=zk01:2181,zk02:2181,zk03:2181
# ZooKeeper connection timeout
zookeeper.connection.timeout.ms=6000
# Number of messages accumulated in a partition buffer before a flush to disk is triggered
log.flush.interval.messages=10000
# Maximum time messages stay in the buffer before a flush to disk is triggered
log.flush.interval.ms=3000
# Deleting a topic requires delete.topic.enable=true in server.properties; otherwise the topic is only marked for deletion
delete.topic.enable=true
# host.name must be set to this machine's IP (important). If it is not changed, the client will throw
# a "Producer connection to localhost:9092 unsuccessful" error.
host.name=kafka01
advertised.host.name=192.168.239.128
Create a one-click start script for the Kafka cluster
Configure KAFKA_HOME
# set KAFKA_HOME
export KAFKA_HOME=/export/app/kafka_2.11-1.0.0
export PATH=$PATH:$KAFKA_HOME/bin
Create a one-click startup script file
mkdir -p /export/app/onkey/kafka
Create three files: the slave host list and the start/stop scripts
vi slave
node01
node02
node03

vi startkafka.sh
cat /export/app/onkey/kafka/slave | while read line
do
{
 echo $line
 ssh $line "source /etc/profile;nohup kafka-server-start.sh /export/servers/kafka/config/server.properties >/dev/null 2>&1 &"
}&
wait
done

vi stopkafka.sh
cat /export/app/onkey/kafka/slave | while read line
do
{
 echo $line
 ssh $line "source /etc/profile;jps | grep Kafka | awk '{print \$1}' | xargs kill -s 9"
}&
wait
done
Grant execute permission
chmod 777 startkafka.sh stopkafka.sh
Verify installation
Verification involves two steps.
The first step is to run the following command on all three machines to check whether the Kafka and ZooKeeper service processes exist.
View the Kafka and zookeeper service processes
ps -ef | grep kafka
The second step is to create a message topic and verify that messages can be produced and consumed properly through the console producer and console consumer.
Create message topics
bin/kafka-topics.sh --create \
--replication-factor 3 \
--partitions 3 \
--topic user-behavior-topic \
--zookeeper 192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181
Run the following command to start the console producer.
Start Console Producer
bin/kafka-console-producer.sh --broker-list 192.168.1.1:9092 --topic user-behavior-topic
Start the console consumer on another machine.
Start Console Consumer
./kafka-console-consumer.sh --zookeeper 192.168.1.2:2181 --topic user-behavior-topic --from-beginning
If a message typed into the producer console appears in the consumer console, the installation is successful.
Case Introduction and Programming Implementation
1. Case introduction
In this case, we assume that a forum needs to calculate page popularity in near real time based on clicks, stay time, and whether users like each page, and then dynamically update the site's hot-topics module to display links to the hottest topics.
2. Case Study
To explain how topic popularity is calculated, we first abstract the behavior data of a user visiting the forum.
First, we use a vector to describe the user's behavior on a web page, namely the number of clicks on the page, the stay time, and whether the user likes it. It can be expressed as follows:
(page001.html, 1, 0.5, 1)
The first item of the vector is the ID of the web page; the second is the number of clicks from entering the site to leaving the page; the third is the stay time in minutes; the fourth is the like flag, where 1 means like, -1 means dislike, and 0 means neutral.
Secondly, we set a weight for each behavior's contribution to topic popularity. In this case we assume the weight of clicks is 0.8, because a user may browse the topic again simply because there is no better topic; the weight of stay time is 0.8, because a user may keep multiple tabs open while really caring about only one topic; and the like flag has a weight of 1, because it generally indicates genuine interest in the page's topic.
Finally, we define the following formula to calculate the contribution of a behavior data to the heat of the web page.
f(x,y,z)=0.8x+0.8y+z
For the above behavioral data (page001.html, 1, 0.5, 1), the formula can be used to obtain:
H(page001) = f(1, 0.5, 1) = 0.8*1 + 0.8*0.5 + 1*1 = 2.2
Note that in this process we ignore the user's identity; that is, we do not care who the user is, only about their contribution to page popularity.
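As an illustration only, the formula above can be expressed as a small Scala function. The Behavior case class and heat helper below are hypothetical names introduced here and are not part of the case's source code.
// Hypothetical helper illustrating the popularity formula f(x, y, z) = 0.8x + 0.8y + z
case class Behavior(pageId: String, clicks: Int, stayMinutes: Double, likeFlag: Int)

object HeatFormula {
  def heat(b: Behavior): Double =
    0.8 * b.clicks + 0.8 * b.stayMinutes + 1.0 * b.likeFlag

  def main(args: Array[String]): Unit = {
    // Matches the worked example above: 0.8*1 + 0.8*0.5 + 1*1 = 2.2
    println(heat(Behavior("page001.html", 1, 0.5, 1)))
  }
}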
3. Production Behavior Data Messages
In this case, we will use a program to simulate user behavior, which randomly pushes 0 to 50 behavior data messages to the user-behavior-topic topic every five seconds. Obviously, this program plays the role of message producer. In practical applications, this function is usually provided by a system. To simplify message processing, we define the message format as follows:
Page ID|Number of clicks|Stay time (minutes)|Like flag
We also assume that the site has only 100 pages. The following is the Scala implementation of this class.
Listing 14. UserBehaviorMsgProducer class source code
import scala.util.Random
import java.util.Properties
import kafka.producer.KeyedMessage
import kafka.producer.ProducerConfig
import kafka.producer.Producer

class UserBehaviorMsgProducer(brokers: String, topic: String) extends Runnable {
  private val brokerList = brokers
  private val targetTopic = topic
  private val props = new Properties()
  props.put("metadata.broker.list", this.brokerList)
  props.put("serializer.class", "kafka.serializer.StringEncoder")
  props.put("producer.type", "async")
  private val config = new ProducerConfig(this.props)
  private val producer = new Producer[String, String](this.config)

  private val PAGE_NUM = 100
  private val MAX_MSG_NUM = 3
  private val MAX_CLICK_TIME = 5
  private val MAX_STAY_TIME = 10
  //Like: 1; Dislike: -1; No feeling: 0
  private val LIKE_OR_NOT = Array[Int](1, 0, -1)

  def run(): Unit = {
    val rand = new Random()
    while (true) {
      //how many user behavior messages will be produced
      val msgNum = rand.nextInt(MAX_MSG_NUM) + 1
      try {
        //generate a message with a format like page1|2|7.123|1
        for (i <- 0 to msgNum) {
          var msg = new StringBuilder()
          msg.append("page" + (rand.nextInt(PAGE_NUM) + 1))
          msg.append("|")
          msg.append(rand.nextInt(MAX_CLICK_TIME) + 1)
          msg.append("|")
          //stay time in minutes (uses MAX_STAY_TIME, which was declared but unused in the original listing)
          msg.append(rand.nextInt(MAX_STAY_TIME) + rand.nextFloat())
          msg.append("|")
          msg.append(LIKE_OR_NOT(rand.nextInt(3)))
          println(msg.toString())
          //send the generated message to the broker
          sendMessage(msg.toString())
        }
        println("%d user behavior messages produced.".format(msgNum + 1))
      } catch {
        case e: Exception => println(e)
      }
      try {
        //sleep for 5 seconds after sending a micro batch of messages
        Thread.sleep(5000)
      } catch {
        case e: Exception => println(e)
      }
    }
  }

  def sendMessage(message: String) = {
    try {
      val data = new KeyedMessage[String, String](this.topic, message)
      producer.send(data)
    } catch {
      case e: Exception => println(e)
    }
  }
}

object UserBehaviorMsgProducerClient {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("Usage:UserBehaviorMsgProducerClient 192.168.1.1:9092 user-behavior-topic")
      System.exit(1)
    }
    //start the message producer thread
    new Thread(new UserBehaviorMsgProducer(args(0), args(1))).start()
  }
}
4. Writing the Spark Streaming Program to Consume Messages
After clarifying the problem to be solved, you can start coding and implementation. For the problems in this case, the basic steps of implementation are as follows:
- Build the Spark StreamingContext instance and enable checkpointing, because we need to use the updateStateByKey primitive to accumulate and update the popularity values of web topics.
- Use the KafkaUtils.createStream method provided by Spark to consume the message topic; this method returns a ReceiverInputDStream instance.
- For each message, calculate the topic's popularity value using the formula above.
- Define an anonymous function that adds the previous result to the newly calculated value to obtain the latest popularity value.
- Call the updateStateByKey primitive with the anonymous function defined above to update the page popularity values.
- Finally, after obtaining the latest results, sort them and print the 10 pages with the highest popularity values.
The source code is as follows.
Listing 15. WebPagePopularityValueCalculator class source code
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.Duration

object WebPagePopularityValueCalculator {
  private val checkpointDir = "popularity-data-checkpoint"
  private val msgConsumerGroup = "user-behavior-topic-message-consumer-group"

  def main(args: Array[String]) {
    if (args.length < 2) {
      println("Usage:WebPagePopularityValueCalculator zkserver1:2181,zkserver2:2181,zkserver3:2181 consumeMsgDataTimeInterval(secs)")
      System.exit(1)
    }
    val Array(zkServers, processingInterval) = args
    val conf = new SparkConf().setAppName("Web Page Popularity Value Calculator")
    val ssc = new StreamingContext(conf, Seconds(processingInterval.toInt))
    //using updateStateByKey requires enabling checkpointing
    ssc.checkpoint(checkpointDir)
    val kafkaStream = KafkaUtils.createStream(
      //Spark streaming context
      ssc,
      //zookeeper quorum. e.g zkserver1:2181,zkserver2:2181,...
      zkServers,
      //kafka message consumer group ID
      msgConsumerGroup,
      //Map of (topic_name -> numPartitions) to consume. Each partition is consumed in its own thread
      Map("user-behavior-topic" -> 3))
    val msgDataRDD = kafkaStream.map(_._2)
    //for debug use only
    //println("Coming data in this interval...")
    //msgDataRDD.print()
    // e.g page37|5|1.5119122|-1
    val popularityData = msgDataRDD.map { msgLine =>
      {
        val dataArr: Array[String] = msgLine.split("\\|")
        val pageID = dataArr(0)
        //calculate the popularity value
        val popValue: Double = dataArr(1).toFloat * 0.8 + dataArr(2).toFloat * 0.8 + dataArr(3).toFloat * 1
        (pageID, popValue)
      }
    }
    //sum the previous popularity value and current value
    val updatePopularityValue = (iterator: Iterator[(String, Seq[Double], Option[Double])]) => {
      iterator.flatMap(t => {
        val newValue: Double = t._2.sum
        val stateValue: Double = t._3.getOrElse(0)
        Some(newValue + stateValue)
      }.map(sumedValue => (t._1, sumedValue)))
    }
    val initialRDD = ssc.sparkContext.parallelize(List(("page1", 0.00)))
    val stateDstream = popularityData.updateStateByKey[Double](updatePopularityValue,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)
    //set the checkpoint interval to avoid overly frequent data checkpointing, which may
    //significantly reduce operation throughput
    stateDstream.checkpoint(Duration(8 * processingInterval.toInt * 1000))
    //after calculation, we need to sort the result and only show the top 10 hot pages
    stateDstream.foreachRDD { rdd =>
      {
        val sortedData = rdd.map { case (k, v) => (v, k) }.sortByKey(false)
        val topKData = sortedData.take(10).map { case (v, k) => (k, v) }
        topKData.foreach(x => {
          println(x)
        })
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
Deployment and testing
Readers can refer to the following steps to deploy and test the sample program provided in this case.
The first step is to start the behavior message producer program, which can be launched directly in the Scala IDE. Two startup parameters are required: the first is the Kafka broker address, and the second is the name of the target message topic.
Figure 1. UserBehaviorMsgProducer class startup parameters
After startup, you can see behavior message data being generated in the console.
Figure 2. Preview of generated behavior message data
The second step is to start the Spark Streaming program that consumes the behavior messages; it needs to be run in the Spark cluster environment. The command is as follows:
Listing 16. WebPagePopularityValueCalculator class startup command
bin/spark-submit \
--jars $SPARK_HOME/lib/spark-streaming-kafka_2.10-1.3.1.jar,$SPARK_HOME/lib/spark-streaming-kafka-assembly_2.10-1.3.1.jar,$SPARK_HOME/lib/kafka_2.10-0.8.2.1.jar,$SPARK_HOME/lib/kafka-clients-0.8.2.1.jar \
--class com.ibm.spark.exercise.streaming.WebPagePopularityValueCalculator \
--master spark://<spark_master_ip>:7077 \
--num-executors 4 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 2 \
/home/fams/sparkexercise.jar \
192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181 2
Because the program directly or indirectly calls Kafka's API, and also calls Spark Streaming's Kafka integration API (KafkaUtils.createStream), the jar packages listed in the startup command need to be uploaded to every machine in the Spark cluster in advance (in this case, to the lib directory of the Spark installation, i.e. $SPARK_HOME/lib) and referenced in the startup command.
After startup, you can see messages printed in the command-line console, namely the 10 pages with the highest calculated popularity values.
Figure 3. Preview of the current ranking of topics on the Web
We can also go to Spark Web Console to see the current status of the Spark program. The default address is http://spark_master_ip:8080.
Matters needing attention
To build an efficient and robust streaming computation system with Spark Streaming, we also need to pay attention to the following aspects.
- Set the batch processing interval reasonably; the processing time of each batch of data must be less than the batch interval, so that one batch has been fully processed before the next batch arrives. Obviously this depends on the computing power of your Spark cluster and the amount of input data.
- Improve the data-ingestion capability as much as possible. When Spark Streaming integrates with external systems such as Kafka and Flume, we can run multiple ReceiverInputDStream instances to avoid a bottleneck in receiving data (see the sketch after this list).
- Although in this case we simply print out the (near) real-time calculation results, in practice these results are often saved to a database or HDFS, or sent back to Kafka, so that other systems can use the data for further business processing.
- Because stream computing has strict real-time requirements, any system pause caused by a JVM full GC is unacceptable. Besides using memory sensibly in the program and regularly cleaning up unneeded cached data, the CMS (Concurrent Mark-Sweep) collector is the GC method recommended by Spark; it effectively keeps GC-induced pauses at a very low level. CMS-related GC parameters can be added through the --driver-java-options option of the spark-submit command.
- Spark's official guide on integrating Kafka with Spark Streaming mentions two approaches. The first is the Receiver-Based Approach, which receives message data by implementing the Kafka consumer inside a Receiver. The second is the Direct Approach, which does not use a Receiver; instead it periodically queries the latest offsets of the Kafka message partitions and then defines the offset range of the messages to be processed in each batch. This article adopts the first approach, because the second was still in the experimental stage.
- If we use the Receiver-Based Approach to integrate Kafka and Spark Streaming, we need to consider data loss caused by Driver or Worker node failures. With the default configuration, data can be lost unless the Write-Ahead Log (WAL) feature is enabled; in that case the message data received from Kafka is synchronously written to the WAL and saved on a reliable distributed file system such as HDFS. This feature is turned on by setting the spark.streaming.receiver.writeAheadLog.enable configuration item to true in the Spark configuration file (conf/spark-defaults.conf). Of course, with the WAL enabled the throughput of a single Receiver drops, so we may need to run multiple Receivers in parallel to compensate (see the sketch after this list).
- Because updateStateByKey requires checkpointing to be enabled, overly frequent checkpoints increase processing time and reduce throughput. By default, the checkpoint interval is the larger of the streaming batch interval and 10 seconds; the officially recommended interval is 5-10 times the batch interval. It can be set via dstream.checkpoint(checkpointInterval), where the parameter is wrapped in the case class Duration and expressed in milliseconds.
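To make the two receiver-related points above concrete, here is a minimal Scala sketch, assuming the ZooKeeper quorum and consumer group used in Listing 15: it enables the WAL, starts several receiver-based streams with an explicit storage level, and unions them into a single DStream. It is an illustration only, not part of the case's source code.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object MultiReceiverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Multi Receiver Sketch")
      // enable the write-ahead log so received data survives Driver/Worker failures
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(2))
    // values assumed from the earlier examples
    val zkServers = "zkserver1:2181,zkserver2:2181,zkserver3:2181"
    val msgConsumerGroup = "user-behavior-topic-message-consumer-group"

    // run several receivers in parallel to avoid a single-receiver ingestion bottleneck;
    // with the WAL enabled, use a serialized, disk-backed storage level
    val numReceivers = 3
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkServers, msgConsumerGroup,
        Map("user-behavior-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
    }
    // union the per-receiver streams into one DStream and extract the message values
    val unifiedStream = ssc.union(kafkaStreams).map(_._2)
    unifiedStream.print()

    ssc.start()
    ssc.awaitTermination()
  }
}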
The difference between createStream and createDirectStream when Spark reads Kafka data
1. KafkaUtils.createStream
The signature is KafkaUtils.createStream(ssc, [zk quorum], [consumer group id], [per-topic number of partitions]).
Receivers are used to receive the data, based on Kafka's high-level consumer API. The data received by all receivers is stored in Spark executors, and Spark Streaming then launches jobs to process it. Under the default configuration this data can be lost on failure; enabling the Write-Ahead Log (WAL) stores the received data on HDFS to prevent this.
A. A receiver is created to pull data from Kafka at regular intervals. The RDD partitions in Spark Streaming and the topic partitions in Kafka are not the same concept, so increasing the number of partitions of a specific topic only increases the number of threads consuming that topic within a single receiver; it does not increase the parallelism with which Spark processes the data.
B. Multiple receivers can be used to create different DStreams for different groups and topics.
C. If the WAL is enabled, the storage level needs to be set accordingly, e.g. KafkaUtils.createStream(..., StorageLevel.MEMORY_AND_DISK_SER).
2. KafkaUtils.createDirectStream
Unlike the receiver-based approach, this method does not use a Receiver to receive data; it periodically queries the latest offset of each Kafka topic+partition and then processes each batch according to that offset range, using Kafka's simple consumer API (a usage sketch follows the list below).
Advantages:
A. Simplified parallelism: there is no need to create multiple Kafka input streams. This method creates as many RDD partitions as there are Kafka partitions and reads from Kafka in parallel.
B. Efficiency: this method does not require a WAL. WAL mode copies the data twice, once by Kafka and once more when writing to the WAL.
C. Exactly-once semantics: the traditional way of reading Kafka data writes offsets to ZooKeeper through Kafka's high-level API, and data can be lost or reprocessed when the offsets in ZooKeeper diverge from those tracked by Spark Streaming. The direct approach achieves exactly-once semantics by using Kafka's low-level API, with offsets tracked only in the Spark Streaming checkpoint. The drawback is that ZooKeeper-based Kafka monitoring tools cannot be used.
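For reference, here is a minimal sketch of the Direct Approach (which this article's case does not use); the broker address and topic name are assumptions carried over from the earlier examples.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Direct Stream Sketch")
    val ssc = new StreamingContext(conf, Seconds(2))
    // Kafka broker list (not ZooKeeper) and the topic set to consume; values assumed from earlier examples
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "192.168.1.1:9092")
    val topics = Set("user-behavior-topic")
    // no receiver: offsets are queried and tracked by Spark Streaming itself
    val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
    // the message value carries the page1|2|7.123|1 formatted behavior data
    directStream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()
  }
}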