Summary of Kafka learning notes

Keywords: PHP kafka Spark Zookeeper Apache

I. The Kafka configuration file server.properties

#Globally unique ID of the broker; must not be duplicated
broker.id=0

#Port the broker listens on; producers and consumers connect to it
port=9092

#Number of threads handling network requests
num.network.threads=3

#Number of threads handling disk I/O
num.io.threads=8

#Send buffer size of the socket
socket.send.buffer.bytes=102400

#Receive buffer size of the socket
socket.receive.buffer.bytes=102400

#Maximum size of a request the socket server will accept
socket.request.max.bytes=104857600

#Path where Kafka stores its log (message) data
log.dirs=/export/servers/logs/kafka

#Default number of partitions per topic on this broker
num.partitions=2

#Number of threads per data directory used for log recovery and cleanup
num.recovery.threads.per.data.dir=1

#Maximum time a segment file is retained; segments older than this are deleted
log.retention.hours=168

#Maximum time before rolling over to a new segment file
log.roll.hours=168

#Size of each segment file in the log; defaults to 1G
log.segment.bytes=1073741824

#Interval at which log segments are checked against the retention policies
log.retention.check.interval.ms=300000

#Whether log cleaning is enabled
log.cleaner.enable=true

#Brokers store their metadata in ZooKeeper; this is the ZooKeeper connection string
zookeeper.connect=zk01:2181,zk02:2181,zk03:2181

#ZooKeeper connection timeout
zookeeper.connection.timeout.ms=6000

#Number of messages accumulated in a partition buffer before a flush to disk is triggered
log.flush.interval.messages=10000

#Maximum time messages stay in the buffer before a flush to disk is triggered
log.flush.interval.ms=3000

#To actually delete a topic, delete.topic.enable=true must be set in server.properties; otherwise the topic is only marked for deletion
delete.topic.enable=true

#host.name must be set to this machine's IP/hostname (important); otherwise clients will throw errors such as "Producer connection to localhost:9092 unsuccessful"
host.name=kafka01

advertised.host.name=192.168.239.128

Create a one-click script to start Kafka

Configure KAFKA_HOME

#set KAFKA_HOME
export KAFKA_HOME=/export/app/kafka_2.11-1.0.0
export PATH=$PATH:$KAFKA_HOME/bin

Create a one-click startup script file

mkdir -p /export/app/onkey/kafka

Create the following three files (a host list and two scripts)

vi slave
  node01
  node02
  node03

vi startkafka.sh
cat /export/app/onkey/kafka/slave | while read line
do
{
    echo $line
    ssh $line "source /etc/profile;nohup kafka-server-start.sh /export/servers/kafka/config/server.properties >/dev/null 2>&1 &"
}&
wait
done 

vi stopkafka.sh
cat /export/app/onkey/kafka/slave | while read line
do
{
    echo $line
    ssh $line "source /etc/profile;jps | grep Kafka | cut -d' ' -f1 | xargs kill -9"
}&
wait
done

Grant execute permission

chmod 777 startkafka.sh stopkafka.sh

Verify installation

There are two verification steps.

The first step is to run the following command on all three machines to check whether the Kafka and ZooKeeper service processes are present.

View the Kafka and ZooKeeper service processes
ps -ef | grep kafka

The second step is to create a message topic and verify that messages can be produced and consumed properly through the console producer and the console consumer.

Create a message topic
bin/kafka-topics.sh --create \
--replication-factor 3 \
--partitions 3 \
--topic user-behavior-topic \
--zookeeper 192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181

Run the following command to start the console producer.

Start Console Producer
bin/kafka-console-producer.sh --broker-list 192.168.1.1:9092 --topic user-behavior-topic

Start the console consumer on another machine.

Start Console Consumer
./kafka-console-consumer.sh --zookeeper 192.168.1.2:2181 --topic user-behavior-topic --from-beginning

If a message typed into the producer console appears in the consumer console, the installation is successful.

 

Case Introduction and Programming Implementation

1. Case introduction

In this case, we assume that a forum needs to calculate page heat in near real time based on users' clicks, stay time, and whether they like a page, and then dynamically update the site's hot-topics module to display links to the hottest topics.

2. Case Study

To explain how topic heat is calculated, we first abstract the behavior data of a user who visits the forum.

First, we use a vector to describe the user's behavior on a web page, namely the number of clicks on the page, the stay time, and whether the user likes it, expressed as follows:

(page001.html, 1, 0.5, 1)

The first item of the vector is the ID of the web page, the second is the number of clicks from entering the site to leaving the page, the third is the stay time in minutes, and the fourth indicates whether the user liked the page: 1 means like, -1 means dislike, 0 means neutral.

Secondly, we assign each behavior a weight that measures its contribution to topic heat. Here we assume the weight of clicks is 0.8, because a user may revisit a topic simply because nothing better is available; the weight of stay time is 0.8, because a user may open multiple tabs at once while really caring about only one of them; and the weight of like/dislike is 1, because it usually indicates genuine interest in the page's topic.

Finally, we define the following formula to calculate how much a piece of behavior data contributes to the heat of a web page.

f(x,y,z)=0.8x+0.8y+z

For the above behavioral data (page001.html, 1, 0.5, 1), the formula can be used to obtain:

H(page001) = f(1, 0.5, 1) = 0.8*1 + 0.8*0.5 + 1*1 = 2.2

Note that in this process we ignore the user's identity; that is to say, we do not care who the user is, only about his or her contribution to page heat.
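
For illustration, the heat function above can be written as a small Scala snippet (the object and method names here are made up for this example, not part of the case code), reproducing the worked calculation for (page001.html, 1, 0.5, 1):

object PopularityFormula {
  // weights assumed in the case analysis above
  val ClickWeight = 0.8
  val StayTimeWeight = 0.8
  val LikeWeight = 1.0

  // behavior record: (pageId, clicks, stayMinutes, likeOrNot)
  def heat(clicks: Int, stayMinutes: Double, likeOrNot: Int): Double =
    ClickWeight * clicks + StayTimeWeight * stayMinutes + LikeWeight * likeOrNot

  def main(args: Array[String]): Unit = {
    // the worked example: H(page001) = 0.8*1 + 0.8*0.5 + 1*1 = 2.2
    println(heat(1, 0.5, 1))
  }
}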

3. Producing Behavior Data Messages

In this case, we use a program to simulate user behavior; every five seconds it randomly pushes 0 to 50 behavior data messages to the user-behavior-topic topic. Obviously, this program plays the role of the message producer; in practical applications this function is usually provided by another system. To simplify message processing, we define the message format as follows:

Page ID | Number of clicks | Stay time (minutes) | Like or not

We also assume that the site has only 100 pages. The Scala source code for the producer class follows.

Listing 14. UserBehaviorMsgProducer class source code
import scala.util.Random
import java.util.Properties
import kafka.producer.KeyedMessage
import kafka.producer.ProducerConfig
import kafka.producer.Producer
 
class UserBehaviorMsgProducer(brokers: String, topic: String) extends Runnable {
  private val brokerList = brokers
  private val targetTopic = topic
  private val props = new Properties()
  props.put("metadata.broker.list", this.brokerList)
  props.put("serializer.class", "kafka.serializer.StringEncoder")
  props.put("producer.type", "async")
  private val config = new ProducerConfig(this.props)
  private val producer = new Producer[String, String](this.config)

  private val PAGE_NUM = 100
  private val MAX_MSG_NUM = 3
  private val MAX_CLICK_TIME = 5
  private val MAX_STAY_TIME = 10
  //Like: 1; Dislike: -1; No feeling: 0
  private val LIKE_OR_NOT = Array[Int](1, 0, -1)

  def run(): Unit = {
    val rand = new Random()
    while (true) {
      //how many user behavior messages will be produced
      val msgNum = rand.nextInt(MAX_MSG_NUM) + 1
      try {
        //generate messages with a format like page1|2|7.123|1
        for (i <- 0 to msgNum) {
          val msg = new StringBuilder()
          msg.append("page" + (rand.nextInt(PAGE_NUM) + 1))
          msg.append("|")
          msg.append(rand.nextInt(MAX_CLICK_TIME) + 1)
          msg.append("|")
          msg.append(rand.nextInt(MAX_STAY_TIME) + rand.nextFloat())
          msg.append("|")
          msg.append(LIKE_OR_NOT(rand.nextInt(3)))
          println(msg.toString())
          //send the generated message to the broker
          sendMessage(msg.toString())
        }
        println("%d user behavior messages produced.".format(msgNum + 1))
      } catch {
        case e: Exception => println(e)
      }
      try {
        //sleep for 5 seconds after sending a micro batch of messages
        Thread.sleep(5000)
      } catch {
        case e: Exception => println(e)
      }
    }
  }

  def sendMessage(message: String) = {
    try {
      val data = new KeyedMessage[String, String](this.targetTopic, message)
      producer.send(data)
    } catch {
      case e: Exception => println(e)
    }
  }
}

object UserBehaviorMsgProducerClient {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("Usage: UserBehaviorMsgProducerClient 192.168.1.1:9092 user-behavior-topic")
      System.exit(1)
    }
    //start the message producer thread
    new Thread(new UserBehaviorMsgProducer(args(0), args(1))).start()
  }
}

4. Writing a Spark Streaming Program to Consume Messages

Having clarified the problem to be solved, we can start coding. For this case, the basic implementation steps are as follows:

  • Build a Spark StreamingContext instance and enable checkpointing, because we need the updateStateByKey primitive to accumulate and update the heat values of the web topics.
  • Use the KafkaUtils.createStream method provided by Spark to consume the message topic; this method returns a ReceiverInputDStream instance.
  • For each message, calculate the topic heat value using the formula given above.
  • Define an anonymous function that adds the previous result to the newly calculated value to obtain the latest heat value.
  • Call the updateStateByKey primitive with the anonymous function defined above to update the page heat values.
  • Finally, after obtaining the latest results, sort them and print the 10 pages with the highest heat values.

The source code is as follows.

Listing 15. WebPagePopularityValueCalculator class source code
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.Duration
 
object WebPagePopularityValueCalculator {
  private val checkpointDir = "popularity-data-checkpoint"
  private val msgConsumerGroup = "user-behavior-topic-message-consumer-group"

  def main(args: Array[String]) {
    if (args.length < 2) {
      println("Usage: WebPagePopularityValueCalculator zkserver1:2181,zkserver2:2181,zkserver3:2181 consumeMsgDataTimeInterval(secs)")
      System.exit(1)
    }
    val Array(zkServers, processingInterval) = args
    val conf = new SparkConf().setAppName("Web Page Popularity Value Calculator")
    val ssc = new StreamingContext(conf, Seconds(processingInterval.toInt))
    //using updateStateByKey requires enabling checkpointing
    ssc.checkpoint(checkpointDir)
    val kafkaStream = KafkaUtils.createStream(
      //Spark streaming context
      ssc,
      //zookeeper quorum, e.g. zkserver1:2181,zkserver2:2181,...
      zkServers,
      //kafka message consumer group ID
      msgConsumerGroup,
      //Map of (topic_name -> numPartitions) to consume. Each partition is consumed in its own thread
      Map("user-behavior-topic" -> 3))
    val msgDataRDD = kafkaStream.map(_._2)
    //for debug use only
    //println("Coming data in this interval...")
    //msgDataRDD.print()
    // e.g. page37|5|1.5119122|-1
    val popularityData = msgDataRDD.map { msgLine =>
      {
        val dataArr: Array[String] = msgLine.split("\\|")
        val pageID = dataArr(0)
        //calculate the popularity value
        val popValue: Double = dataArr(1).toFloat * 0.8 + dataArr(2).toFloat * 0.8 + dataArr(3).toFloat * 1
        (pageID, popValue)
      }
    }
    //sum the previous popularity value and the current value
    val updatePopularityValue = (iterator: Iterator[(String, Seq[Double], Option[Double])]) => {
      iterator.flatMap(t => {
        val newValue: Double = t._2.sum
        val stateValue: Double = t._3.getOrElse(0)
        Some(newValue + stateValue)
      }.map(sumedValue => (t._1, sumedValue)))
    }
    val initialRDD = ssc.sparkContext.parallelize(List(("page1", 0.00)))
    val stateDstream = popularityData.updateStateByKey[Double](updatePopularityValue,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)
    //set the checkpoint interval to avoid overly frequent data checkpointing,
    //which may significantly reduce operation throughput
    stateDstream.checkpoint(Duration(8 * processingInterval.toInt * 1000))
    //after calculation, sort the result and only show the top 10 hot pages
    stateDstream.foreachRDD { rdd =>
      {
        val sortedData = rdd.map { case (k, v) => (v, k) }.sortByKey(false)
        val topKData = sortedData.take(10).map { case (v, k) => (k, v) }
        topKData.foreach(x => {
          println(x)
        })
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}

Deployment and testing

Readers can refer to the following steps to deploy and test the sample program provided in this case.

The first step is to start the behavior message producer program, which can be run directly from the Scala IDE. Two startup parameters need to be supplied: the first is the Kafka broker address, and the second is the name of the target message topic.

Figure 1. UserBehaviorMsgProducer class startup parameters

After startup, you can see behavior message data being generated in the console.

Figure 2. Preview of generated behavior message data

The second step is to start the Spark Streaming program that consumes the behavior messages; it needs to be run in the Spark cluster environment. The command is as follows:

Listing 16. WebPagePopularityValueCalculator class startup command
bin/spark-submit \
--jars $SPARK_HOME/lib/spark-streaming-kafka_2.10-1.3.1.jar,$SPARK_HOME/lib/spark-streaming-kafka-assembly_2.10-1.3.1.jar,$SPARK_HOME/lib/kafka_2.10-0.8.2.1.jar,$SPARK_HOME/lib/kafka-clients-0.8.2.1.jar \
--class com.ibm.spark.exercise.streaming.WebPagePopularityValueCalculator \
--master spark://<spark_master_ip>:7077 \
--num-executors 4 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 2 \
/home/fams/sparkexercise.jar \
192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181 2

Because the program directly or indirectly calls Kafka's APIs, as well as Spark Streaming's Kafka integration API (KafkaUtils.createStream), the jar packages listed in the startup command need to be uploaded to every machine in the Spark cluster in advance (in this case, to the lib directory of the Spark installation, i.e. $SPARK_HOME/lib) and referenced from the startup command.

After startup, you can see messages printed in the command-line console: the 10 pages with the highest calculated popularity values.

Figure 3. Preview of the current ranking of topics on the Web

We can also go to Spark Web Console to see the current status of the Spark program. The default address is http://spark_master_ip:8080.

Points to note

To build an efficient and robust stream data computing system with Spark Streaming, we also need to pay attention to the following aspects.

  • It is necessary to set the data processing interval reasonably: the processing time of each batch must be shorter than the interval, so that one batch has finished before the next batch arrives. This obviously depends on the computing power of your Spark cluster and the volume of input data.
  • The ability to read input data should be improved as much as possible. When Spark Streaming integrates with external systems such as Kafka and Flume, we can start multiple ReceiverInputDStream instances to avoid a bottleneck in receiving data.
  • Although in this case we simply print the (near) real-time results, in practice these results are often saved to a database or HDFS, or sent back to Kafka for other systems to use in further business processing.
  • Because stream computing has strict real-time requirements, any pause caused by a JVM full GC is unacceptable. Besides using memory sensibly in the program and regularly cleaning up cached data that is no longer needed, the CMS (Concurrent Mark and Sweep) collector is the GC recommended by Spark; it effectively keeps GC-induced pauses at a very low level. CMS-related parameters can be added through the --driver-java-options option of the spark-submit command.
  • Spark's official guide on integrating Kafka with Spark Streaming mentions two approaches. The first is the receiver-based approach, which receives message data by implementing a Kafka consumer inside a Receiver. The second is the direct approach, which uses no Receiver but periodically queries the latest offsets of the Kafka message partitions and then defines the offset range of the messages to process in each batch. This article adopts the first approach, because the second was still at the experimental stage at the time of writing.
  • If we use the receiver-based approach to integrate Kafka and Spark Streaming, we need to account for data loss caused by Driver or Worker node failure. Under the default configuration data loss is possible unless the Write Ahead Log (WAL) is enabled; with WAL, message data received from Kafka is synchronously written to the log and stored in a reliable distributed file system such as HDFS. This feature is turned on by setting spark.streaming.receiver.writeAheadLog.enable to true in the Spark configuration file (conf/spark-defaults.conf); see the sketch after this list. Of course, enabling WAL lowers the throughput of a single Receiver, so we may need to run multiple Receivers in parallel to compensate.
  • Because updateStateByKey requires checkpointing to be enabled, and frequent checkpoints increase processing time and reduce throughput, the checkpoint interval deserves attention. By default it is the larger of the streaming program's batch interval and 10 seconds; the officially recommended value is 5 to 10 times the batch interval. It can be set with dstream.checkpoint(checkpointInterval), where the parameter is wrapped in the case class Duration and expressed in milliseconds.
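
The following is a minimal configuration sketch of the WAL switch, the receiver storage level, and an explicit checkpoint interval mentioned above. It is not taken from the case code; the 2-second batch interval, the 8x checkpoint multiplier, and the host names are assumptions, while the property name and API calls are the standard Spark 1.x Streaming/Kafka ones.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReliabilityTuningSketch {
  def main(args: Array[String]): Unit = {
    val batchIntervalSecs = 2 // assumed batch interval
    val conf = new SparkConf()
      .setAppName("Reliability Tuning Sketch")
      // enable the Write Ahead Log so received data survives Driver/Worker failure
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    // CMS GC options would be passed separately via spark-submit,
    // e.g. --driver-java-options "-XX:+UseConcMarkSweepGC"
    val ssc = new StreamingContext(conf, Seconds(batchIntervalSecs))
    ssc.checkpoint("popularity-data-checkpoint")

    // with WAL enabled, a serialized, non-replicated storage level is sufficient
    val stream = KafkaUtils.createStream(
      ssc,
      "zk01:2181,zk02:2181,zk03:2181",
      "user-behavior-topic-message-consumer-group",
      Map("user-behavior-topic" -> 3),
      StorageLevel.MEMORY_AND_DISK_SER)

    val counts = stream.map(_._2)
      .map(line => (line.split("\\|")(0), 1L))
      .updateStateByKey[Long]((vals: Seq[Long], state: Option[Long]) => Some(vals.sum + state.getOrElse(0L)))
    // checkpoint at 5-10x the batch interval, as officially recommended (8x assumed here)
    counts.checkpoint(Duration(8 * batchIntervalSecs * 1000))
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}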

 

The difference between createStream and createDirectStream when Spark reads Kafka data

 

1. KafkaUtils.createStream

Its signature is KafkaUtils.createStream(ssc, [zk quorum], [consumer group id], [map of per-topic number of partitions]).
Receivers are used to receive data, built on Kafka's high-level consumer API. The data received by all Receivers is stored in Spark executors, and Spark Streaming then starts jobs to process it. Under the default configuration this data may be lost; to avoid loss, the Write Ahead Log (WAL) can be enabled so that received data is also stored on a reliable file system such as HDFS.
A. A Receiver is created to pull data from Kafka at regular intervals. The RDD partitions in Spark Streaming and the topic partitions in Kafka are not the same concept, so increasing the number of partitions of a topic only increases the number of threads consuming that topic within a single Receiver; it does not increase the parallelism with which Spark processes the data.
B. Multiple Receivers can be used to create different DStreams for different groups and topics.
C. If WAL is enabled, the storage level needs to be set accordingly, i.e. KafkaUtils.createStream(..., StorageLevel.MEMORY_AND_DISK_SER).
2. KafkaUtils.createDirectStream

Unlike the receiver-based approach, this method periodically queries the latest offsets of each Kafka topic+partition and then processes each batch of data according to the defined offset ranges, using Kafka's simple (low-level) consumer API.
Advantages:
A. Simplified parallelism: there is no need to create multiple Kafka input streams. This method creates as many RDD partitions as there are Kafka partitions and reads from Kafka in parallel.
B. Efficiency: this method does not require a WAL. The WAL mode copies the data twice, once by Kafka and a second time when writing to the WAL.
C. Exactly-once semantics: the traditional way of reading Kafka data writes offsets into ZooKeeper through Kafka's high-level API, and data can be lost or reprocessed when the offsets in ZooKeeper and those tracked by Spark Streaming diverge. The direct approach uses Kafka's low-level API, and offsets are saved only by Spark Streaming in its checkpoints, which eliminates this inconsistency. The drawback is that ZooKeeper-based Kafka monitoring tools can no longer show consumption progress. A hedged usage sketch follows.
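
For comparison, here is a minimal sketch of the direct approach using the Spark 1.x spark-streaming-kafka (Kafka 0.8) API. The broker address, topic name, and batch interval are placeholders borrowed from this article's example rather than code from the original post.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Direct Stream Sketch")
    val ssc = new StreamingContext(conf, Seconds(2))

    // brokers are addressed directly; ZooKeeper is not used for fetching data
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "192.168.1.1:9092")
    val topics = Set("user-behavior-topic")

    // one RDD partition is created per Kafka partition of the topic
    val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // message values are in the second element of each (key, value) pair
    directStream.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()
  }
}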
