Integrating Spark with Kafka and Manually Maintaining Offsets

Keywords: Spark kafka MySQL Apache

Two Modes for Integrating Spark with Kafka

In development, we often use Spark Streaming to read and process Kafka data in real time. Since Spark 1.3, KafkaUtils has provided two ways to create a DStream:

1. Receiver mode: KafkaUtils.createStream

A Receiver runs as a long-lived task inside an Executor and waits for data. A single Receiver has limited throughput, so you need to start several, merge (union) their streams manually, and only then process the data, which is cumbersome. If the machine hosting a Receiver goes down, part of the data is lost, so the WAL (write-ahead log) must be enabled to keep the data safe, and that lowers throughput further.
The Receiver connects to Kafka through ZooKeeper using Kafka's high-level consumer API. Offsets are stored in ZooKeeper and maintained by the Receiver, while the received data is stored in the Executors.
To avoid losing data, Spark also records its own progress in the checkpoint. This copy can disagree with the offsets in ZooKeeper, which can lead to inconsistent data, for example records being processed more than once.
For these reasons, Receiver mode is not recommended for development.

2. Direct mode: KafkaUtils.createDirectStream

Direct mode connects directly to the Kafka partitions and reads data with Kafka's low-level consumer API, and Spark stores and maintains the offsets itself. By default the offsets are kept by Spark in the checkpoint, which removes the inconsistency with ZooKeeper (they can also be maintained manually, for example in MySQL or Redis). Because data is read directly from each partition, parallelism improves considerably.
Direct mode is therefore the one to use in development. Combined with manually maintained offsets, it makes exactly-once processing achievable, provided the results and the offsets are saved atomically or the output is idempotent.
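To make the offset bookkeeping concrete, here is a toy in-memory model in plain Scala (no Spark or Kafka involved) of how a Direct-mode batch can be thought of. The names PartitionRange, planBatch and advance are made up for this illustration and are not Spark APIs; the assumption is only that each batch is fully described by one (fromOffset, untilOffset) range per partition, which is the same information Spark's OffsetRange carries.

```scala
// Toy model only: these names are hypothetical and not part of Spark or Kafka.
final case class PartitionRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
  // Number of records in the range; untilOffset is exclusive
  def count: Long = untilOffset - fromOffset
}

object DirectModeModel {
  // Plan one batch: each partition is read from its stored offset up to the
  // latest offset currently available. A partition without a stored offset
  // starts at latest, mirroring auto.offset.reset = latest.
  def planBatch(topic: String, stored: Map[Int, Long], latest: Map[Int, Long]): Seq[PartitionRange] =
    latest.toSeq.sortBy(_._1).map { case (partition, end) =>
      val start = stored.getOrElse(partition, end)
      PartitionRange(topic, partition, start, end)
    }

  // After a batch has been processed, the offset to store for each partition
  // is the untilOffset of its range.
  def advance(stored: Map[Int, Long], batch: Seq[PartitionRange]): Map[Int, Long] =
    stored ++ batch.map(r => r.partition -> r.untilOffset)
}
```

Because the ranges are computed only from the stored offsets, planning a batch again after a failure yields the same ranges as long as the stored offsets were not advanced. That deterministic replay is what exactly-once processing builds on.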

The following code demonstrates how to manually save offsets to a MySQL database (using the spark-streaming-kafka-0-10 integration):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{OffsetRange, _}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.kafka.common.TopicPartition
import java.sql.{DriverManager, ResultSet}
import scala.collection.mutable

// Manually maintain Kafka offsets in a MySQL database
object SparkKafkaOffset {
  def main(args: Array[String]): Unit = {
    //1. Prepare the environment
    val conf = new SparkConf().setAppName("offset").setMaster("local[*]")
    val sc = new SparkContext(conf)
    //Cut the stream into one batch (one RDD) every five seconds
    val ssc = new StreamingContext(sc, Seconds(5))
    //Parameters for connecting to Kafka
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "cdh01:9092,cdh02:9092,cdh03:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "SparkKafkaOffset",
      //Where to start when no offset is recorded: latest = only new messages
      "auto.offset.reset" -> "latest",
      //Disable automatic commits because the offsets are maintained manually
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topics = Array("spark_kafka")
    //2. Use KafkaUtils to connect to Kafka and get data
    val offsetMap: mutable.Map[TopicPartition, Long] = OffsetUtil.getOffsetMap("SparkKafkaOffset","spark_kafka")
    val recordDStream: InputDStream[ConsumerRecord[String, String]] = if(offsetMap.size > 0){
      //Offsets are recorded in MySQL, so start consuming from them
      KafkaUtils.createDirectStream[String, String](ssc,
      LocationStrategies.PreferConsistent,//Location strategy: distributes the Kafka partitions evenly across the available Executors
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams,offsetMap))//Consumer strategy
    }else{
      //No offset is recorded in MySQL, so connect directly and consume from latest
      KafkaUtils.createDirectStream[String, String](ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))
    }
    //3. Process the data
    //Note: because we maintain the offsets manually, we should save them once for every small batch of data consumed.
    //In a DStream each small batch is represented as an RDD, so we need to operate on the RDDs in the DStream.
    //The APIs that operate on the RDDs in a DStream are transform and foreachRDD.
    recordDStream.foreachRDD(rdd => {
      if(rdd.count() > 0){//The current batch contains data
        rdd.foreach(record => println("Data received from Kafka: " + record))
        //Each received record is a ConsumerRecord(topic = spark_kafka, partition = 1, offset = 6, CreateTime = 1565400670211, checksum = 1551891492, serialized key size = 1, serialized value = 43, key = null, value = Hadoop spark...)
        //Note: the printed record shows both the offset we need to maintain and the data we need to process.
        //Next, you can process the data here, or use transform to return a new DStream as before.
        //Maintaining offsets: Spark provides the OffsetRange class, which encapsulates the offset data for us.
        val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        for (o <- offsetRanges){
          println(s"topic=${o.topic},partition=${o.partition},fromOffset=${o.fromOffset},untilOffset=${o.untilOffset}")
        }
        //Manual submission of offsets. Committing them back to Kafka would look like this:
        //recordDStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
        //Here they are saved to MySQL instead
        OffsetUtil.saveOffsetRanges("SparkKafkaOffset", offsetRanges)
      }
    })

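    /* Word count on the received values, kept for reference:
    val lineDStream: DStream[String] = recordDStream.map(_.value()) //_ refers to a ConsumerRecord
    val wordDStream: DStream[String] = lineDStream.flatMap(_.split(" ")) //_ refers to the value passed in, that is, one line of data
    val wordAndOneDStream: DStream[(String, Int)] = wordDStream.map((_,1))
    val result: DStream[(String, Int)] = wordAndOneDStream.reduceByKey(_+_)
    result.print()*/
    ssc.start()//Start the streaming computation
    ssc.awaitTermination()//Block until the computation is stopped
  }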

  /*
  Utility class for manually maintaining offsets
  First create the following table in MySQL:
    CREATE TABLE `t_offset` (
      `topic` varchar(255) NOT NULL,
      `partition` int(11) NOT NULL,
      `groupid` varchar(255) NOT NULL,
      `offset` bigint(20) DEFAULT NULL,
      PRIMARY KEY (`topic`,`partition`,`groupid`)
    );
  */
  object OffsetUtil {

    //Read the offsets from the database
    //The connection URL, user and password are placeholders; adjust them to your environment
    def getOffsetMap(groupid: String, topic: String): mutable.Map[TopicPartition, Long] = {
      val connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/spark?characterEncoding=UTF-8", "root", "123456")
      val pstmt = connection.prepareStatement("select * from t_offset where groupid=? and topic=?")
      pstmt.setString(1, groupid)
      pstmt.setString(2, topic)
      val rs: ResultSet = pstmt.executeQuery()
      val offsetMap = mutable.Map[TopicPartition, Long]()
      while (rs.next()) {
        offsetMap += new TopicPartition(rs.getString("topic"), rs.getInt("partition")) -> rs.getLong("offset")
      }
      //Release the JDBC resources
      rs.close()
      pstmt.close()
      connection.close()
      offsetMap
    }

    //Save the offsets to the database
    def saveOffsetRanges(groupid: String, offsetRange: Array[OffsetRange]): Unit = {
      val connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/spark?characterEncoding=UTF-8", "root", "root")
      //replace into: overwrite the row if the key already exists, insert it otherwise
      val pstmt = connection.prepareStatement("replace into t_offset (`topic`, `partition`, `groupid`, `offset`) values(?,?,?,?)")
      for (o <- offsetRange) {
        pstmt.setString(1, o.topic)
        pstmt.setInt(2, o.partition)
        pstmt.setString(3, groupid)
        pstmt.setLong(4, o.untilOffset)
        pstmt.executeUpdate()
      }
      //Release the JDBC resources
      pstmt.close()
      connection.close()
    }
  }
}
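
Saving offsets after the data has been processed gives at-least-once behavior: if the job fails between writing the results and writing the offsets, the batch is processed again on restart. One common way to tighten this is to write the results and the offsets in a single database transaction. The following is a minimal sketch of that idea using plain JDBC, not code from the job above. The names TxSketch, inTransaction and saveBatch, and the result table t_wordcount(word, cnt) with word as its primary key, are assumptions made for illustration.

```scala
import java.sql.Connection

// Sketch only: TxSketch, saveBatch and the t_wordcount table are hypothetical.
object TxSketch {
  // Run `body` inside one JDBC transaction:
  // commit if it returns normally, roll back if it throws.
  def inTransaction[A](connection: Connection)(body: Connection => A): A = {
    val previous = connection.getAutoCommit
    connection.setAutoCommit(false)
    try {
      val result = body(connection)
      connection.commit()
      result
    } catch {
      case e: Throwable =>
        connection.rollback()
        throw e
    } finally {
      connection.setAutoCommit(previous)
    }
  }

  // Write the word counts of one batch and the batch's offsets atomically.
  // offsets holds (topic, partition, untilOffset) triples.
  def saveBatch(connection: Connection, groupid: String,
                counts: Map[String, Int], offsets: Seq[(String, Int, Long)]): Unit =
    inTransaction(connection) { c =>
      val results = c.prepareStatement(
        "insert into t_wordcount (word, cnt) values (?, ?) " +
          "on duplicate key update cnt = cnt + values(cnt)")
      for ((word, n) <- counts) {
        results.setString(1, word)
        results.setInt(2, n)
        results.executeUpdate()
      }
      results.close()
      val off = c.prepareStatement(
        "replace into t_offset (`topic`, `partition`, `groupid`, `offset`) values(?,?,?,?)")
      for ((topic, partition, untilOffset) <- offsets) {
        off.setString(1, topic)
        off.setInt(2, partition)
        off.setString(3, groupid)
        off.setLong(4, untilOffset)
        off.executeUpdate()
      }
      off.close()
    }
}
```

In a streaming job, saveBatch would be called on the driver inside foreachRDD, after the aggregated counts of the batch have been collected, with the offsets taken from the batch's offsetRanges. If either write fails, neither is kept, so a restart resumes from offsets that match the stored results.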

Posted by zilem on Wed, 04 Sep 2019 20:35:09 -0700