Running Principle of Spark Streaming

Keywords: Big Data Spark Apache socket

Sequence diagram

1. NetworkWordCount

package yk.streaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]) {
    // Two local threads: one for the socket receiver, one for processing
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    // Batch interval of 1 second
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    // DStream of text lines read from a TCP socket on localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    // Start receiving and processing, then block until the application is stopped
    ssc.start()
    ssc.awaitTermination()
  }
}
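To try the example locally, first start a simple data server with netcat, for example nc -lk 9999, then run the application; every line typed into the netcat session is split into words, counted, and printed once per one-second batch.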

2. Initialize StreamingContext

For a Spark Streaming application, the first step is to initialize the StreamingContext. A SparkConf object and a Duration (the batch interval) are passed to the auxiliary constructor of StreamingContext. Note that the checkpoint is set to null by default here; the code is as follows:

  /**
   * Create a StreamingContext by providing the configuration necessary for a new SparkContext.
   * @param conf a org.apache.spark.SparkConf object specifying Spark parameters
   * @param batchDuration the time interval at which streaming data will be divided into batches
   */
  def this(conf: SparkConf, batchDuration: Duration) = {
    this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
  }

The auxiliary constructor then calls the primary constructor, which executes all of the statements in the class body (in other words, it initializes every field; methods run only when explicitly called). During this initialization of the StreamingContext, member variables such as DStreamGraph, JobScheduler and StreamingTab are created. DStreamGraph is analogous to the directed acyclic graph of RDDs: it holds the DAG of dependencies between DStreams. JobScheduler periodically inspects the DStreamGraph and generates jobs to run on the incoming data. StreamingTab adds monitoring of the streaming computation to the web UI while the application is running. The relevant code is as follows:

  private[streaming] val graph: DStreamGraph = {
    if (isCheckpointPresent) {
      _cp.graph.setContext(this)
      _cp.graph.restoreCheckpointData()
      _cp.graph
    } else {
      require(_batchDur != null, "Batch duration for StreamingContext cannot be null")
      val newGraph = new DStreamGraph()
      newGraph.setBatchDuration(_batchDur)
      newGraph
    }
  }
  private[streaming] val scheduler = new JobScheduler(this)
  private[streaming] val uiTab: Option[StreamingTab] =
    if (conf.getBoolean("spark.ui.enabled", true)) {
      Some(new StreamingTab(this))
    } else {
      None
    }

3. Create InputDStream

In the example, the socketTextStream method of StreamingContext is then called to create the concrete InputDStream. The method takes three parameters: hostname and port identify the server to connect to, and StorageLevel.MEMORY_AND_DISK_SER sets the storage level for the received data.

  /**
   * Creates an input stream from TCP source hostname:port. Data is received using
   * a TCP socket and the receive bytes is interpreted as UTF8 encoded `\n` delimited
   * lines.
   * @param hostname      Hostname to connect to for receiving data
   * @param port          Port to connect to for receiving data
   * @param storageLevel  Storage level to use for storing the received objects
   *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
   * @see [[socketStream]]
   */
  def socketTextStream(
      hostname: String,
      port: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
    ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
    socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
  }

Tracing into the socketStream method shows that it creates a SocketInputDStream.
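For reference, the body of socketStream (abridged here from the Spark source; the exact signature can differ slightly between Spark versions) does nothing more than construct the SocketInputDStream with the supplied converter and storage level:

  def socketStream[T: ClassTag](
      hostname: String,
      port: Int,
      converter: (InputStream) => Iterator[T],
      storageLevel: StorageLevel
    ): ReceiverInputDStream[T] = {
    new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
  }

SocketInputDStream is a ReceiverInputDStream whose getReceiver method returns a SocketReceiver; when the receiver is launched it opens the TCP connection, converts the received bytes into lines, and stores them as blocks for later processing.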

4. Start Job Scheduler

After the InputDStream is created, the start method of StreamingContext is called to launch the Spark Streaming application. The most important part of this step is starting the JobScheduler: as it starts up, the JobScheduler instantiates and starts the ReceiverTracker and the JobGenerator.
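The following is an abridged sketch of JobScheduler.start(), based on the Spark 2.x source; details such as rate controllers and dynamic executor allocation are omitted and may differ between Spark versions:

  def start(): Unit = synchronized {
    if (eventLoop != null) return // scheduler has already been started

    logDebug("Starting JobScheduler")
    // Event loop that handles JobStarted, JobCompleted and ErrorReported events
    eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
      override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
      override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
    }
    eventLoop.start()

    listenerBus.start()
    receiverTracker = new ReceiverTracker(ssc)
    inputInfoTracker = new InputInfoTracker(ssc)

    // Launch the receivers on the executors, then begin generating jobs per batch
    receiverTracker.start()
    jobGenerator.start()
    logInfo("Started JobScheduler")
  }

Once jobGenerator.start() is called, the JobGenerator uses a recurring timer that fires once per batch interval and asks the DStreamGraph to generate the jobs for each batch, which the JobScheduler then submits for execution.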
