Spark Streaming of big data technology

Keywords: Spark Apache kafka socket

1: Overview

1. Definition:

Spark Streaming is used to process streaming data. It supports many input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once data has been ingested, it can be processed with Spark's high-level operators such as map, reduce, join, and window. The results can be saved to many destinations, such as HDFS or a database.

Just as Spark is built around the concept of the RDD, Spark Streaming uses a discretized stream, called a DStream, as its abstraction. A DStream is a sequence of data received over time. Internally, the data received in each time interval is stored as an RDD, and a DStream is a sequence of these RDDs (hence the name "discretized").
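
The idea of cutting a stream into per-interval batches can be sketched without Spark. The following is only an illustration: the Event type and the toBatches helper are hypothetical names, and plain Scala collections stand in for RDDs.

```scala
object Discretize {
  // A hypothetical timestamped record; not part of any Spark API.
  final case class Event(timestampMs: Long, value: String)

  // Group events into consecutive batches of `intervalMs` milliseconds,
  // ordered by time. Each inner Seq plays the role of one RDD.
  def toBatches(events: Seq[Event], intervalMs: Long): Seq[Seq[String]] =
    events
      .groupBy(_.timestampMs / intervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (_, batch) => batch.sortBy(_.timestampMs).map(_.value) }
}
```

Unlike this sketch, which simply skips intervals with no events, Spark Streaming still produces an (empty) RDD for an interval in which nothing arrived.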

  2. Characteristics

    1) Easy to use

    2) Fault tolerant

    3) Easy to integrate into the Spark ecosystem

2: Getting started with DStream

  1. WordCount example

    1. Requirement: use the netcat tool to continuously send messages to port 9999, then read the port data with Spark Streaming and count the occurrences of each word
    2. Add the dependency
    3. Write the code
    package com.ityouxin.streaming
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    object WordCount {
      def main(args: Array[String]): Unit = {
        //1. Initialize Spark configuration information
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamWordCount")
        //2. Initialize the StreamingContext with a 2-second batch interval
        val ssc = new StreamingContext(sparkConf, Seconds(2))
        //3. Create a DStream by monitoring the port; data is read line by line
        val lineStreams = ssc.socketTextStream("hadoop102", 9999)
        //Alternative: create the DStream by monitoring a directory
        //val lineStreams = ssc.textFileStream("hdfs://hadoop102:9000/fileStream")
        //4. Split each line into words (fields are tab-separated here)
        val wordStreams = lineStreams.flatMap(_.split("\t"))
        //5. Map each word to a tuple (word, 1)
        val wordAndOneStreams = wordStreams.map((_, 1))
        //6. Sum the counts of identical words
        val wordAndCountStreams = wordAndOneStreams.reduceByKey(_ + _)
        //7. Print the result of each batch
        wordAndCountStreams.print()
        //8. Start the StreamingContext and wait for termination
        ssc.start()
        ssc.awaitTermination()
      }
    }
    4. Start netcat on port 9999, then start the program and send data through netcat
    nc -lk 9999
    5. WordCount analysis
    The discretized stream (DStream) is the basic abstraction of Spark Streaming. It represents a continuous data stream, either the input stream or the result stream produced by applying Spark operators to it. Internally, a DStream is represented by a series of consecutive RDDs, and each RDD contains the data of one time interval.
  2. DStream creation

    Spark Streaming natively supports several different data sources. Some "core" data sources are packaged in the Spark Streaming Maven artifact, while others are available through additional artifacts such as spark-streaming-kafka. Each receiver runs as a long-running task in a Spark executor, so it occupies one of the CPU cores assigned to the application. Additional cores are needed to process the data. This means that to run multiple receivers you need at least as many cores as receivers, plus however many cores the computation requires. For example, to run 10 receivers in a streaming application, at least 11 CPU cores must be allocated to it. For the same reason, when running in local mode, do not use local or local[1] as the master.
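
The core-count rule above is simple arithmetic. As a rough sketch (the ReceiverCores object and its method names are made up for illustration and are not part of Spark), it could be encoded like this:

```scala
object ReceiverCores {
  // Minimum cores: one per receiver plus the cores used for processing
  // (at least one). Hypothetical helper, not a Spark API.
  def minCores(receivers: Int, processingCores: Int = 1): Int = {
    require(receivers >= 0 && processingCores >= 1, "invalid core counts")
    receivers + processingCores
  }

  // Master URL for local mode with just enough threads for the receivers.
  def localMaster(receivers: Int): String = s"local[${minCores(receivers)}]"
}
```

For a single socket receiver this yields local[2], which is the smallest local master that leaves a thread free for processing; local[*] also works on a machine with at least two cores.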

    1. File data source

    File data stream: Spark Streaming can read from any file system compatible with the HDFS API through the fileStream method (textFileStream for text files). It monitors the dataDirectory directory and continuously processes files moved into it. Note that nested directories are not currently supported.
    Code: streamingContext.textFileStream(dataDirectory)
    Notes:
    1) The files must all have the same data format;
    2) Files must enter the data directory by being moved or renamed into it;
    3) Once a file has been moved into the directory it must not be modified; if it is modified, the new data will not be read;
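
The "move or rename" requirement exists so that the monitor never picks up a half-written file. A minimal sketch of that pattern on a local file system follows; the AtomicDrop object and its parameters are hypothetical, and it assumes the staging and watched directories are on the same file system (on HDFS the equivalent step would be a rename, for example with hadoop fs -mv).

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object AtomicDrop {
  // Write `content` to a staging file first, then move it into `watchedDir`
  // in a single step so a directory monitor never sees a partial file.
  def drop(stagingDir: Path, watchedDir: Path, name: String, content: String): Path = {
    val staged = stagingDir.resolve(name + ".tmp")
    Files.write(staged, content.getBytes(StandardCharsets.UTF_8))
    // ATOMIC_MOVE fails rather than falling back to copy-and-delete.
    Files.move(staged, watchedDir.resolve(name), StandardCopyOption.ATOMIC_MOVE)
  }
}
```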

    Case practice

    1. Create directories and files
    hadoop fs -mkdir /fileStream
    touch a.tsv
    2. Writing code
    package com.ityouxin
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.DStream
    object FileStream {
      def main(args: Array[String]): Unit = {
        //1. Initialize Spark configuration information
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamWordCount")
        //2. Initialize the StreamingContext with a 5-second batch interval
        val ssc = new StreamingContext(sparkConf, Seconds(5))
        //3. Create a DStream by monitoring the directory
        val dirStream = ssc.textFileStream("hdfs://hadoop102:9000/fileStream")
        //4. Split each line into words (the .tsv fields are tab-separated)
        val wordStreams = dirStream.flatMap(_.split("\t"))
        //5. Map each word to a tuple (word, 1)
        val wordAndOneStreams = wordStreams.map((_, 1))
        //6. Sum the counts of identical words
        val wordAndCountStreams = wordAndOneStreams.reduceByKey(_ + _)
        //7. Print the result of each batch
        wordAndCountStreams.print()
        //8. Start the StreamingContext and wait for termination
        ssc.start()
        ssc.awaitTermination()
      }
    }
    3. Start the program and upload a file to the fileStream directory
    hadoop fs -put ./a.tsv /fileStream
    2. RDD queue (for reference)

    For testing, you can create a DStream with ssc.queueStream(queueOfRDDs). Each RDD pushed into the queue is treated as one batch of data in the DStream.

    package com.ityouxin
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.{DStream, InputDStream}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import scala.collection.mutable
    object RDDStream {
      def main(args: Array[String]): Unit = {
        //1. Initialize Spark configuration information
        val conf = new SparkConf().setMaster("local[*]").setAppName("RDDStream")
        //2. Initialize the StreamingContext with a 4-second batch interval
        val ssc = new StreamingContext(conf, Seconds(4))
        //3. Create the RDD queue
        val rddQueue = new mutable.Queue[RDD[Int]]()
        //4. Create the QueueInputDStream; oneAtATime = false consumes all queued RDDs per batch
        val inputStream = ssc.queueStream(rddQueue, oneAtATime = false)
        //5. Process the RDD data in the queue
        val mappedStream = inputStream.map((_, 1))
        val reducedStream = mappedStream.reduceByKey(_ + _)
        //6. Print results
        reducedStream.print()
        //7. Start the task
        ssc.start()
        //8. Create RDDs in a loop and put them into the queue
        for (i <- 1 to 5) {
          rddQueue += ssc.sparkContext.makeRDD(1 to 300, 10)
          //Pause between pushes (interval assumed) so the RDDs spread over several batches
          Thread.sleep(2000)
        }
        ssc.awaitTermination()
      }
    }
    3. Custom data source

    1. Usage

      Extend Receiver and implement the onStart and onStop methods to define how data is collected from the custom source.

    2. Case practice

      Requirement: define a custom data source that monitors a port and receives the data sent to it
      1. Receiver class
      package com.ityouxin.streaming
      import java.io.{BufferedReader, InputStreamReader}
      import java.net.Socket
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.receiver.Receiver
      class CustomerReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_ONLY) {
        //Initialize resources: start a new thread to receive data
        override def onStart(): Unit = {
          println("onStart called")
          //The new thread runs the custom receive method
          new Thread("Socket Receiver") {
            override def run(): Unit = {
              receive()
            }
          }.start()
        }
        //Release resources (the receiving thread stops itself via isStopped())
        override def onStop(): Unit = {
          println("onStop called")
        }
        def receive(): Unit = {
          var bufferedReader: BufferedReader = null
          var socket: Socket = null
          try {
            //Create a Socket connection
            socket = new Socket(host, port)
            println("Socket connected")
            //Read data
            bufferedReader = new BufferedReader(new InputStreamReader(socket.getInputStream, "UTF-8"))
            //Read one line at a time
            var lineString = bufferedReader.readLine()
            while (lineString != null && !isStopped()) {
              println("Read data: " + lineString)
              //Hand the line over to Spark
              store(lineString)
              lineString = bufferedReader.readLine()
            }
          } catch {
            case error: Throwable => error.printStackTrace()
          } finally {
            //Release the reader and close the connection
            if (bufferedReader != null) bufferedReader.close()
            if (socket != null) socket.close()
          }
          //Restart the receiver so it tries to connect again
          restart("restart")
        }
      }
      2. Application class
      package com.ityouxin.streaming
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      object CustomerStreaming {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setMaster("local[*]").setAppName("CustomerStreaming")
          val ssc = new StreamingContext(conf, Seconds(5))
          //Create the DStream from the custom receiver
          val lineStream: ReceiverInputDStream[String] = ssc.receiverStream(new CustomerReceiver("hadoop102", 9999))
          //Split each line of data into words
          val wordStream = lineStream.flatMap(_.split(" "))
          //Map words to tuples and sum the counts per word
          val wordCount: DStream[(String, Int)] = wordStream.map((_, 1)).reduceByKey(_ + _)
          //Print the result, then start the StreamingContext
          wordCount.print()
          ssc.start()
          ssc.awaitTermination()
        }
      }
    4. Kafka data source (important)

      1. Instructions for use

        The Maven artifact spark-streaming-kafka-0-10_2.11 must be added to the project. The KafkaUtils object in this package can create a DStream from Kafka messages for a StreamingContext or JavaStreamingContext. Because KafkaUtils can subscribe to multiple topics, each element of the DStream is a ConsumerRecord that carries the topic together with the message key and value. With the 0-10 integration used in the case below, the stream is created by calling KafkaUtils.createDirectStream() with the StreamingContext instance, a location strategy, and a consumer strategy that holds the topics and the Kafka consumer parameters (broker list, consumer group name, deserializers). The older receiver-based integration instead used createStream(), which took a comma-separated ZooKeeper host list, the consumer group name, and a map from each topic to the number of receiver threads for that topic.

      2. Case practice

        Requirement: read data from Kafka through SparkStreaming, do WordCount, and print to the console

        1. Start Kafka and create topicA
        2. Import the dependency
        3. Write the code
        package com.ityouxin.streaming
        import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
        import org.apache.kafka.common.serialization.StringDeserializer
        import org.apache.spark.SparkConf
        import org.apache.spark.rdd.RDD
        import org.apache.spark.streaming.dstream.InputDStream
        import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        object KafkaStream {
          def main(args: Array[String]): Unit = {
            //Initialize configuration
            val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("KafkaStream")
            //Initialize StreamingContext
            val ssc = new StreamingContext(conf, Seconds(5))
            val brokers = "hadoop102:9092,hadoop103:9092,hadoop104:9092"
            val consumerGroup = "spark"
            val topics = Array("topicA")
            //Kafka consumer parameters: brokers, consumer group, and deserializers
            val kafkaParams: Map[String, Object] = Map(
              ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers,
              ConsumerConfig.GROUP_ID_CONFIG -> consumerGroup,
              ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
              ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
            )
            //Create Kafka's DStream (PreferConsistent is assumed as the location strategy)
            val kafkaDStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
              ssc,
              LocationStrategies.PreferConsistent,
              ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))
            //Output
            kafkaDStream.foreachRDD((rdd: RDD[ConsumerRecord[String, String]]) => {
              //Split messages into words
              val words: RDD[String] = rdd.flatMap(_.value().split(" "))
              //Map words to tuples
              val wordOneRDD: RDD[(String, Int)] = words.map((_, 1))
              //Aggregate the counts by key
              val wordCountRDD: RDD[(String, Int)] = wordOneRDD.reduceByKey(_ + _)
              //Print results on the driver console
              wordCountRDD.collect().foreach(println)
            })
            //Start the StreamingContext and wait for termination
            ssc.start()
            ssc.awaitTermination()
          }
        }
        //Create topic
        //bin/kafka-topics.sh --zookeeper hadoop102:2181 --create --replication-factor 3 --partitions 1 --topic topicA
        //Produce messages
        //bin/kafka-console-producer.sh --broker-list hadoop102:9092 --topic topicA

3: DStream transformations

The operators on a DStream are similar to those on an RDD and are divided into transformations and output operations. The transformations also include some special operators, such as updateStateByKey(), transform(), and the various window-related operators.

  1. Stateless transformations

    A stateless transformation applies a simple RDD transformation to each batch, that is, to every RDD in the DStream. Common stateless transformations include map(), flatMap(), filter(), repartition(), reduceByKey(), and groupByKey(). Note that in Scala, before Spark 1.3, using the key-value pair transformations of DStream (such as reduceByKey()) required import StreamingContext._; later versions pick up the implicit conversions automatically.

    It is important to remember that although these functions appear to act on the entire stream, each DStream is internally composed of many RDDs (batches), and a stateless transformation is applied to each RDD separately. For example, reduceByKey() reduces the data within each time interval, but not across intervals.

    For example, the earlier WordCount program only counts the words received within each batch interval; the counts are not accumulated across batches.

    Stateless transformations can also combine data from multiple DStreams, again within each time interval. For example, DStreams of key-value pairs support the same join-related transformations as RDDs, namely cogroup(), join(), leftOuterJoin(), and so on. Applying these operations to DStreams performs the corresponding RDD operation on each batch.

    As in regular Spark, we can use the union() operation of DStream to merge its content with that of another DStream, or use StreamingContext.union() to merge multiple streams.
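
To make the per-batch behaviour concrete, here is a small simulation in plain Scala. It is only a sketch: a Seq of batches stands in for a DStream, and the StatelessBatches object is a hypothetical name rather than anything from Spark.

```scala
object StatelessBatches {
  // Count words within a single batch, like reduceByKey applied to one RDD.
  def countBatch(batch: Seq[String]): Map[String, Int] =
    batch.groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }

  // Apply the same function to every batch independently;
  // nothing is carried over from one batch to the next.
  def countEachBatch(batches: Seq[Seq[String]]): Seq[Map[String, Int]] =
    batches.map(countBatch)
}
```

Given the batches Seq("a", "b", "a") and Seq("a"), the word "a" is counted as 2 in the first result and 1 in the second; the two counts are never added together.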

  2. Stateful transformations: updateStateByKey

    The updateStateByKey operator is used to keep history. Sometimes we need to maintain state across batches in a DStream (for example, a running word count). For this, updateStateByKey() gives access to a state variable for DStreams of key-value pairs. Given a DStream of (key, event) pairs and a function that specifies how to update the state of each key from new events, it builds a new DStream whose elements are (key, state) pairs. The RDDs of the resulting DStream consist of the (key, state) pairs for each time interval. updateStateByKey lets us maintain arbitrary state while continuously updating it with new information.

    To use this function, you need to do the following two steps:

    1. Define the state, which can be an arbitrary data type.

    2. Define a state update function that specifies how to update the state using the previous state and the new values from the input stream.

    Using updateStateByKey requires a checkpoint directory to be configured; checkpoints are used to save the state.

    ssc.checkpoint("hdfs://hadoop102:9000/streamCheck")
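
As an illustration of the two steps, the sketch below defines an Int state (a running count) and an update function with the shape that updateStateByKey accepts for Int values. The RunningCount object and the fold helper are hypothetical; fold only simulates how the state of one key would evolve over several batches.

```scala
object RunningCount {
  // State update function: add the new values of this batch to the
  // previous state (0 if the key has not been seen before).
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (newValues, state) => Some(newValues.sum + state.getOrElse(0))

  // Simulate several batches for one key, threading the state through.
  def fold(batches: Seq[Seq[Int]]): Option[Int] =
    batches.foldLeft(Option.empty[Int])((state, values) => updateFunc(values, state))
}
```

In a streaming job, assuming pairs is a DStream[(String, Int)] and a checkpoint directory has been set, the function would be passed as pairs.updateStateByKey[Int](RunningCount.updateFunc).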

  3. Window Operations

    reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]): a more efficient version of reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]), in which the reduce value of each window is calculated incrementally from the reduce value of the previous window. This is done by reducing the new data that enters the sliding window and "inverse reducing" the old data that leaves it. An example is adding and subtracting the counts of keys as the window slides. This function is therefore only applicable to "invertible reduce functions", that is, reduce functions that have a corresponding "inverse reduce" function (passed in as the parameter invFunc). As with reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note: checkpointing must be enabled to use this operation.

    countByValueAndWindow(windowLength, slideInterval, [numTasks]): when called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within the sliding window. As above, the number of reduce tasks is configurable.

    reduceByWindow() and reduceByKeyAndWindow() let us reduce each window more efficiently. They take a reduce function, such as +, and apply it over the whole window. They also have a special form that lets Spark compute the result incrementally by considering only the data entering the window and the data leaving it. This special form requires the inverse of the reduce function; for example, the inverse of + is -. For large windows, providing the inverse function can greatly improve performance.
package com.ityouxin.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowStrreamWordCount {
  def main(args: Array[String]): Unit = {
    val batchDuration = 3
    val conf = new SparkConf().setMaster("local[*]").setAppName("WindowStrreamWordCount")
    val ssc = new StreamingContext(conf, Seconds(batchDuration))
    val lineStream = ssc.socketTextStream("hadoop102", 9999)
    val wordStream = lineStream.flatMap(_.split(" "))
    //Set checkpoint (directory assumed; same path as in the updateStateByKey section)
    ssc.checkpoint("hdfs://hadoop102:9000/streamCheck")
    //Map words to tuples
    val pairs = wordStream.map((_, 1))
    //Window aggregation: the window length is 4 batch durations, the slide interval is 2 batch durations
    val wordCount: DStream[(String, Int)] = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(4 * batchDuration), Seconds(2 * batchDuration))
    //Print the result, then start the StreamingContext
    wordCount.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
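
The incremental form with an inverse function can be sketched in plain Scala. This is an illustration under simplifying assumptions, not Spark's implementation: each batch is reduced to a single Int, + is the reduce function, - is its inverse, and the IncrementalWindow object is a hypothetical name.

```scala
object IncrementalWindow {
  // Recompute every window from scratch by reducing all batches inside it.
  def fullSums(batchSums: Seq[Int], windowLen: Int, slide: Int): Seq[Int] =
    batchSums.sliding(windowLen, slide).filter(_.size == windowLen).map(_.sum).toList

  // Incremental version: previous window + entering batches - leaving batches.
  def incrementalSums(batchSums: Seq[Int], windowLen: Int, slide: Int): Seq[Int] = {
    require(slide >= 1 && slide <= windowLen, "slide must be in 1..windowLen")
    // Start index of every complete window.
    val starts = 0 to (batchSums.size - windowLen) by slide
    if (starts.isEmpty) Seq.empty
    else {
      val first = batchSums.take(windowLen).sum
      starts.tail.scanLeft(first) { (prev, start) =>
        val leaving = batchSums.slice(start - slide, start).sum
        val entering = batchSums.slice(start - slide + windowLen, start + windowLen).sum
        prev + entering - leaving
      }
    }
  }
}
```

With a window of 4 batches sliding by 2, the incremental version touches only the 2 batches that enter and the 2 that leave on each slide. In Spark the corresponding call would pass both functions, for example pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, (a: Int, b: Int) => a - b, windowDuration, slideDuration), where windowDuration and slideDuration are placeholders and checkpointing is enabled.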

4: DStream output

Output operations specify what to do with the data produced by the transformations of the stream (such as pushing the results to an external database or printing them to the screen). As with the lazy evaluation of RDDs, if no output operation is applied to a DStream or to the DStreams derived from it, none of them will be evaluated. If no output operation is set in the StreamingContext, the entire context will not start.

The output operations are as follows:

(1) print(): prints the first 10 elements of each batch of data in the DStream on the driver node running the streaming application. This is useful for development and debugging. In the Python API, the same operation is called pprint().

(2) saveAsTextFiles(prefix, [suffix]): saves the contents of this DStream as text files. The file name for each batch is generated from the prefix and suffix parameters: "prefix-TIME_IN_MS[.suffix]".

(3) saveAsObjectFiles(prefix, [suffix]): saves the data in the stream as SequenceFiles of serialized Java objects. The file name for each batch is generated from the parameters: "prefix-TIME_IN_MS[.suffix]". Currently not available in the Python API.

(4) saveAsHadoopFiles(prefix, [suffix]): saves the data in the stream as Hadoop files. The file name for each batch is generated from the parameters: "prefix-TIME_IN_MS[.suffix]". Currently not available in the Python API.

(5) foreachRDD(func): the most general output operation; it applies the function func to every RDD generated from the stream. func should push the data in each RDD to an external system, such as saving the RDD to files or writing it over the network to a database. Note that func is executed in the driver process of the streaming application, and it usually contains RDD actions that force the computation of the streaming RDDs.
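
The naming pattern used by the save operations in items (2) to (4) can be written out as a small helper. This is only a sketch of the documented "prefix-TIME_IN_MS[.suffix]" pattern; BatchFileName is a hypothetical name, not a Spark function.

```scala
object BatchFileName {
  // Build "prefix-TIME_IN_MS[.suffix]" for one batch;
  // the suffix part is omitted when no (or an empty) suffix is given.
  def apply(prefix: String, timeMs: Long, suffix: Option[String] = None): String = {
    val base = s"$prefix-$timeMs"
    suffix.filter(_.nonEmpty).fold(base)(s => s"$base.$s")
  }
}
```

Because the batch time is part of the name, every batch is written under a different name, so a long-running job produces one output per batch interval.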

foreachRDD() is the general-purpose output operation and is used to run arbitrary computations on the RDDs in a DStream. It is similar to transform() in that it gives us access to each RDD. Inside foreachRDD() we can reuse all the actions available in Spark. A common use case is writing data to an external database such as MySQL. Points to note:

(1) The connection cannot be created at the driver level, because it would have to be serialized and sent to the executors;

(2) If it is created inside foreach, a connection is created for every record, which costs more than it gains;

(3) Use foreachPartition and create the connection once per partition.
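
The third point can be sketched as a helper that opens one connection per partition. The Sink trait and the PartitionWriter object are hypothetical stand-ins for a real database client; a real job would wrap something like a JDBC connection.

```scala
object PartitionWriter {
  // Hypothetical connection interface, assumed for illustration only.
  trait Sink[T] {
    def send(record: T): Unit
    def close(): Unit
  }

  // Open one sink for the whole partition, write every record,
  // and always close the sink. Returns the number of records written.
  def writePartition[T](records: Iterator[T], open: () => Sink[T]): Int = {
    val sink = open()
    try {
      var written = 0
      records.foreach { r => sink.send(r); written += 1 }
      written
    } finally sink.close()
  }
}
```

Inside a streaming job the helper would be called from foreachPartition, roughly as dstream.foreachRDD(rdd => rdd.foreachPartition { it => PartitionWriter.writePartition(it, openSink); () }), where dstream and openSink are placeholders. The open function runs on the executor, so the connection itself never has to be serialized.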

Posted by croakingtoad on Mon, 10 Feb 2020 07:28:21 -0800