Examples of Basic Operating Functions in Spark Streaming

Keywords: Big Data Spark Apache Programming Windows

Guide: the operations in the Spark Streaming documentation can be roughly divided into Transformations, Window Operations, Join Operations, and Output Operations.



The complete code for this article is available in my Gitee (code cloud) repository.

First, some basic background:
DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a series of consecutive RDDs, Spark's abstraction of an immutable distributed dataset. Each RDD in a DStream contains data from a specific time interval.

Transformations


1) map(func): applies the function func to each element of the source DStream and returns a new DStream containing the resulting elements.

Code

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The remaining examples in this article assume the same imports.
object Map {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)
    val mapLines = lines.map(word => "map_" + word)

    mapLines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

2) flatMap(func): applies func to each element, where each element can produce zero or more new elements (for example, split below may produce zero or more words), and returns a new DStream containing these new elements.

Code

object FlatMap {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)
    val fmapLines = lines.flatMap(_.split(" "))

    fmapLines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

ps: The original post shows the RDD dependency diagram for map and flatMap here (a red block represents an RDD and a black block represents one of its partitions); both are narrow transformations, so each output partition depends only on its corresponding input partition.


3) filter(func): evaluates func on every element of the DStream; elements for which func returns true are retained, the rest are discarded, and a new DStream is returned.

Code

// StringUtils is assumed to come from Apache Commons Lang 3, which ships with Spark.
import org.apache.commons.lang3.StringUtils

object Filter {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)

    val filterLines = lines.flatMap(_.split(" "))
      .filter(!StringUtils.equals(_, "hello"))

    filterLines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

4) repartition(numPartitions): changes the level of parallelism of this DStream by creating more or fewer partitions, without changing its contents. A minimal sketch is shown below.
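The original post gives no example for repartition, so the following is only a sketch in the style of the other examples (the object name Repartition and the partition count 4 are my own choices):

object Repartition {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)
    // Spread the received data across 4 partitions before further processing
    val repartitioned = lines.repartition(4)

    repartitioned.print()

    ssc.start()
    ssc.awaitTermination()
  }
}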

5) union(otherStream): returns a new DStream containing the union of the elements in the source DStream and otherStream.

Code

object Union {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)
    val union1 = lines.map(word => "union1_" + word)
    val union2 = lines.map(word => "union2_" + word)
    val union1_2 = union1.union(union2)

    union1.print()
    union2.print()
    union1_2.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

6) count(): returns a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.

Code

object Count {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)
    // mapLines is a DStream[Array[String]]: one array per input line
    val mapLines = lines.map(_.split(" "))
    // fmapLines is a DStream[String]: one element per word
    val fmapLines = lines.flatMap(_.split(" "))

    // count arrays (i.e. lines) vs. count individual words
    mapLines.count().print()
    fmapLines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

7) reduce(func): returns a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using the function func (which takes two arguments and returns one); elements are combined pairwise.

Code

object Reduce {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)

    val fmapLines = lines.flatMap(_.split(" "))
    val result = fmapLines.reduce(_ + "*" + _)
    //fmapLines.reduce((a, b) => a + "*" + b)

    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

8) countByValue(): when called on a DStream of elements of type K, returns a new DStream of (K, Long) pairs, where the value for each key is its frequency in each RDD of the source DStream.

Code

object countByValue {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)

    val fmapLines = lines.flatMap(_.split(" "))
    val countByKey = fmapLines.countByValue()

    countByKey.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

9) reduceByKey(func, [numTasks]): when called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func.

Note: By default, this uses Spark's default number of parallel tasks for the grouping (2 in local mode; in cluster mode the number is determined by the configuration property spark.default.parallelism). The optional numTasks argument overrides this, as sketched below.
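For illustration only (tuple is the (word, 1) DStream built in the example that follows; the value 4 is an arbitrary choice), the number of reduce tasks can be passed explicitly:

    // Use 4 tasks for the shuffle instead of the default parallelism
    val reduceByKey4 = tuple.reduceByKey(_ + _, 4)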

Code

object ReduceByKey {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)

    val fmapLines = lines.flatMap(_.split(" "))
    val tuple = fmapLines.map(word => (word, 1))
    val reduceByKey = tuple.reduceByKey(_ + _)

    reduceByKey.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

10) join(otherStream, [numTasks]): when called on two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, (V, W)) pairs containing all pairs of elements for each key. numTasks optionally sets the parallelism.

Code

object Join {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)
    val words = lines.flatMap(_.split(" "))
    val join1 = words.map(word => (word, "join1_" + word))
    val join2 = words.map(word => (word, "join2_" + word))
    val join1_2 = join1.join(join2)

    join1.print()
    join2.print()
    join1_2.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

11) cogroup(otherStream, [numTasks]): when called on DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, Seq[V], Seq[W]) tuples. numTasks optionally sets the parallelism.

Code

object Cogroup {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)
    val words = lines.flatMap(_.split(" "))
    val cogroup1 = words.map(word => (word, "cogroup1_" + word))
    val cogroup2 = words.map(word => (word, "cogroup2_" + word))
    val cogroup1_2 = cogroup1.cogroup(cogroup2)

    cogroup1.print()
    cogroup2.print()
    cogroup1_2.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

12) transform(func): returns a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to apply arbitrary RDD operations on a DStream.

Code

object Transform {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)

    val words = lines.transform(rdd=>{
      rdd.flatMap(_.split(" "))
    })

    words.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

13) updateStateByKey(func): returns a new "state" DStream in which the state of each key is updated by applying the given function to the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.

Code

object UpdateStateByKey {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("FileWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    //Note that when using updateStateByKey you need to configure a checkpoint directory
    ssc.checkpoint("D:\\spark\\checkpoint")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)

    val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    val state = result.updateStateByKey[Int](updateFunction _)

    state.print()

    ssc.start()
    ssc.awaitTermination()
  }

  /**
    * Updates the running count for one key.
    * @param newValues    the new values for this key in the current batch
    * @param runningCount the previously accumulated count for this key, if any
    * @return             the updated count
    */
  def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {

    val current = newValues.sum
    val pre = runningCount.getOrElse(0)
    Some(current + pre)

  }
}

Result

Window Operations

Through train

window: operates on the data that falls within a sliding window of time.
window length: the duration of the window.
sliding interval: the interval at which the window operation is performed.
Both parameters must be multiples of the batch interval; otherwise an error is reported.

1) window(windowLength, slideInterval): returns a new DStream consisting, at each slide, of the elements that fall within the current window.

Code

object Window {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)

    val fmapLines = lines.flatMap(_.split(" "))
    //Every 5 seconds, process the data from the last 10 seconds
    val window = fmapLines.window(Seconds(10), Seconds(5))

    window.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

2) countByWindow(windowLength, slideInterval): similar to count, but applied over a sliding window of the DStream.

Code

object CountByWindow {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.checkpoint("D:\\spark\\checkpoint")
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)


    val fmapLines = lines.flatMap(_.split(" "))
    //Every 5 seconds, count the number of elements in the last 10 seconds of the DStream:
    val countByWindow = fmapLines.countByWindow(Seconds(10), Seconds(5))

    countByWindow.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

3) reduceByWindow(func, windowLength, slideInterval): similar to reduce, but applied over a sliding window of the DStream.

Code

object ReduceByWindow {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.checkpoint("D:\\spark\\checkpoint")
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)


    val fmapLines = lines.flatMap(_.split(" "))
    val reduceByWindow = fmapLines.reduceByWindow(_ + "*" + _, Seconds(10), Seconds(5))

    reduceByWindow.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

4) reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]): similar to reduceByKey, but applied over a sliding window of the DStream.

Code

object ReduceByKeyAndWindow {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.checkpoint("D:\\spark\\checkpoint")
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)

    val fmapLines = lines.flatMap(_.split(" "))
    val tuple = fmapLines.map(word => (word, 1))
    val reduceByKeyAndWindow = tuple.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(10), Seconds(5))

    reduceByKeyAndWindow.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

5) reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]): compared with the variant above, an additional inverse reduce function invFunc is passed in, which removes the contribution of batches that slide out of the window. Think of a subway station: people going in add 1, people coming out subtract 1, so the number inside can be maintained incrementally. Note that this variant requires checkpointing to be enabled, as in the example below.

Code

object ReduceByKeyAndWindow2 {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.checkpoint("D:\\spark\\checkpoint")
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)

    val fmapLines = lines.flatMap(_.split(" "))
    val tuple = fmapLines.map(word => (word, 1))
    val reduceByKeyAndWindow = tuple.reduceByKeyAndWindow((a: Int, b: Int) => (a + b), (a: Int, b: Int) => (a - b), Seconds(10), Seconds(5))

    reduceByKeyAndWindow.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

6) countByValueAndWindow(windowLength, slideInterval, [numTasks]): similar to countByValue, but applied over a sliding window of the DStream.

Code

object CountByValueAndWindow {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.checkpoint("D:\\spark\\checkpoint")
    ssc.sparkContext.setLogLevel("ERROR")

    val lines = ssc.socketTextStream("192.168.31.30", 9999)

    val fmapLines = lines.flatMap(_.split(" "))
    val countByValueAndWindow = fmapLines.countByValueAndWindow(Seconds(10), Seconds(5))

    countByValueAndWindow.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Result

Join Operations


1) Stream-stream joins

Streams of key-value pairs can be joined with each other by calling join, leftOuterJoin, rightOuterJoin, or fullOuterJoin; a small sketch of the outer-join variants follows.
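The original post only names these operators; as a minimal sketch, assuming stream1 and stream2 are (String, String) DStreams built like join1 and join2 in the Join example above:

    // leftOuterJoin keeps every key of stream1; missing right-hand values become None
    val leftJoined = stream1.leftOuterJoin(stream2)   // DStream[(String, (String, Option[String]))]
    // fullOuterJoin keeps keys from both sides, wrapping both values in Option
    val fullJoined = stream1.fullOuterJoin(stream2)   // DStream[(String, (Option[String], Option[String]))]

    leftJoined.print()
    fullJoined.print()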

2) Stream-dataset joins

val dataset: RDD[(String, String)] = ...
val windowedStream = stream.window(Seconds(20))
val joinedStream = windowedStream.transform { rdd => rdd.join(dataset) }

By calling transform, each windowed RDD of the stream can be joined with the static dataset.

Output Operations


Output operations and their meaning:
print(): Prints the first ten elements of every batch of data in the DStream on the driver node running the streaming application. Useful for development and debugging.
saveAsTextFiles(prefix, [suffix]): Saves this DStream's contents as text files. The file name at each batch interval is generated from prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsObjectFiles(prefix, [suffix]): Saves this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated from prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsHadoopFiles(prefix, [suffix]): Saves this DStream's contents as Hadoop files. The file name at each batch interval is generated from prefix and suffix: "prefix-TIME_IN_MS[.suffix]". Not available in the Python API.
foreachRDD(func): The most general output operator: applies the function func to each RDD generated from the stream. The function should push the data in each RDD to an external system, such as saving the RDD to files or writing it over the network to a database. Note that func is executed in the driver process running the streaming application, and will usually contain RDD actions that force the computation of the streaming RDDs.
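As a minimal usage sketch for the file-saving operators (assuming reduceByKey is the (word, count) DStream from the ReduceByKey example above, and the output path is my own choice):

    // Each batch interval is written to a directory named "D:/spark/out/wordcounts-<TIME_IN_MS>.txt"
    reduceByKey.saveAsTextFiles("D:/spark/out/wordcounts", "txt")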

1) foreachRDD(func): using it correctly and efficiently.

Here connection stands for a connection to an external system (for example, a database connection); createNewConnection() and ConnectionPool in the snippets below are placeholders for user-supplied code.

Code

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // create one connection per partition, not one per record
    val connection = createNewConnection()  // placeholder for user-supplied connection logic
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}

It is even more efficient to reuse connections from a static connection pool:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}
