Spark advanced: stream processing with Structured Streaming

Keywords: Big Data, Kafka, Spark

Spark 2.0 introduced a new stream processing framework, Structured Streaming: a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. With Structured Streaming, you express a streaming computation the same way you would express a batch computation on static data (Dataset/DataFrame). As data keeps arriving, the Spark SQL engine processes it incrementally and continuously updates the final result.

Simply put, DStream is based on RDDs, while Structured Streaming is based on Dataset/DataFrame.

By default, Structured Streaming uses a micro-batch engine that processes the data stream as a series of small batch jobs, achieving end-to-end latencies as low as 100 milliseconds. Since Spark 2.3, a new low-latency processing mode called continuous processing has been available, which reduces end-to-end latency to as low as 1 millisecond with at-least-once guarantees. Developers do not need to think about whether they are writing streaming or batch code: the computation is written the same way, and Structured Streaming handles fast, scalable, fault-tolerant execution under the hood.
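Which engine runs a query is chosen through the trigger on the query; below is a minimal sketch, assuming df is a streaming Dataset like the one built in the example that follows (the intervals and console sink are illustrative):

import org.apache.spark.sql.streaming.Trigger

// default micro-batch engine: process a batch every 2 seconds
val microBatch = df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()

// continuous processing (Spark 2.3+, experimental): ~1 ms latency, at-least-once guarantees;
// only map-like operations (no aggregations) are supported in this mode
val continuous = df.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))  // "1 second" is the checkpoint interval, not a batch interval
  .start()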

1. Simple use

Add dependency:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
  <version>3.1.2</version>
</dependency>
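Note that spark-sql-kafka-0-10 only provides the Kafka source and sink; it does not pull Spark itself onto the runtime classpath. Assuming a Scala 2.12 build, you also need spark-sql at a matching version:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.1.2</version>
</dependency>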

Test code:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

/**
 * @author: ffzs
 * @Date: 2021/10/10 1:21 PM
 */
object StructuredStreaming {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    // sparkSession
    val spark = SparkSession.builder
      .appName("StructuredStreamingExample")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val values = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topictest")
      .load()
      .selectExpr("CAST(value AS STRING)")   // convert the binary value column to a string via a SQL expression
      .as[String]   // convert to Dataset[String]

    val wordCounts = values.flatMap(_.split(" ")).groupBy("value").count()  // count the words by aggregation

    val query = wordCounts.writeStream
      .outputMode("complete")  // output mode
      .format("console")   // print results to the console
      .option("checkpointLocation", "hdfs://localhost:9000/kafka-ck")  // set checkpoint location
      .start()

    query.awaitTermination()   // block until the query terminates
  }
}
  • Create a local SparkSession
  • Create a streaming DataFrame connected to Kafka; the resulting table has a single column named value
  • Convert it to a Dataset[String] via as[String]
  • Count the words with a groupBy aggregation
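
The Kafka source also accepts a few options worth knowing; a short sketch with illustrative values (the topic names, offsets and rate limit are assumptions):

val df = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topictest,topictest2")   // comma-separated list of topics
  .option("startingOffsets", "earliest")         // where to start when no checkpoint exists
  .option("maxOffsetsPerTrigger", "10000")       // cap the records consumed per micro-batch
  .load()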

Output mode

  • Complete Mode: the entire updated result table is written to external storage. How the write of the whole table is handled is up to the storage connector.
  • Append Mode: the default mode. Only the new rows appended to the result table since the last trigger are written to external storage. This applies only to queries where existing rows in the result table are never expected to change, e.g. select, where, map, flatMap, filter, join and similar operations.
  • Update Mode: only the rows updated (including additions) in the result table since the last trigger are written to external storage (available since Spark 2.1.1). Unlike Complete Mode, it outputs only the rows that changed since the last trigger. If the query contains no aggregation, it is equivalent to Append Mode (see the sketch after this list).
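
For example, switching the word count to update mode makes each trigger emit only the words whose counts changed (the checkpoint path is an assumed, illustrative value):

val query = wordCounts.writeStream
  .outputMode("update")   // emit only rows whose count changed since the last trigger
  .format("console")
  .option("checkpointLocation", "hdfs://localhost:9000/kafka-ck-update")  // assumed path
  .start()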

External storage

1. File

Write the output as files into the specified directory. Only append mode is supported; note that a streaming aggregation such as wordCounts can only be written in append mode if the query defines a watermark.

val query = wordCounts.writeStream
  .format("parquet")
  .option("path", "hdfs://localhost:9000/structuredWordCount")
  .option("checkpointLocation", "hdfs://localhost:9000/parquet-ck")  // the file sink requires a checkpoint location; path is illustrative
  .start()
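
The output consists of ordinary Parquet files, so it can be read back as a static DataFrame for inspection:

val saved = spark.read.parquet("hdfs://localhost:9000/structuredWordCount")
saved.show()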

2. Kafka

Write the computed results to a Kafka topic. The Kafka sink expects a value column (and optionally key, topic, etc.), so the result table should be shaped accordingly:

val query = wordCounts
  .selectExpr("CAST(value AS STRING) AS key", "CAST(count AS STRING) AS value")  // shape rows for the Kafka sink
  .writeStream
  .outputMode("complete")
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "topictest")
  .option("checkpointLocation", "hdfs://localhost:9000/kafka-ck1")
  .start()
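
Note that this example writes back to the same topic it reads from, which would create a feedback loop in practice; in a real pipeline you would normally write to a separate output topic.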

3. Console

Print the results to the console.

val query1 = wordCounts.writeStream
  .outputMode("complete")  // output mode; complete: output everything, append: only new rows, update: only changed rows
  .format("console")   // print results to the console
  .option("checkpointLocation", "hdfs://localhost:9000/kafka-ck")  // set checkpoint location
  .start()

4. Memory

The results are kept as an in-memory table, which is useful for debugging small amounts of data.

val query = wordCounts.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("wordCount")
  .option("checkpointLocation", "hdfs://localhost:9000/kafka-ck1")
  .start()
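
The table is registered under the name given in queryName and can then be queried with plain Spark SQL:

// query the in-memory result table registered by the memory sink
spark.sql("SELECT * FROM wordCount ORDER BY count DESC").show()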
