Spark Structured Streaming: creating streaming DataFrames and streaming Datasets

Keywords: Spark, Kafka, JSON, SQL

Creating Streaming DataFrames and Streaming Datasets

Streaming DataFrames are created through SparkSession.readStream(), which returns a DataStreamReader (see the Scala / Java / Python docs) used to configure the source.
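For example, a streaming DataFrame over the socket source can be created like this (a minimal sketch following the standard readStream pattern; the host and port are placeholders for a test server such as `nc -lk 9999`):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("readStreamDemo")
  .getOrCreate()

// readStream returns a DataStreamReader; load() produces an unbounded (streaming) DataFrame
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost") // placeholder host
  .option("port", 9999)        // placeholder port
  .load()

println(lines.isStreaming) // true for streaming DataFrames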

Input Sources

Common built-in Sources

  • File source: Reads files written to a directory as a stream of data. Supported file formats are text, csv, json, parquet, orc, etc.
  • Kafka source (commonly used): Reads data from Kafka.
  • Socket source (for testing): Reads UTF-8 encoded text data from a socket connection.
| Source | Options | Fault-tolerant | Notes |
| --- | --- | --- | --- |
| File source | **path:** path to the input directory, common to all file formats. **maxFilesPerTrigger:** maximum number of new files to be considered in every trigger (default: no maximum). **latestFirst:** whether to process the latest new files first, useful when there is a large backlog of files (default: false). **fileNameOnly:** whether to check new files based only on the filename instead of the full path (default: false). With this set to "true", the following files would be considered the same file because their filenames are both "dataset.txt": "file:///dataset.txt", "s3://a/dataset.txt". | Yes | Glob paths are supported, but multiple comma-separated paths/globs are not. |
| Socket source | **host:** host to connect to, must be specified. **port:** port to connect to, must be specified. | No | |
| Kafka source | See the Kafka integration guide for details. | Yes | |
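For instance, the file-source options in the table map onto DataStreamReader options like this (a sketch; the schema and directory path are hypothetical, and `spark` is the SparkSession from the sketch above):

import org.apache.spark.sql.types.StructType

// a schema must be provided for file sources
val userSchema = new StructType()
  .add("id", "integer")
  .add("name", "string")

val csvStream = spark.readStream
  .format("csv")
  .schema(userSchema)
  .option("maxFilesPerTrigger", "10") // consider at most 10 new files per trigger
  .option("latestFirst", "true")      // process the newest files first when backlogged
  .load("/path/to/input/dir")         // hypothetical directory; glob paths are supported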

Some examples:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
  <version>2.4.4</version>
</dependency>

package example8

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object StructuredStreamingSource {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("StructuredNetworkWordCount")
    val sc = new SparkContext(conf)
    sc.setLogLevel("FATAL")
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._
    // =====================================CSV===========================================
    /*
    val df = spark.readStream
      .format("csv")
      //.option("sep",";")
      .schema(new StructType().add("id", "integer").add("name", "string").add("salary", "double"))
      .csv("/Users/gaozhy/data/csv")
     */
    // =====================================CSV===========================================

    // =====================================json===========================================
    /*
    val df = spark.readStream
      .format("json")
      .schema(new StructType().add("id","integer").add("name","string").add("salary","float"))
      .json("/Users/gaozhy/data/json")
    */
    // =====================================json===========================================

    //df.createOrReplaceTempView("t_user")
    /*
    spark.sql("select * from t_user")
      .writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
    */
    
    // =====================================kafka===========================================
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "spark:9092")
      .option("startingOffsets", """{"bz":{"0":-2}}""") // Specify offset consumption
      .option("subscribe", "baizhi")
      .load()

    val kafka = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "topic", "partition")
      .as[(String, String, String, Int)]

    kafka.createOrReplaceTempView("t_kafka")
//    spark.sql("select * from t_kafka")
//      .writeStream
//      .format("console")
//      .outputMode("append")
//      .start()
//      .awaitTermination()

    spark.sql("select count(*) from t_kafka group by partition")
      .writeStream
      .format("console")
      .outputMode("complete")
      .start()
      .awaitTermination()
  }
}

Output Modes

Here are several output modes:

  • Append mode (default) - Only the new rows appended to the Result Table since the last trigger are written to the sink. This is supported only for queries where rows added to the Result Table never change, so this mode guarantees that each row is output exactly once (assuming a fault-tolerant sink). For example, queries with only select, where, map, flatMap, filter, join, etc. support Append mode.
  • Complete mode - The entire Result Table is written to the sink after every trigger. This is supported for aggregation queries.
  • Update mode - (available since Spark 2.1.1) Only the rows in the Result Table that were updated since the last trigger are written to the sink.
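The mode is chosen on the DataStreamWriter when the query is started, either as a string or via the OutputMode helper; both forms appear in the larger examples below. A short sketch (the socket source and word-count aggregation here are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

val spark = SparkSession.builder().master("local[*]").appName("outputModes").getOrCreate()
import spark.implicits._

// placeholder aggregation: word counts over a test socket stream
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
val wordCounts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

// Complete mode: the whole result table is emitted after every trigger.
// Equivalent to .outputMode("complete"); Append would be rejected here
// because the query aggregates without a watermark (see the table below).
wordCounts.writeStream
  .outputMode(OutputMode.Complete())
  .format("console")
  .start()
  .awaitTermination()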

Different types of streaming queries support different output modes. The compatibility matrix is shown below.

| Query Type | Supported Output Modes | Remarks |
| --- | --- | --- |
| Queries without aggregation | Append, Update | Complete mode is not supported because it is infeasible to keep all the data in the Result Table. |
| Queries with aggregation: aggregation on event-time with watermark | Append, Update, Complete | Append mode uses the watermark to drop old aggregation state, but the output of a windowed aggregation is delayed by the late threshold specified in withWatermark(), since by the semantics of this mode rows can be added to the Result Table only once, after they are finalized (i.e. after the watermark is crossed). See the Late Data section for details. Update mode uses the watermark to drop old aggregation state. Complete mode does not drop old aggregation state because, by definition, it keeps all the data in the Result Table. |
| Queries with aggregation: other aggregations | Complete, Update | Since no watermark is defined (it is only defined in the other category), old aggregation state is not dropped. Append mode is not supported because aggregates can be updated, which violates the semantics of this mode. |
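As a concrete illustration of the watermark row, a windowed aggregation with withWatermark() can run in Append mode. A sketch using the built-in rate test source (the column names and thresholds are assumptions for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().master("local[*]").appName("watermarkDemo").getOrCreate()
import spark.implicits._

// the rate source generates rows with (timestamp: Timestamp, value: Long)
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()

val windowedCounts = events
  .withWatermark("timestamp", "10 minutes")             // tolerate events up to 10 minutes late
  .groupBy(window($"timestamp", "5 minutes"), $"value") // 5-minute tumbling windows
  .count()

windowedCounts.writeStream
  .outputMode("append") // allowed because a watermark is defined; windows are emitted once finalized
  .format("console")
  .start()
  .awaitTermination()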

Output Sinks

There are several types of built-in output sinks

  • File sink - Stores the output to a directory.
  • Kafka sink - Stores the output to one or more topics in Kafka.
  • Foreach sink - Runs arbitrary computation on the records in the output.
  • Console sink (for debugging) - Prints the output to the console/stdout every time there is a trigger.
  • Memory sink (for debugging) - Stores the output in memory as an in-memory table.
| Sink | Supported Output Modes | Options | Fault-tolerant | Notes |
| --- | --- | --- | --- | --- |
| File Sink | Append | **path:** path to the output directory, must be specified. For file-format-specific options, see the related methods in DataFrameWriter (Scala / Java / Python), e.g. for the "parquet" format see DataFrameWriter.parquet(). | Yes | Supports writes to partitioned tables. Partitioning by time may be useful. |
| Foreach Sink | Append, Update, Complete | None | Depends on the ForeachWriter implementation | More details in the next section. |
| Console Sink | Append, Update, Complete | **numRows:** number of rows to print every trigger (default: 20). **truncate:** whether to truncate the output if it is too long (default: true). | No | |
| Memory Sink | Append, Complete | None | No. However, in Complete mode a restarted query recreates the full table. | The table name is the query name. |
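The Memory sink in the last row is handy for interactive debugging: the query name becomes the name of an in-memory table that can then be queried with ordinary Spark SQL. A sketch (reusing `spark` and the placeholder `wordCounts` aggregation from the output-modes sketch above):

// write the results into an in-memory table named after the query
val query = wordCounts.writeStream
  .format("memory")
  .queryName("t_word_counts") // table name = query name
  .outputMode("complete")
  .start()

// once some data has been processed, the table can be queried interactively
spark.sql("select * from t_word_counts").show()

The complete example below exercises the File, Kafka, Foreach and Console sinks against a Kafka source:
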
package com.baizhi

import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.streaming.OutputMode
import redis.clients.jedis.Jedis

/**
  * How to write the results of a Structured Streaming computation to different output sinks
  */
object SparkStructuredStreamingForOutputSink {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("input source")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._


    //-------------------------------------------------------------------------------------------
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092")
      .option("startingOffsets", """{"bz":{"0":-2}}""") // Specify earliest consumption mode for consuming partition 0 of Baizhi top
      // Specify offset consumption In the json, -2 as an offset can be used to refer to earliest, -1 to latest.
      .option("subscribe", "bz")
      .load

    // kafka record converted to required type
    //    val ds = df
    //      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "CAST(topic AS STRING)", "CAST(partition AS INT)", "CAST(timestamp AS LONG)")
    //      .as[(String, String, String, Int, Long)] // key value topic partition Long
    //
    //    ds.createOrReplaceTempView("t_kafka")
    //
    //    val text = spark.sql("select key as k,value as v,topic as t,partition as p, timestamp as ts from t_kafka")
    //-------------------------------------------------------------------------------------------

    //    text
    //      .writeStream
    //      .format("console")
    //      .outputMode(OutputMode.Append())
    //      .start()
    //      .awaitTermination()

    //================================================File [Output mode only supports Append]==================================
    //    text
    //      .writeStream
    //      .format("json")//file format CSV JSON parquet ORC, etc.
    //      .outputMode(OutputMode.Append())
    //      Option ("checkpointLocation"),Hdfs://Spark: 9000/checkpoint1 ") //checkpoint path for failure recovery
    //      .option("path","file:///D://result") // path supports both local and HDFS file system paths
    //      .start()
    //      .awaitTermination()

    //    text
    //      .writeStream
    //      .format("csv") //file format CSV JSON parquet ORC, etc.
    //      .outputMode(OutputMode.Append())
    //      option("checkpointLocation", "hdfs://Spark:9000/checkpoint2 ") //checkpoint path for failure recovery
    //      .option("path", "file:///D://result2") // path supports both local and HDFS file system paths
    //      .start()
    //      .awaitTermination()


    //=====================================================Kafka [Output modes: Append | Update | Complete]===================
    //    val ds = df
    //      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "CAST(topic AS STRING)", "CAST(partition AS INT)", "CAST(timestamp AS LONG)")
    //      .as[(String, String, String, Int, Long)].flatMap(_._2.split(" ")).map((_, 1)) // key value topic partition Long
    //
    //    ds.createOrReplaceTempView("t_kafka")
    //
    //    val text = spark.sql("select _1 as word,count(_2) as num from t_kafka group by _1")
    //
    //    text
    //      // selectExpr("CAST(k AS STRING) as key", "CAST(v AS STRING) as value") // Define key and value information for data exported to kafka
    //      .selectExpr("CAST(word AS STRING) as key", "CAST(num AS STRING) as value")//Define key and value information for data output to kafka
    //      .writeStream
    //      .format("kafka")//file format CSV JSON parquet ORC, etc.
    //      // .outputMode(OutputMode.Append())
    //      .outputMode(OutputMode.Update())
    //      option("checkpointLocation", "hdfs://Spark:9000/checkpoint4 ") //checkpoint path for failure recovery
    //      Option ("Kafka.bootstrap.servers"," "HadoopNode01:9092, HadoopNode02:9092, HadoopNode03:9092"// Kafka cluster information"
    //      option("topic", "result")//Save topic for specified calculation results
    //      .start()
    //      .awaitTermination()

    /*
      [root@HadoopNode02 kafka_2.11-2.2.0]# bin/kafka-console-consumer.sh --topic result --bootstrap-server HadoopNode01:9092,HadoopNode02:9092,HadoopNode03:9092 --property print.key=true
      */

    //======================================================Foreach [Output modes: Append | Update | Complete]==========================
    // Output calculation results to redis
    //kafka record converted to required type

    val ds = df
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "CAST(topic AS STRING)", "CAST(partition AS INT)", "CAST(timestamp AS LONG)")
      .as[(String, String, String, Int, Long)].flatMap(_._2.split(" ")).map((_, 1)) // key value topic partition Long

    ds.createOrReplaceTempView("t_kafka")

    val text = spark.sql("select _1 as word,count(_2) as num from t_kafka group by _1")
    text
      .writeStream
      .outputMode(OutputMode.Update())
      .foreach(new ForeachWriter[Row] {
        /**
          * Open Method
          *
          * @param partitionId Partition Sequence Number
          * @param epochId Epoch (trigger) identifier
          * @return Boolean  true: open succeeded, process the rows of this partition for this epoch
          *         false: skip the rows of this partition for this epoch
          */
        override def open(partitionId: Long, epochId: Long): Boolean = true

        /**
          * Processes a single row of the result table
          *
          * @param value a Row from the result table
          */
        override def process(value: Row): Unit = {
          val word = value.getString(0)
          val count = value.getLong(1).toString

          val jedis = new Jedis("Spark", 6379)
          jedis.set(word, count)
          jedis.close()
        }

        override def close(errorOrNull: Throwable): Unit = if (errorOrNull != null) errorOrNull.printStackTrace()

      }) // Apply write out rules to each row of records in the resultTable
      .option("checkpointLocation", "hdfs://Spark:9000/checkpoint4")//checkpoint path for failure recovery
      .start()
      .awaitTermination()
  }
}
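To verify the Foreach sink output, the per-word counters written by the ForeachWriter can be read back from Redis (assuming redis-cli is available and "Spark" resolves to the Redis host used in the code; "hello" is just an example word):

redis-cli -h Spark -p 6379 get hello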
