Examples of transformation operations on DStream, the core abstraction of Spark Streaming

Keywords: Big Data Spark Scala Apache socket

Transformation operations on DStream

The DStream API provides a number of methods related to transformation operations, including map(func), flatMap(func), filter(func), repartition(numPartitions), union(otherStream), count(), reduce(func), countByValue(), reduceByKey(func), join(otherStream), cogroup(otherStream), transform(func), and updateStateByKey(func).

Examples of the transform(func) and updateStateByKey(func) methods are given below:

(1) transform(func) method

The transform(func) method, together with the similar transformWith(func) method, allows an arbitrary RDD-to-RDD function to be applied to a DStream, so any RDD operation that is not exposed in the DStream API can still be used.
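For instance, a typical use of transform(func) that the DStream API alone cannot express is joining each micro-batch with a static RDD. Below is a minimal self-contained sketch, not part of the tutorial's example; the object name TransformJoinSketch, the hypothetical blacklist data, and the reuse of the same host and port are assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object TransformJoinSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TransformJoinSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val ssc = new StreamingContext(sc, Seconds(5))

    //A static reference RDD (the blacklist data here is purely hypothetical)
    val blacklistRDD = sc.parallelize(Seq(("spammer", true)))

    //Words arriving from the socket service, mapped to (word, 1) pairs
    val pairs: DStream[(String, Int)] = ssc.socketTextStream("192.168.169.200", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    //transform exposes each batch as an RDD, so it can be joined with the static RDD,
    //an operation the DStream API does not offer directly
    val cleaned: DStream[(String, Int)] = pairs.transform { rdd =>
      rdd.leftOuterJoin(blacklistRDD)
        .filter { case (_, (_, flag)) => flag.isEmpty } //drop blacklisted words
        .map { case (word, (count, _)) => (word, count) }
    }

    cleaned.print()
    ssc.start()
    ssc.awaitTermination()
  }
}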
The following example demonstrates how to use the transform(func) method to split a line of input into multiple words. The steps are as follows:
A. Execute the command nc -lk 9999 in Linux to start the server and listen on the socket service, then enter the data "I like spark streaming and Hadoop", as follows:

B. Open the IDEA development tool, create a Maven project, and configure the pom.xml file to introduce the dependencies related to Spark Streaming. The pom.xml file is configured as follows:
<!-- Set dependency version numbers -->

<properties>
    <spark.version>2.1.1</spark.version>
    <scala.version>2.11</scala.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>

Note: after configuring the pom.xml file, you need to create a scala directory under the project's /src/main and /src/test directories, respectively.
C. Create a package in the project's /src/main/scala directory, then create a Scala object named TransformTest in it. This object contains the Spark Streaming application that splits a line of input into multiple words, as follows (with comments):

package SparkStreaming

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object TransformTest {
  def main(args: Array[String]): Unit = {
    //Create SparkConf Object
    val sparkconf = new SparkConf().setAppName("TransformTest").setMaster("local[2]")
    //Create SparkContext Object
    val sc = new SparkContext(sparkconf)
    //Set Log Level
    sc.setLogLevel("WARN")
    //Creating a StreamingContext requires two parameters: the SparkContext and the batch interval
    val ssc = new StreamingContext(sc,Seconds(5))
    //Connecting to a socket service requires the socket service address, port number, and storage level (default)
    val dstream:ReceiverInputDStream[String] = ssc.socketTextStream("192.168.169.200",9999)
    //Split each line by spaces, using an RDD-to-RDD function
    val words:DStream[String] = dstream.transform(rdd => rdd.flatMap(_.split(" ")))
    //Print Output Results
    words.print()
    //Start the streaming computation
    ssc.start()
    //Wait for the computation to terminate, keeping the application running
    ssc.awaitTermination()
  }
}

D. Run the program. Within five seconds you can see that the statement "I like spark streaming and Hadoop" has been split into six words; the result is as follows:
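Note that the same splitting could also be written with flatMap directly on the DStream; transform is used in this example only to show that an arbitrary RDD-to-RDD function can be applied. A minimal equivalent sketch, reusing the dstream value from the code above:

//Equivalent to the transform(...) call above, written with the DStream API directly
val wordsDirect: DStream[String] = dstream.flatMap(_.split(" "))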

(2) updateStateByKey(func) method

The updateStateByKey(func) method maintains arbitrary state for each key and continuously updates it with new information.
The following example demonstrates how to use the updateStateByKey(func) method for word frequency statistics:
Create a package in the project's /src/main/scala directory, then create a Scala object named UpdateStateByKeyTest in it. This object contains the Spark Streaming application that implements the word frequency statistics, as follows:

package SparkStreaming

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object UpdateStateByKeyTest {

  //Update function: add the counts from the current batch (newValues) to the previous running count for the key
  def updateFunction(newValues:Seq[Int],runningCount:Option[Int]) : Option[Int] = {
    val newCount = runningCount.getOrElse(0)+newValues.sum
    Some(newCount)
  }
  def main(args: Array[String]): Unit = {
    //Create SparkConf Object
    val sparkconf = new SparkConf().setAppName("UpdateStateByKeyTest").setMaster("local[2]")
    //Create SparkContext Object
    val sc = new SparkContext(sparkconf)
    //Set Log Level
    sc.setLogLevel("WARN")
    //Creating a StreamingContext requires two parameters: the SparkContext and the batch interval
    val ssc = new StreamingContext(sc,Seconds(5))
    //Connecting to a socket service requires the socket service address, port number, and storage level (default)
    val dstream:ReceiverInputDStream[String] = ssc.socketTextStream("192.168.169.200",9999)
    //Split each line by spaces and map each word to a (word, 1) pair
    val words:DStream[(String,Int)] = dstream.flatMap(_.split(" ")).map(word => (word,1))
    //Call updateStateByKey operation
    val result:DStream[(String,Int)] = words.updateStateByKey(updateFunction)
    //When updateStateByKey is used, ssc.checkpoint("directory") must be set here, otherwise the program fails with the error:
    //"The checkpoint directory has not been set. Please set it by StreamingContext.checkpoint()."
    //Why is a checkpoint needed? The previous examples were stateless, so each batch could be discarded after use.
    //Here the previous state must be kept so that it can be updated with the new data.
    //"." means the current directory
    ssc.checkpoint(".")
    //Print Output Results
    result.print()
    //Start the streaming computation
    ssc.start()
    //Wait for the computation to terminate, keeping the application running
    ssc.awaitTermination()
  }
}
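To see how updateFunction accumulates state across batches, it can be called directly with sample arguments (a minimal sketch; the values below only illustrate the function's semantics and are not actual program output):

//First batch: the key appears twice, and there is no previous state
UpdateStateByKeyTest.updateFunction(Seq(1, 1), None)    //returns Some(2)
//Second batch: the key appears once more, and the previous state is Some(2)
UpdateStateByKeyTest.updateFunction(Seq(1), Some(2))    //returns Some(3)
//A batch in which the key does not appear keeps the old count
UpdateStateByKeyTest.updateFunction(Seq(), Some(3))     //returns Some(3)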

Then enter words through the socket service on port 9999, as follows:


From the console output you can see that the running program receives data every 5 seconds, twice in total, and each time data is received it performs the word frequency statistics and outputs the cumulative results.
