Spark Learning 02 - Methods to Create a DStream

Keywords: socket Spark Scala Apache

Spark Streaming provides two types of built-in streaming sources.

Basic sources: sources available directly in the StreamingContext API, for example file systems and socket connections.
Advanced sources: sources such as Kafka, Flume and Kinesis, which are available through extra utility classes.

The basic sources are covered below; for the advanced sources, refer to the official examples, e.g.:

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala
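
As a quick illustration of an advanced source, the sketch below creates a direct Kafka DStream using the spark-streaming-kafka-0-10 integration. This is only a sketch: the broker address, group id and topic name are placeholders, and ssc is assumed to be an existing StreamingContext.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Kafka consumer configuration; bootstrap.servers and group.id are placeholders
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "dstream-demo",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// Subscribe to one topic and count words per batch, as in the basic examples below
val kafkaStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Array("wordcount"), kafkaParams))
val kafkaCounts = kafkaStream.map(_.value)
  .flatMap(_.split(" "))
  .map(x => (x, 1))
  .reduceByKey(_ + _)
kafkaCounts.print()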

Method 1. Create a DStream from a socket

import java.io.PrintWriter
import java.net.ServerSocket

object GenerateChar {
  // Return the letter at the given position in A..Z as a string
  def generateContext(index : Int) : String = {
    import scala.collection.mutable.ListBuffer
    val charList = ListBuffer[Char]()
    for(i <- 65 to 90)
      charList += i.toChar
    val charArray = charList.toArray
    charArray(index).toString
  }
  // Random position among the first seven letters (A..G)
  def index = {
    import java.util.Random
    val rdm = new Random
    rdm.nextInt(7)
  }
  def main(args: Array[String]) {
    val listener = new ServerSocket(9998)
    while(true){
      val socket = listener.accept()
      new Thread(){
        override def run() = {
          println("Got client connected from: " + socket.getInetAddress)
          val out = new PrintWriter(socket.getOutputStream, true)
          while(true){
            Thread.sleep(500)
            val context = generateContext(index)  // a random letter among the first seven letters of the alphabet (A..G)
            println(context)
            out.write(context + '\n')
            out.flush()
          }
          socket.close()
        }
      }.start()
    }
  }
}

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketStreaming {
  def main(args: Array[String]) {
    // Create a local StreamingContext with two worker threads
    val conf = new SparkConf().setMaster("local[2]").setAppName("SocketStreaming")
    val sc = new StreamingContext(conf, Seconds(10))   // count the received characters every 10 seconds
    // Create a DStream that connects to master:9998
    val lines = sc.socketTextStream("master", 9998)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x , 1)).reduceByKey(_ + _)
    wordCounts.print()
    sc.start()              // start the computation
    sc.awaitTermination()   // wait for termination; it keeps running until stopped manually
  }
}
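
To try this out, start GenerateChar first so that port 9998 is listening, then start SocketStreaming; every 10-second batch prints the counts of the letters A to G received in that batch.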

Method 2. File stream

Spark Streaming can monitor a file system directory and read newly added files as a data stream.
Note that:
1. The files must all have the same format.
2. The files must be created in the monitored data directory by atomically moving or renaming them into it.
3. Once moved, the files must not be modified; data appended to a file afterwards will not be read.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreaming {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("FileStreaming")
    val sc = new StreamingContext(conf, Seconds(5))
    // Monitor the directory and read each newly added file as lines
    val lines = sc.textFileStream("/home/hadoop/wordCount")
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x , 1)).reduceByKey(_ + _)
    wordCounts.print()   // an output operation is required, otherwise start() fails
    sc.start()
    sc.awaitTermination()
  }
}
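
To feed the stream, a file can be written somewhere else first and then moved into the monitored directory atomically, so that Spark Streaming only sees it once it is complete. A minimal sketch using java.nio follows; the paths are placeholders, and an atomic move requires source and target to be on the same file system.

import java.nio.file.{Files, Paths, StandardCopyOption}

// Write the file outside the monitored directory first...
val src = Paths.get("/home/hadoop/tmp/words_0001.txt")
Files.write(src, "hello spark streaming".getBytes("UTF-8"))
// ...then move it into the monitored directory atomically so the stream picks it up as a complete file
val dst = Paths.get("/home/hadoop/wordCount/words_0001.txt")
Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE)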

Method 3. RDD queue stream

Use streamingContext.queueStream(queueOfRDDs) to create a DStream from a queue of RDDs; this is mainly useful for debugging Spark Streaming applications.
In the QueueStream example below, the program pushes one RDD into the queue every second, and Spark Streaming processes the data with a one-second batch interval.

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStream {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("queueStream")
    // Process the data with a one-second batch interval
    val ssc = new StreamingContext(conf, Seconds(1))
    // Create a queue of RDDs to push into the queue input DStream
    val rddQueue = new mutable.SynchronizedQueue[RDD[Int]]()
    // Create an input stream from the RDD queue
    val inputStream = ssc.queueStream(rddQueue)
    val mappedStream = inputStream.map(x => (x % 10, 1))
    val reduceStream = mappedStream.reduceByKey(_ + _)
    reduceStream.print()
    ssc.start()
    for(i <- 1 to 30){
      rddQueue += ssc.sparkContext.makeRDD(1 to 100, 2)   // create an RDD with 2 partitions
      Thread.sleep(1000)
    }
    ssc.stop()
  }
}
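
mutable.SynchronizedQueue is deprecated and was removed in Scala 2.13. If it is not available, a rough sketch of the same idea (same imports and StreamingContext as above) uses a plain mutable.Queue and synchronizes access to it while pushing RDDs:

val rddQueue = new mutable.Queue[RDD[Int]]()
val inputStream = ssc.queueStream(rddQueue)
inputStream.map(x => (x % 10, 1)).reduceByKey(_ + _).print()
ssc.start()
for(i <- 1 to 30){
  // guard the queue because the streaming thread dequeues from it concurrently
  rddQueue.synchronized {
    rddQueue += ssc.sparkContext.makeRDD(1 to 100, 2)
  }
  Thread.sleep(1000)
}
ssc.stop()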
