Spark Streaming provides two categories of built-in streaming sources.
Basic sources: sources available directly in the StreamingContext API, for example file systems and socket connections.
Advanced sources: Kafka, Flume, Kinesis, and other sources that are accessed through additional utility classes and require extra dependencies.
The basic sources are covered below. For advanced sources, refer to the official examples, for instance:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala
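As a rough sketch of what using such a utility class looks like (assuming the spark-streaming-kafka-0-10 integration is on the classpath; the broker address, group id, and topic name below are placeholders, not from the original example), a direct Kafka stream is created through KafkaUtils rather than through the StreamingContext API itself:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaStreamSketch {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Consumer configuration; broker address, group id and topic are placeholder values
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    // The DStream comes from the KafkaUtils utility class, not from the StreamingContext API
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Array("example-topic"), kafkaParams))
    val wordCounts = stream.map(_.value).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}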
Method 1. Socket stream
import java.io.PrintWriter
import java.net.ServerSocket

object GenerateChar {
  // Return the letter at the given position in A..Z
  def generateContext(index: Int): String = {
    import scala.collection.mutable.ListBuffer
    val charList = ListBuffer[Char]()
    for (i <- 65 to 90) charList += i.toChar
    val charArray = charList.toArray
    charArray(index).toString
  }

  // Random index into the first seven letters of the alphabet (A to G)
  def index = {
    import java.util.Random
    val rdm = new Random
    rdm.nextInt(7)
  }

  def main(args: Array[String]) {
    val listener = new ServerSocket(9998)
    while (true) {
      val socket = listener.accept()
      new Thread() {
        override def run() = {
          println("Got client connected from: " + socket.getInetAddress)
          val out = new PrintWriter(socket.getOutputStream, true)
          while (true) {
            Thread.sleep(500)
            // The generated characters are random letters among the first seven of the alphabet
            val context = generateContext(index)
            println(context)
            out.write(context + '\n')
            out.flush()
          }
          socket.close()
        }
      }.start()
    }
  }
}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketStreaming {
  def main(args: Array[String]) {
    // Create a local StreamingContext with two worker threads
    val conf = new SparkConf().setMaster("local[2]").setAppName("SocketStreaming")
    // Count the words received every 10 seconds
    val sc = new StreamingContext(conf, Seconds(10))
    // Create a DStream that connects to master:9998
    val lines = sc.socketTextStream("master", 9998)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    sc.start()            // Start the computation
    sc.awaitTermination() // Wait for manual termination; otherwise it keeps running
  }
}
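To try this out, start GenerateChar first so that port 9998 is listening, then submit SocketStreaming (adjusting the hostname "master" to wherever the generator runs). If you prefer typing the input by hand, the netcat utility (nc -lk 9998) can stand in for the generator.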
Method 2. File stream
Spark Streaming monitors a directory on the file system and reads any newly added files as a data stream.
It should be noted that:
1. The files must all have the same data format.
2. The files must be created in the data directory by atomically moving or renaming them into it.
3. Once moved, the files must not be modified; data appended to a file afterwards will not be read.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreaming {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("FileStreaming")
    val sc = new StreamingContext(conf, Seconds(5))
    // Monitor the directory for newly added files
    val lines = sc.textFileStream("/home/hadoop/wordCount")
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    // An output operation is required, otherwise the job has nothing to execute
    wordCounts.print()
    sc.start()
    sc.awaitTermination()
  }
}
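To feed this example, place files into /home/hadoop/wordCount atomically, for example by writing them elsewhere on the same file system and then moving them into the directory; per the rules above, files that are copied in slowly or appended to afterwards will not be picked up.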
Method 3. RDD queue stream
Use streamingContext.queueStream(queueOfRDDs) to create a DStream backed by a queue of RDDs; this is mainly useful for testing and debugging Spark Streaming applications.
In the QueueStream example below, the program pushes a new RDD into the queue every second, and Spark Streaming processes the queued data in one-second batches.
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStream {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("queueStream")
    // Process the queued data in one-second batches
    val ssc = new StreamingContext(conf, Seconds(1))
    // The queue of RDDs that will be pushed into the queue input DStream
    val rddQueue = new mutable.SynchronizedQueue[RDD[Int]]()
    // Create an input DStream backed by the RDD queue
    val inputStream = ssc.queueStream(rddQueue)
    val mappedStream = inputStream.map(x => (x % 10, 1))
    val reduceStream = mappedStream.reduceByKey(_ + _)
    reduceStream.print()
    ssc.start()
    for (i <- 1 to 30) {
      // Create an RDD with two partitions and push it into the queue once per second
      rddQueue += ssc.sparkContext.makeRDD(1 to 100, 2)
      Thread.sleep(1000)
    }
    ssc.stop()
  }
}
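A side note on the queue type: scala.collection.mutable.SynchronizedQueue is deprecated in recent Scala versions. A plain mutable.Queue with the enqueue wrapped in rddQueue.synchronized { ... } achieves the same effect, and the official QueueStream example takes essentially that approach.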