Develop a Spark Streaming word count that receives data from a server port in real time.
Environment setup
The IDEA + Maven pom.xml is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.yzztech.shushuo</groupId>
    <artifactId>myspark</artifactId>
    <version>1.0-SNAPSHOT</version>

    <repositories>
        <repository>
            <id>cloudera</id>
            <url>http://repository.cloudera.com/artifactory/cloudera-repos</url>
        </repository>
    </repositories>

    <properties>
        <spark.version>1.6.0-cdh5.14.0</spark.version>
        <scala.version>2.10</scala.version>
    </properties>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.10</artifactId>
            <version>${spark.version}</version>
            <scope>runtime</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_2.10</artifactId>
            <version>1.6.0-cdh5.14.0</version>
        </dependency>
    </dependencies>
</project>
The Spark version is 1.6 (1.6.0-cdh5.14.0); IDEA installation is skipped here.
The code is as follows:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingTest {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Create a socket stream on the target ip:port and count the
    // words in the input stream of \n-delimited text (e.g. generated by 'nc').
    // Note that a non-replicated storage level is only suitable for running locally;
    // replication is necessary in a distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
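As the comment above notes, replication is controlled by the receiver's storage level: socketTextStream defaults to StorageLevel.MEMORY_AND_DISK_SER_2, which keeps a replica. A minimal sketch of passing an explicit level, assuming the same ssc and a placeholder host/port:

import org.apache.spark.storage.StorageLevel

// Non-replicated level: only suitable when running locally.
val localLines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)

// Replicated level (the default): needed on a cluster for fault tolerance.
val replicatedLines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)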
Submitting the job
# Start the port listener before submitting the task:
nc -lk 9999

# Then submit (the program reads the host and port from its arguments):
spark-submit --class StreamingTest /home/ubuntu/lixin_test/myspark.jar localhost 9999
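For illustration, typing a couple of lines into the nc session:

hello world
hello spark

should produce per-batch output on the driver console in roughly this form (the timestamp and counts are illustrative):

-------------------------------------------
Time: 1527000000000 ms
-------------------------------------------
(hello,2)
(world,1)
(spark,1)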
Side note
The command above does not run in cluster mode by default: the driver stays on the local machine (client mode), so you can see the output in the console. That is no longer true once cluster mode is specified.
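For comparison, a sketch of an explicit client-mode submission on YARN, where the print() output appears in the submitting terminal (jar path and arguments as above):

spark-submit --class StreamingTest \
  --master yarn \
  --deploy-mode client \
  /home/ubuntu/myspark.jar \
  localhost 9999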
spark-submit --class StreamingTest \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 2 \
  --driver-cores 2 \
  --num-executors 3 \
  /home/ubuntu/myspark.jar \
  localhost 9999
With the command above you cannot see the output, even though the task is running. If you want to inspect the results, save them to HDFS instead:
wordCounts.saveAsTextFiles("/spark/spark-streaming-out")
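saveAsTextFiles takes a prefix (and an optional suffix) and writes one HDFS directory per batch, named prefix-TIME_IN_MS[.suffix]. A sketch of inspecting the results (the directory name is illustrative):

hdfs dfs -ls /spark/
# e.g. /spark/spark-streaming-out-1527000000000
hdfs dfs -cat /spark/spark-streaming-out-1527000000000/part-00000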