Spark Streaming sample program

Keywords: Big Data, Spark, Apache, Maven, Scala

Develop a Spark Streaming program that receives data from a server port in real time and performs a word count.

Environment setup

The project is built in IDEA with Maven; the pom file is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.yzztech.shushuo</groupId>
    <artifactId>myspark</artifactId>
    <version>1.0-SNAPSHOT</version>


    <repositories>
        <repository>
            <id>cloudera</id>
            <url>http://repository.cloudera.com/artifactory/cloudera-repos</url>
        </repository>
    </repositories>

    <properties>
        <spark.version>1.6.0-cdh5.14.0</spark.version>
        <scala.version>2.10</scala.version>
    </properties>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_${scala.version}</artifactId>
            <version>${spark.version}</version>
            <scope>runtime</scope>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>
</project>
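
Note that the pom above only declares dependencies; for Maven to actually compile Scala sources, a Scala compiler plugin is also needed inside <project>. A minimal sketch using scala-maven-plugin (the plugin version below is an assumption; pick one compatible with Scala 2.10):

    <build>
        <plugins>
            <!-- Compiles src/main/scala and src/test/scala; version 3.2.2 is an assumed example -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>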

The Spark version is 1.6 (CDH 5.14.0); IDEA installation is skipped here.

The code is as follows:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingTest {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: StreamingTest <hostname> <port>")
      sys.exit(1)
    }

    // Reduce log noise so the batch output is easy to read
    Logger.getLogger("org").setLevel(Level.ERROR)

    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5)) // 5-second batches

    // Create a socket stream on the target ip:port and count the
    // words in the input stream of \n-delimited text (e.g. generated by 'nc').
    // A storage level without replication is fine only when running locally;
    // replication is needed in a distributed setup for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

    // Print the first few counts of each batch on the driver
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
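
The comment about replication refers to the receiver's storage level, which the code above leaves at its default. socketTextStream accepts a StorageLevel as a third argument (the default is MEMORY_AND_DISK_SER_2, serialized with one replica); to set it explicitly, the corresponding line can be replaced with:

import org.apache.spark.storage.StorageLevel

// Serialized, spills to disk if needed, one replica for fault tolerance (the default).
// For a purely local test run, StorageLevel.MEMORY_ONLY would also work.
val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER_2)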

Submitting the job

spark-submit --class StreamingTest /home/ubuntu/lixin_test/myspark.jar localhost 9999

# Start the port listener before submitting the job
nc -lk 9999
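
Type lines of text into the nc session; every 5-second batch the driver console prints the counts for that batch (for example, the input line "hello hello world" yields (hello,2) and (world,1)).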

Side note

The command above defaults to client mode, not cluster mode: the driver runs locally, so you can see the output on your console. If cluster mode is specified, you cannot.

spark-submit --class StreamingTest \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 1g \
    --executor-memory 1g \
    --executor-cores 2 \
    --driver-cores 2 \
    --num-executors 3 \
    /home/ubuntu/myspark.jar \
    localhost 9999
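
One caveat: the socket receiver runs on one of the executors, so the localhost argument above resolves on that cluster node rather than on the machine where nc was started; on a real cluster, pass the actual hostname of the machine running nc.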

You cannot see the output when submitting with the command above, but the job is running; if you want to inspect the results, save them to HDFS:

wordCounts.saveAsTextFiles("/spark/spark-streaming-out")
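
saveAsTextFiles writes a new output directory per batch, named from the given prefix plus the batch timestamp in milliseconds, with an optional suffix. A minimal sketch (the path prefix here is an illustrative assumption; any writable HDFS path works):

// One directory per 5-second batch: /spark/spark-streaming-out/wc-<timestamp>.txt
wordCounts.saveAsTextFiles("/spark/spark-streaming-out/wc", "txt")

Alternatively, in YARN cluster mode the driver's stdout (including the wordCounts.print() output) lands in the application's container logs, which can be fetched with yarn logs -applicationId <appId>.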
