Apache Flink from Scratch: Getting Started with Flink DataStream Programming

Keywords: Programming Java Apache Scala Spark

Data sources are created with StreamExecutionEnvironment.addSource(sourceFunction). Flink also provides built-in sources for convenience, such as readTextFile(path) and readFile(). You can also write a custom source: implement the SourceFunction interface for a source that cannot run in parallel, or implement the ParallelSourceFunction interface (or extend the RichParallelSourceFunction class) for a source that can.

Introduction

Let's start with a simple example by building DataStreamSourceApp.

Scala

import org.apache.flink.streaming.api.scala._

object DataStreamSourceApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    socketFunction(env)
    // The job only starts when execute() is called
    env.execute("DataStreamSourceApp")
  }

  def socketFunction(env: StreamExecutionEnvironment): Unit = {
    // Read lines of text from the socket and print them
    val data = env.socketTextStream("192.168.152.45", 9999)
    data.print()
  }
}

This method reads data from a socket, so first start a listener on 192.168.152.45:

nc -lk 9999

Then run DataStreamSourceApp and type some lines into the nc session:

iie4bu@swarm-manager:~$ nc -lk 9999
apache
flink
spark

The same lines are printed to the application console:

3> apache
4> flink
1> spark

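The prefixes 3>, 4>, and 1> identify which parallel subtask of the print sink wrote each line. To write all output through a single subtask, set the parallelism of the sink with setParallelism: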

data.print().setParallelism(1)
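As a rough model of where these prefixes come from, the sketch below mimics the labeling in plain Java with no Flink dependency. The class PrintPrefixSketch and its label method are hypothetical names used for illustration, not Flink APIs. It assumes the print sink prepends the 1-based subtask index only when it runs with more than one parallel subtask.

```java
public class PrintPrefixSketch {

    // Hypothetical helper: labels a record the way the console output above is labeled.
    // subtaskIndex is 0-based; the printed prefix is 1-based.
    public static String label(int subtaskIndex, int parallelism, String record) {
        if (parallelism > 1) {
            return (subtaskIndex + 1) + "> " + record;
        }
        // With a single subtask there is nothing to distinguish, so no prefix is added.
        return record;
    }

    public static void main(String[] args) {
        System.out.println(label(2, 4, "apache")); // prints "3> apache"
        System.out.println(label(0, 1, "apache")); // prints "apache"
    }
}
```

This is why the prefixes disappear once the sink parallelism is set to 1.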

Java

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JavaDataStreamSourceApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        socketFunction(environment);
        environment.execute("JavaDataStreamSourceApp");
    }

    public static void socketFunction(StreamExecutionEnvironment executionEnvironment) {
        // Read lines of text from the socket and print them through a single sink subtask
        DataStreamSource<String> data = executionEnvironment.socketTextStream("192.168.152.45", 9999);
        data.print().setParallelism(1);
    }
}

Adding custom data sources

Scala

Implementing the SourceFunction interface

A source implemented this way cannot run in parallel.

Create a new custom data source:

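import org.apache.flink.streaming.api.functions.source.SourceFunction

class CustomNonParallelSourceFunction extends SourceFunction[Long] {

  var count = 1L
  // Written by cancel() from another thread, so mark it volatile
  @volatile var isRunning = true

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning) {
      // Emit the current value, then wait one second
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}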

This source starts with count = 1L. Each iteration of the run method emits count, increments it, and sleeps for one second; the loop ends once the cancel method is called. It is invoked as follows:

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //    socketFunction(env)
    nonParallelSourceFunction(env)
    env.execute("DataStreamSourceApp")
  }

  def nonParallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
    val data=env.addSource(new CustomNonParallelSourceFunction())
    data.print()
  }

The console prints a steadily increasing count, one value per second.
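The run/cancel pair is a cooperative cancellation flag: run loops while the flag is set, and cancel clears it from another thread. The sketch below shows the same pattern in plain Java with no Flink dependency. CancelFlagSketch, CountingLoop, and emitThenCancel are hypothetical names for illustration, and a java.util.function.Consumer stands in for the SourceContext. The flag is declared volatile on the assumption that cancel is called from a different thread than run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

public class CancelFlagSketch {

    // Hypothetical stand-in for a source: emits 1, 2, 3, ... until cancelled.
    static class CountingLoop {
        // volatile so the write from the cancelling thread is visible to the emitting thread
        private volatile boolean isRunning = true;
        private long count = 1L;

        void run(Consumer<Long> out, long pauseMillis) throws InterruptedException {
            while (isRunning) {
                out.accept(count);
                count += 1;
                Thread.sleep(pauseMillis);
            }
        }

        void cancel() {
            isRunning = false;
        }
    }

    // Runs the loop on a worker thread, cancels it once at least `atLeast`
    // values have been emitted, and returns everything that was emitted.
    public static List<Long> emitThenCancel(int atLeast) {
        List<Long> seen = new CopyOnWriteArrayList<>();
        CountingLoop loop = new CountingLoop();
        Thread worker = new Thread(() -> {
            try {
                loop.run(seen::add, 10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        try {
            while (seen.size() < atLeast) {
                Thread.sleep(5);
            }
            loop.cancel(); // clears the flag; the loop exits after its current iteration
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        System.out.println(emitThenCancel(3));
    }
}
```

The loop is never interrupted from outside; it stops because it checks the flag itself, which is why cancel only needs to flip one field.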

Because a plain SourceFunction is not a parallel source, its parallelism cannot be set to anything other than 1. For example:

val data=env.addSource(new CustomNonParallelSourceFunction()).setParallelism(3)

Then the console reports an error:

Exception in thread "main" java.lang.IllegalArgumentException: Source: 1 is not a parallel source
	at org.apache.flink.streaming.api.datastream.DataStreamSource.setParallelism(DataStreamSource.java:55)
	at com.vincent.course05.DataStreamSourceApp$.nonParallelSourceFunction(DataStreamSourceApp.scala:16)
	at com.vincent.course05.DataStreamSourceApp$.main(DataStreamSourceApp.scala:11)
	at com.vincent.course05.DataStreamSourceApp.main(DataStreamSourceApp.scala)

Implementing the ParallelSourceFunction interface

import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}

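class CustomParallelSourceFunction extends ParallelSourceFunction[Long] {

  // Written by cancel() from another thread, so mark it volatile
  @volatile var isRunning = true
  // Each parallel instance keeps its own counter
  var count = 1L

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {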
    while(isRunning){
      ctx.collect(count)
      count+=1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning=false
  }
}

The logic is the same as in the previous source. The main method is as follows:

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //    socketFunction(env)
    //    nonParallelSourceFunction(env)
    parallelSourceFunction(env)

    env.execute("DataStreamSourceApp")
  }

  def parallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
    val data=env.addSource(new CustomParallelSourceFunction()).setParallelism(3)
    data.print()
  }

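This time a parallelism of 3 is accepted. Each of the three source instances keeps its own counter, so every value appears three times (the n> prefix is the print sink subtask that wrote the line):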

2> 1
1> 1
2> 1
2> 2
3> 2
3> 2
3> 3
4> 3
4> 3
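The repeated values follow from each parallel instance holding its own copy of count. The plain-Java sketch below simulates this without Flink. PerInstanceCounterSketch, Counter, and simulate are hypothetical names for illustration, and the instances are stepped in a fixed round-robin order, whereas real subtasks run concurrently and their output may interleave differently.

```java
import java.util.ArrayList;
import java.util.List;

public class PerInstanceCounterSketch {

    // Hypothetical stand-in for one parallel source instance with its own counter.
    static class Counter {
        private long count = 1L;

        long next() {
            return count++;
        }
    }

    // Simulates `parallelism` independent instances, each emitting `rounds` values.
    public static List<Long> simulate(int parallelism, int rounds) {
        List<Counter> instances = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            instances.add(new Counter());
        }
        List<Long> emitted = new ArrayList<>();
        for (int r = 0; r < rounds; r++) {
            for (Counter c : instances) {
                emitted.add(c.next());
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(simulate(3, 3)); // prints [1, 1, 1, 2, 2, 2, 3, 3, 3]
    }
}
```

With a parallelism of 1 the same code yields each value once, which matches the output of the non-parallel source.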

Extending the RichParallelSourceFunction class

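import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}

class CustomRichParallelSourceFunction extends RichParallelSourceFunction[Long] {
  // Written by cancel() from another thread, so mark it volatile
  @volatile var isRunning = true
  var count = 1L

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}

The main method is as follows:

  def main(args: Array[String]): Unit = {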
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //    socketFunction(env)
    //    nonParallelSourceFunction(env)
    //    parallelSourceFunction(env)
    richParallelSourceFunction(env)

    env.execute("DataStreamSourceApp")
  }

  def richParallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomRichParallelSourceFunction()).setParallelism(3)
    data.print()
  }

Java

Implementing the SourceFunction interface

import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class JavaCustomNonParallelSourceFunction implements SourceFunction<Long> {

    // volatile: cancel() is called from a different thread than run()
    private volatile boolean isRunning = true;
    private long count = 1;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (isRunning) {
            // Emit the current value, then wait one second
            ctx.collect(count);
            count += 1;
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}

The main method is as follows:

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
//        socketFunction(environment);
        nonParallelSourceFunction(environment);
        environment.execute("JavaDataStreamSourceApp");

    }

    public static void nonParallelSourceFunction(StreamExecutionEnvironment executionEnvironment){
        DataStreamSource<Long> data = executionEnvironment.addSource(new JavaCustomNonParallelSourceFunction());
        data.print().setParallelism(1);
    }

If this source is given a parallelism greater than 1:

        DataStreamSource<Long> data = executionEnvironment.addSource(new JavaCustomNonParallelSourceFunction()).setParallelism(2);

an exception is thrown:

Exception in thread "main" java.lang.IllegalArgumentException: Source: 1 is not a parallel source
	at org.apache.flink.streaming.api.datastream.DataStreamSource.setParallelism(DataStreamSource.java:55)
	at com.vincent.course05.JavaDataStreamSourceApp.nonParallelSourceFunction(JavaDataStreamSourceApp.java:16)
	at com.vincent.course05.JavaDataStreamSourceApp.main(JavaDataStreamSourceApp.java:10)

Implementing the ParallelSourceFunction interface

import org.apache.flink.streaming.api.functions.source.ParallelSourceFunction;

public class JavaCustomParallelSourceFunction implements ParallelSourceFunction<Long> {

    // volatile: cancel() is called from a different thread than run()
    private volatile boolean isRunning = true;
    // Each parallel instance keeps its own counter
    private long count = 1;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (isRunning) {
            ctx.collect(count);
            count += 1;
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}

The main method is as follows:

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
//        socketFunction(environment);
//        nonParallelSourceFunction(environment);
        parallelSourceFunction(environment);

        environment.execute("JavaDataStreamSourceApp");
    }

    public static void parallelSourceFunction(StreamExecutionEnvironment executionEnvironment){
        DataStreamSource<Long> data = executionEnvironment.addSource(new JavaCustomParallelSourceFunction()).setParallelism(2);
        data.print().setParallelism(1);
    }

Here the parallelism of 2 is accepted. Each of the two source instances keeps its own counter, so every value is printed twice:

1
1
2
2
3
3
4
4
5
5

Extending the abstract class RichParallelSourceFunction

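import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

public class JavaCustomRichParallelSourceFunction extends RichParallelSourceFunction<Long> {

    // volatile: cancel() is called from a different thread than run()
    private volatile boolean isRunning = true;
    private long count = 1;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (isRunning) {
            ctx.collect(count);
            count += 1;
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}

The main method is as follows:

    public static void main(String[] args) throws Exception {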
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
//        socketFunction(environment);
//        nonParallelSourceFunction(environment);
//        parallelSourceFunction(environment);
        richParallelSourceFunction(environment);
        environment.execute("JavaDataStreamSourceApp");
    }

    public static void richParallelSourceFunction(StreamExecutionEnvironment executionEnvironment){
        DataStreamSource<Long> data = executionEnvironment.addSource(new JavaCustomRichParallelSourceFunction()).setParallelism(2);
        data.print().setParallelism(1);
    }

Output results:

1
1
2
2
3
3
4
4
5
5
6
6

Relationships between SourceFunction, ParallelSourceFunction, and RichParallelSourceFunction
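In Flink, ParallelSourceFunction is an interface that extends SourceFunction without adding methods, and RichParallelSourceFunction is an abstract class that implements ParallelSourceFunction and extends AbstractRichFunction, which contributes the open and close lifecycle methods. The sketch below mirrors that shape in plain Java. Every type whose name starts with Sketch, along with OneShot, PlainOnly, and isParallel, is a hypothetical stand-in with reduced signatures, not the real Flink type. The instanceof check is an assumption about how a source might be classified as parallel, consistent with the "is not a parallel source" error shown earlier.

```java
import java.util.function.Consumer;

public class SourceHierarchySketch {

    // Stand-in for SourceFunction: the run/cancel contract.
    interface SketchSource<T> {
        void run(Consumer<T> out) throws Exception;
        void cancel();
    }

    // Stand-in for ParallelSourceFunction: a marker interface with no extra methods.
    interface SketchParallelSource<T> extends SketchSource<T> {}

    // Stand-in for RichParallelSourceFunction: an abstract class that adds lifecycle hooks.
    abstract static class SketchRichParallelSource<T> implements SketchParallelSource<T> {
        void open() {}
        void close() {}
    }

    // A rich parallel source that emits a single value.
    public static class OneShot extends SketchRichParallelSource<Long> {
        private volatile boolean isRunning = true;

        @Override
        public void run(Consumer<Long> out) {
            if (isRunning) {
                out.accept(1L);
            }
        }

        @Override
        public void cancel() {
            isRunning = false;
        }
    }

    // A source that implements only the base contract.
    public static class PlainOnly implements SketchSource<Long> {
        @Override
        public void run(Consumer<Long> out) {
            out.accept(1L);
        }

        @Override
        public void cancel() {}
    }

    // A source counts as parallel only if it carries the marker interface.
    public static boolean isParallel(Object source) {
        return source instanceof SketchParallelSource;
    }

    public static void main(String[] args) {
        System.out.println(isParallel(new OneShot()));   // prints true
        System.out.println(isParallel(new PlainOnly())); // prints false
    }
}
```

In practice this means the choice comes down to two questions: whether the source may run with a parallelism above 1, and whether it needs lifecycle hooks.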

Posted by McInfo on Tue, 10 Sep 2019 02:57:29 -0700