Data sources are created with StreamExecutionEnvironment.addSource(sourceFunction). Flink also provides built-in sources for convenience, such as readTextFile(path) and readFile(). You can also write a custom source in one of three ways: implement the SourceFunction interface (the source cannot run in parallel), implement the ParallelSourceFunction interface (the source can run in parallel), or extend RichParallelSourceFunction.
Introduction
Start with a simple example: building a DataStreamSourceApp.
Scala
import org.apache.flink.streaming.api.scala._

object DataStreamSourceApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    socketFunction(env)
    env.execute("DataStreamSourceApp")
  }

  // Read lines of text from a socket and print them
  def socketFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.socketTextStream("192.168.152.45", 9999)
    data.print()
  }
}
This method reads data from a socket, so first start a listening service on 192.168.152.45:
nc -lk 9999
Then run DataStreamSourceApp and type some input into the nc session:
iie4bu@swarm-manager:~$ nc -lk 9999
apache
flink
spark
The same lines are printed to the console:
3> apache
4> flink
1> spark
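The prefixes 3>, 4> and 1> identify the parallel subtask of the print sink that wrote each line, so they reflect the parallelism. To print from a single subtask, set the parallelism of the sink with setParallelism: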
data.print().setParallelism(1)
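To make the prefix concrete, here is a small plain-Java model of how such a line can be built: the 1-based index of the printing subtask followed by "> " when the sink runs with more than one subtask, and no prefix when it runs with exactly one. PrintPrefixSketch and format are hypothetical names used only for illustration; this is a sketch of the observed behavior, not Flink's own code.

```java
class PrintPrefixSketch {

    // subtaskIndex is 0-based; parallelism is the number of print subtasks.
    static String format(int subtaskIndex, int parallelism, String record) {
        if (parallelism > 1) {
            // With several subtasks, each line is tagged with its 1-based subtask number.
            return (subtaskIndex + 1) + "> " + record;
        }
        // With a single subtask there is nothing to distinguish, so no prefix.
        return record;
    }

    public static void main(String[] args) {
        System.out.println(format(2, 4, "apache")); // 3> apache
        System.out.println(format(0, 1, "apache")); // apache
    }
}
```

Under this model, data.print().setParallelism(1) would produce plain lines without any "n>" prefix.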
Java
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class JavaDataStreamSourceApp {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        socketFunction(environment);
        environment.execute("JavaDataStreamSourceApp");
    }

    // Read lines of text from a socket and print them with a single subtask
    public static void socketFunction(StreamExecutionEnvironment executionEnvironment) {
        DataStreamSource<String> data = executionEnvironment.socketTextStream("192.168.152.45", 9999);
        data.print().setParallelism(1);
    }
}
Adding custom data sources
Scala
Implementing the SourceFunction interface
A source implemented this way cannot run in parallel.
Create a new custom data source:
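import org.apache.flink.streaming.api.functions.source.SourceFunction

class CustomNonParallelSourceFunction extends SourceFunction[Long] {

  var count = 1L
  // cancel() is called from a different thread than run(), so mark the flag volatile
  @volatile var isRunning = true

  // Emit the current count once per second until the source is cancelled
  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}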
The source starts with count = 1L. The run method emits count, increments it, and sleeps for one second, repeating until the cancel method sets isRunning to false. It is invoked as follows:
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  // socketFunction(env)
  nonParallelSourceFunction(env)
  env.execute("DataStreamSourceApp")
}

def nonParallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
  val data = env.addSource(new CustomNonParallelSourceFunction())
  data.print()
}
The console keeps printing the increasing count, one value per second.
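The run/cancel pattern can be tried without a Flink cluster. The following plain-Java sketch reproduces the same loop: one thread runs it while another thread stops it through a volatile flag. CounterLoopSketch, runUntil and the pauseMillis parameter are hypothetical names introduced for this illustration, and a LongConsumer stands in for the SourceContext.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.LongConsumer;

class CounterLoopSketch {

    // volatile so that the write made by cancel() on another thread is seen by run()
    private volatile boolean isRunning = true;
    private long count = 1L;
    private final long pauseMillis;

    CounterLoopSketch(long pauseMillis) {
        this.pauseMillis = pauseMillis;
    }

    // Same shape as the source's run method: emit, increment, sleep, repeat.
    void run(LongConsumer collector) throws InterruptedException {
        while (isRunning) {
            collector.accept(count);
            count += 1;
            Thread.sleep(pauseMillis);
        }
    }

    void cancel() {
        isRunning = false;
    }

    // Runs the loop on a worker thread, cancels it from the calling thread once
    // at least `wanted` values have been emitted, and returns everything emitted.
    static List<Long> runUntil(int wanted, long pauseMillis) {
        CounterLoopSketch source = new CounterLoopSketch(pauseMillis);
        List<Long> emitted = Collections.synchronizedList(new ArrayList<>());
        Thread worker = new Thread(() -> {
            try {
                source.run(value -> emitted.add(value));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        try {
            while (emitted.size() < wanted) {
                Thread.sleep(1);
            }
            source.cancel();
            worker.join(); // returns only because the loop observed the flag
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(emitted);
    }

    public static void main(String[] args) {
        // Prints a list that starts with 1, 2, 3; it may hold a few more values.
        System.out.println(runUntil(3, 10));
    }
}
```

The flag is the only way the loop ends here, which is why it is declared volatile: without it, the running thread is not guaranteed to see the update made by cancel.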
This source only supports a parallelism of 1. If a higher value is set:
val data=env.addSource(new CustomNonParallelSourceFunction()).setParallelism(3)
Then the console reports an error:
Exception in thread "main" java.lang.IllegalArgumentException: Source: 1 is not a parallel source
    at org.apache.flink.streaming.api.datastream.DataStreamSource.setParallelism(DataStreamSource.java:55)
    at com.vincent.course05.DataStreamSourceApp$.nonParallelSourceFunction(DataStreamSourceApp.scala:16)
    at com.vincent.course05.DataStreamSourceApp$.main(DataStreamSourceApp.scala:11)
    at com.vincent.course05.DataStreamSourceApp.main(DataStreamSourceApp.scala)
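The stack trace shows that the check happens in DataStreamSource.setParallelism. A rough plain-Java model of such a check is sketched below, assuming the decision depends only on whether the function implements ParallelSourceFunction. The nested interfaces are empty stand-ins declared so the sketch compiles on its own, and ParallelCheckSketch, SourceHandle and accepts are hypothetical names, not Flink classes.

```java
class ParallelCheckSketch {

    // Empty stand-ins for the Flink interfaces of the same name (assumption for this sketch).
    interface SourceFunction {}
    interface ParallelSourceFunction extends SourceFunction {}

    static final class SourceHandle {
        private final boolean isParallel;
        private int parallelism = 1;

        SourceHandle(SourceFunction function) {
            // The source counts as parallel only if it carries the marker interface.
            this.isParallel = function instanceof ParallelSourceFunction;
        }

        SourceHandle setParallelism(int parallelism) {
            if (parallelism != 1 && !isParallel) {
                throw new IllegalArgumentException("not a parallel source");
            }
            this.parallelism = parallelism;
            return this;
        }

        int getParallelism() {
            return parallelism;
        }
    }

    // Convenience wrapper: true if the given parallelism is accepted for the function.
    static boolean accepts(SourceFunction function, int parallelism) {
        try {
            new SourceHandle(function).setParallelism(parallelism);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts(new SourceFunction() {}, 3));         // false
        System.out.println(accepts(new SourceFunction() {}, 1));         // true
        System.out.println(accepts(new ParallelSourceFunction() {}, 3)); // true
    }
}
```

In this model a value of 1 is always accepted, which matches the behavior described above: a non-parallel source fails only when asked for more than one instance.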
Implementing the ParallelSourceFunction interface
import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}

class CustomParallelSourceFunction extends ParallelSourceFunction[Long] {

  // cancel() is called from a different thread than run(), so mark the flag volatile
  @volatile var isRunning = true
  var count = 1L

  // Each parallel instance emits its own count once per second
  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}
The logic is the same as above. The main method is as follows:
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  // socketFunction(env)
  // nonParallelSourceFunction(env)
  parallelSourceFunction(env)
  env.execute("DataStreamSourceApp")
}

def parallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
  val data = env.addSource(new CustomParallelSourceFunction()).setParallelism(3)
  data.print()
}
A parallelism of 3 can now be set, and the output is as follows:
2> 1
1> 1
2> 1
2> 2
3> 2
3> 2
3> 3
4> 3
4> 3
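Every value appears three times because each of the three parallel instances keeps its own count. The plain-Java sketch below models this with one counter per instance, stepped in turn. IndependentCountersSketch and emit are hypothetical names, the subtask prefixes are left out, and a real job interleaves the instances according to scheduling rather than in strict rounds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IndependentCountersSketch {

    // Simulates `parallelism` source instances, each with a private counter
    // starting at 1, emitting one value per instance in every round.
    static List<Long> emit(int parallelism, int rounds) {
        long[] counts = new long[parallelism];
        Arrays.fill(counts, 1L);
        List<Long> out = new ArrayList<>();
        for (int round = 0; round < rounds; round++) {
            for (int instance = 0; instance < parallelism; instance++) {
                out.add(counts[instance]);
                counts[instance] += 1; // only this instance's counter advances
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(emit(3, 3)); // [1, 1, 1, 2, 2, 2, 3, 3, 3]
    }
}
```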
Extending RichParallelSourceFunction
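import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}

class CustomRichParallelSourceFunction extends RichParallelSourceFunction[Long] {

  // cancel() is called from a different thread than run(), so mark the flag volatile
  @volatile var isRunning = true
  var count = 1L

  // Each parallel instance emits its own count once per second
  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}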
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  // socketFunction(env)
  // nonParallelSourceFunction(env)
  // parallelSourceFunction(env)
  richParallelSourceFunction(env)
  env.execute("DataStreamSourceApp")
}

def richParallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
  val data = env.addSource(new CustomRichParallelSourceFunction()).setParallelism(3)
  data.print()
}
Java
Implementing the SourceFunction interface
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class JavaCustomNonParallelSourceFunction implements SourceFunction<Long> {

    // cancel() is called from a different thread than run(), so the flag is volatile
    volatile boolean isRunning = true;
    long count = 1;

    // Emit the current count once per second until the source is cancelled
    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (isRunning) {
            ctx.collect(count);
            count += 1;
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    // socketFunction(environment);
    nonParallelSourceFunction(environment);
    environment.execute("JavaDataStreamSourceApp");
}

public static void nonParallelSourceFunction(StreamExecutionEnvironment executionEnvironment) {
    DataStreamSource<Long> data = executionEnvironment.addSource(new JavaCustomNonParallelSourceFunction());
    data.print().setParallelism(1);
}
When the parallelism of this source is set to a value other than 1:
DataStreamSource<Long> data = executionEnvironment.addSource(new JavaCustomNonParallelSourceFunction()).setParallelism(2);
Then an exception is thrown:
Exception in thread "main" java.lang.IllegalArgumentException: Source: 1 is not a parallel source
    at org.apache.flink.streaming.api.datastream.DataStreamSource.setParallelism(DataStreamSource.java:55)
    at com.vincent.course05.JavaDataStreamSourceApp.nonParallelSourceFunction(JavaDataStreamSourceApp.java:16)
    at com.vincent.course05.JavaDataStreamSourceApp.main(JavaDataStreamSourceApp.java:10)
Implementing the ParallelSourceFunction interface
import org.apache.flink.streaming.api.functions.source.ParallelSourceFunction;

public class JavaCustomParallelSourceFunction implements ParallelSourceFunction<Long> {

    // cancel() is called from a different thread than run(), so the flag is volatile
    volatile boolean isRunning = true;
    long count = 1;

    // Each parallel instance emits its own count once per second
    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (isRunning) {
            ctx.collect(count);
            count += 1;
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    // socketFunction(environment);
    // nonParallelSourceFunction(environment);
    parallelSourceFunction(environment);
    environment.execute("JavaDataStreamSourceApp");
}

public static void parallelSourceFunction(StreamExecutionEnvironment executionEnvironment) {
    DataStreamSource<Long> data = executionEnvironment.addSource(new JavaCustomParallelSourceFunction()).setParallelism(2);
    data.print().setParallelism(1);
}
Parallelism can now be set, and the output is:
1
1
2
2
3
3
4
4
5
5
Extending the abstract class RichParallelSourceFunction
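import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

public class JavaCustomRichParallelSourceFunction extends RichParallelSourceFunction<Long> {

    // cancel() is called from a different thread than run(), so the flag is volatile
    volatile boolean isRunning = true;
    long count = 1;

    // Each parallel instance emits its own count once per second
    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (isRunning) {
            ctx.collect(count);
            count += 1;
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}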
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    // socketFunction(environment);
    // nonParallelSourceFunction(environment);
    // parallelSourceFunction(environment);
    richParallelSourceFunction(environment);
    environment.execute("JavaDataStreamSourceApp");
}

public static void richParallelSourceFunction(StreamExecutionEnvironment executionEnvironment) {
    DataStreamSource<Long> data = executionEnvironment.addSource(new JavaCustomRichParallelSourceFunction()).setParallelism(2);
    data.print().setParallelism(1);
}
Output results:
1
1
2
2
3
3
4
4
5
5
6
6