The transformation operators covered so far cannot access an event's timestamp or the current watermark, which some applications need. For example, a map operator implemented with MapFunction cannot read the timestamp of the event it is processing or the current event time.
To fill this gap, the DataStream API provides a family of low-level operators, the process functions. They can access timestamps and watermarks, register timers, and emit special events such as timeouts. Process functions are used to build event-driven applications and to implement custom business logic that the window functions and transformation operators covered earlier cannot express. For example, Flink SQL is implemented using process functions.
Flink provides eight process functions:
- ProcessFunction
- KeyedProcessFunction
- CoProcessFunction
- ProcessJoinFunction
- BroadcastProcessFunction
- KeyedBroadcastProcessFunction
- ProcessWindowFunction
- ProcessAllWindowFunction
9.1 KeyedProcessFunction
This is one of the most commonly used process functions. As its name suggests, it is applied to a KeyedStream.
KeyedProcessFunction processes each element of a KeyedStream and emits zero, one, or more elements. All process functions implement the RichFunction interface, so they have methods such as open(), close(), and getRuntimeContext(). KeyedProcessFunction<K, I, O> also provides two additional methods:
- processElement(I value, Context ctx, Collector<O> out) is called for each element in the stream, and results are emitted through the Collector. The Context gives access to the element's timestamp, the element's key, and the TimerService. The Context can also emit results to other streams (side outputs).
- onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) is a callback that is called when a previously registered timer fires. The timestamp parameter is the time the timer was set to fire at, and the Collector receives the output results. Like the Context parameter of processElement, OnTimerContext provides context information, such as the time domain of the timer that fired (event time or processing time).
Test code
Register a timer that prints a message 5 seconds (processing time) after each element arrives.
package processfunction;

import apitest.beans.SensorReading;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/**
 * @author : Ashiamd email: ashiamd@foxmail.com
 * @date : 2021/2/3 12:30 AM
 */
public class ProcessTest1_KeyedProcessFunction {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // Read a text stream from a socket
        DataStream<String> inputStream = env.socketTextStream("localhost", 7777);

        // Convert each line to a SensorReading
        DataStream<SensorReading> dataStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SensorReading(fields[0], Long.parseLong(fields[1]), Double.parseDouble(fields[2]));
        });

        // Test KeyedProcessFunction: key the stream first, then apply the custom processing
        dataStream.keyBy("id")
                .process(new MyProcess())
                .print();

        env.execute();
    }

    // Custom process function
    public static class MyProcess extends KeyedProcessFunction<Tuple, SensorReading, Integer> {
        // Timestamp of the most recently registered timer for the current key
        ValueState<Long> tsTimerState;

        @Override
        public void open(Configuration parameters) throws Exception {
            tsTimerState = getRuntimeContext().getState(new ValueStateDescriptor<Long>("ts-timer", Long.class));
        }

        @Override
        public void processElement(SensorReading value, Context ctx, Collector<Integer> out) throws Exception {
            out.collect(value.getId().length());

            // Context
            // Timestamp of the element currently being processed, or of the firing timer
            ctx.timestamp();
            // Key of the element being processed
            ctx.getCurrentKey();
            // ctx.output();
            ctx.timerService().currentProcessingTime();
            ctx.timerService().currentWatermark();

            // Register a processing-time timer that fires 5 seconds from now,
            // and store its timestamp so that the timer can be deleted later
            long timerTs = ctx.timerService().currentProcessingTime() + 5000L;
            ctx.timerService().registerProcessingTimeTimer(timerTs);
            tsTimerState.update(timerTs);
            // ctx.timerService().registerEventTimeTimer((value.getTimestamp() + 10) * 1000L);
            // Delete the timer registered for the stored timestamp
            // ctx.timerService().deleteProcessingTimeTimer(tsTimerState.value());
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<Integer> out) throws Exception {
            System.out.println(timestamp + " Timer trigger");
            ctx.getCurrentKey();
            // ctx.output();
            ctx.timeDomain();
        }

        // close() does not clear tsTimerState: keyed state can only be accessed
        // while a key is in scope, that is, in processElement() or onTimer().
    }
}
9.2 TimerService and timers
The TimerService objects held by Context and OnTimerContext have the following methods:
- long currentProcessingTime(): returns the current processing time.
- long currentWatermark(): returns the timestamp of the current watermark.
- void registerProcessingTimeTimer(long timestamp): registers a processing-time timer for the current key. The timer fires when processing time reaches the given timestamp.
- void registerEventTimeTimer(long timestamp): registers an event-time timer for the current key. The timer fires, and the callback runs, when the watermark is greater than or equal to the timer's timestamp.
- void deleteProcessingTimeTimer(long timestamp): deletes a previously registered processing-time timer. If no timer with this timestamp exists, the call has no effect.
- void deleteEventTimeTimer(long timestamp): deletes a previously registered event-time timer. If no timer with this timestamp exists, the call has no effect.
When a timer fires, the onTimer() callback is executed. Note that timers can only be used on keyed streams.
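The firing rule for event-time timers can be traced with a small plain-Java model. This is a sketch under stated assumptions, not Flink's implementation: the class EventTimeTimerModel and its advanceWatermark method are hypothetical names, the model tracks a single key, and it keeps at most one timer per timestamp (Flink likewise keeps at most one timer per key and timestamp).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Simplified, single-key model of event-time timers. Hypothetical; not Flink code.
class EventTimeTimerModel {
    // Registered timer timestamps. A set, so registering a timestamp twice keeps one timer.
    private final TreeSet<Long> timers = new TreeSet<>();
    private long currentWatermark = Long.MIN_VALUE;

    void registerEventTimeTimer(long timestamp) {
        timers.add(timestamp);
    }

    // Deleting a timestamp that was never registered has no effect.
    void deleteEventTimeTimer(long timestamp) {
        timers.remove(timestamp);
    }

    long currentWatermark() {
        return currentWatermark;
    }

    // Advances the watermark (it never moves backwards) and returns the timestamps
    // of the timers that fire, in ascending order. A timer fires once the watermark
    // is greater than or equal to its timestamp.
    List<Long> advanceWatermark(long watermark) {
        currentWatermark = Math.max(currentWatermark, watermark);
        List<Long> fired = new ArrayList<>();
        while (!timers.isEmpty() && timers.first() <= currentWatermark) {
            fired.add(timers.pollFirst());
        }
        return fired;
    }
}
```

With timers registered at 100 and 200, advanceWatermark(150) returns only the timer at 100; the timer at 200 stays pending until the watermark reaches 200.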
Test code
The following example illustrates how KeyedProcessFunction operates on a KeyedStream.
Requirement: monitor the readings of each temperature sensor. If the temperature rises continuously for 10 seconds (processing time), emit an alarm.
- Java code
package processfunction;

import apitest.beans.SensorReading;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/**
 * @author : Ashiamd email: ashiamd@foxmail.com
 * @date : 2021/2/3 1:02 AM
 */
public class ProcessTest2_ApplicationCase {
    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Set the parallelism to 1
        env.setParallelism(1);

        // Read data from a socket
        DataStream<String> inputStream = env.socketTextStream("localhost", 7777);

        // Convert each line to a SensorReading
        DataStream<SensorReading> sensorReadingStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SensorReading(fields[0], Long.parseLong(fields[1]), Double.parseDouble(fields[2]));
        });

        // Emit an alarm if the temperature rises continuously for 10 seconds
        sensorReadingStream.keyBy(SensorReading::getId)
                .process(new TempConsIncreWarning(Time.seconds(10).toMilliseconds()))
                .print();

        env.execute();
    }

    // Emits an alarm if the temperature rises continuously for the given interval
    public static class TempConsIncreWarning extends KeyedProcessFunction<String, SensorReading, String> {

        public TempConsIncreWarning(Long interval) {
            this.interval = interval;
        }

        // Alarm interval in milliseconds (alarm if the temperature keeps rising for this long)
        private Long interval;

        // Previous temperature value
        private ValueState<Double> lastTemperature;

        // Firing time of the currently registered timer (alarm time)
        private ValueState<Long> recentTimerTimeStamp;

        @Override
        public void open(Configuration parameters) throws Exception {
            lastTemperature = getRuntimeContext().getState(new ValueStateDescriptor<Double>("lastTemperature", Double.class));
            recentTimerTimeStamp = getRuntimeContext().getState(new ValueStateDescriptor<Long>("recentTimerTimeStamp", Long.class));
        }

        // close() does not clear the state: keyed state can only be accessed
        // while a key is in scope, that is, in processElement() or onTimer().

        @Override
        public void processElement(SensorReading value, Context ctx, Collector<String> out) throws Exception {
            // Current temperature
            double curTemp = value.getTemperature();

            // Previous temperature (defaults to the current temperature if there is none)
            double lastTemp = lastTemperature.value() != null ? lastTemperature.value() : curTemp;

            // Timestamp of the registered timer, or null if no timer is registered
            Long timerTimestamp = recentTimerTimeStamp.value();

            // The temperature rose and no alarm timer is registered: register one
            if (curTemp > lastTemp && null == timerTimestamp) {
                long warningTimestamp = ctx.timerService().currentProcessingTime() + interval;
                ctx.timerService().registerProcessingTimeTimer(warningTimestamp);
                recentTimerTimeStamp.update(warningTimestamp);
            }
            // The temperature did not rise and an alarm timer is registered: delete it
            else if (curTemp <= lastTemp && timerTimestamp != null) {
                ctx.timerService().deleteProcessingTimeTimer(timerTimestamp);
                recentTimerTimeStamp.clear();
            }

            // Store the current temperature for the next comparison
            lastTemperature.update(curTemp);
        }

        // Timer callback
        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            // Emit the alarm and clear the stored timer timestamp
            out.collect("sensor " + ctx.getCurrentKey() + " Temperature value continuous " + interval + " ms rise");
            recentTimerTimeStamp.clear();
        }
    }
}
- Start a local socket and enter the data
nc -lk 7777
- Input
sensor_1,1547718199,35.8
sensor_1,1547718199,34.1
sensor_1,1547718199,34.2
sensor_1,1547718199,35.1
sensor_6,1547718201,15.4
sensor_7,1547718202,6.7
sensor_10,1547718205,38.1
sensor_10,1547718205,39
sensor_6,1547718201,18
sensor_7,1547718202,9.1
- Output
sensor sensor_1 Temperature value continuous 10000 ms rise
sensor sensor_10 Temperature value continuous 10000 ms rise
sensor sensor_6 Temperature value continuous 10000 ms rise
sensor sensor_7 Temperature value continuous 10000 ms rise
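To see why each sensor in the sample input raises exactly one alarm, the same decision logic can be replayed without Flink. The sketch below is a hypothetical plain-Java model: RisingTemperatureModel, onReading, and alarmDue are illustrative names, and the caller passes in the clock instead of relying on processing time. It mirrors the branches of processElement and onTimer in the example above.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of the rising-temperature alarm. Hypothetical; not Flink code.
class RisingTemperatureModel {
    private final long intervalMs;
    // Per-sensor state, standing in for the two ValueState fields of the example.
    private final Map<String, Double> lastTemperature = new HashMap<>();
    private final Map<String, Long> pendingTimer = new HashMap<>();

    RisingTemperatureModel(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    // Mirrors processElement: the first rise starts a timer, a non-rise cancels it.
    void onReading(String id, double temperature, long nowMs) {
        Double last = lastTemperature.get(id);
        Long timer = pendingTimer.get(id);
        if (last != null && temperature > last && timer == null) {
            pendingTimer.put(id, nowMs + intervalMs);
        } else if (last != null && temperature <= last && timer != null) {
            pendingTimer.remove(id);
        }
        lastTemperature.put(id, temperature);
    }

    // Mirrors onTimer: returns true if the sensor's timer is due at nowMs.
    // A timer that fires is removed, so it raises only one alarm.
    boolean alarmDue(String id, long nowMs) {
        Long timer = pendingTimer.get(id);
        if (timer != null && timer <= nowMs) {
            pendingTimer.remove(id);
            return true;
        }
        return false;
    }
}
```

For sensor_1, the readings 35.8, 34.1, 34.2, 35.1 start the timer at the third reading (the first rise), and the fourth reading leaves it running, so one alarm is due one interval after the third reading.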
9.3 Side outputs
- An element can belong to several windows. It is sent to the side output only when no window contains it any longer, that is, after every window that contains it has closed.
- In short, if an element ends up in the side output, every window that contains it has already closed because its allowed lateness has passed, so the late element can only go to the side output.
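The late-data rule can be checked with a small calculation. The sketch below is a plain-Java model under stated assumptions, not Flink code: LateDataModel and its methods are hypothetical names, windows are sliding event-time windows of the form [start, start + size) aligned to multiples of the slide, and a window is assumed to be closed once the watermark reaches its last timestamp (start + size - 1) plus the allowed lateness.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of sliding event-time windows. Hypothetical; not Flink code.
class LateDataModel {
    // Start timestamps of all sliding windows [start, start + size) that contain the timestamp.
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    // Assumption: a window is closed once the watermark has reached the window's
    // last timestamp (start + size - 1) plus the allowed lateness.
    static boolean isClosed(long start, long size, long allowedLateness, long watermark) {
        return start + size - 1 + allowedLateness <= watermark;
    }

    // An element goes to the side output only if every window that contains it is closed.
    static boolean goesToSideOutput(long timestamp, long size, long slide,
                                    long allowedLateness, long watermark) {
        for (long start : windowStarts(timestamp, size, slide)) {
            if (!isClosed(start, size, allowedLateness, watermark)) {
                return false;
            }
        }
        return true;
    }
}
```

With windows of size 10 sliding by 5 and an allowed lateness of 2, an element with timestamp 12 belongs to [5, 15) and [10, 20). At watermark 16 the first window is closed but the second is not, so the element is still processed normally; only from watermark 21 onward does it go to the side output.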
- Most DataStream API operators produce a single output, that is, one stream of one data type. The exception is the split operator, which divides a stream into multiple streams of the same data type.
- The side output feature of process functions can produce multiple streams, and these streams can have different data types.
- A side output is identified by an OutputTag<X> object, where X is the data type of the side output stream.
- A process function can emit an event to one or more side outputs through the Context object.
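The point that side outputs may carry different data types than the main output can be illustrated with a plain-Java model. This is only a sketch: ModelTag and SideOutputModel are hypothetical stand-ins written for this illustration, not Flink classes, and they only imitate how a typed tag ties a value type to a named output.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an output tag: an id plus the element type X.
final class ModelTag<X> {
    final String id;

    ModelTag(String id) {
        this.id = id;
    }
}

// Collects one main output of type O and any number of tagged side outputs.
class SideOutputModel<O> {
    private final List<O> main = new ArrayList<>();
    private final Map<String, List<Object>> side = new HashMap<>();

    // Emits a value to the main output.
    void collect(O value) {
        main.add(value);
    }

    // Emits a value to the side output named by the tag. The tag's type
    // parameter fixes the value type at compile time.
    <X> void output(ModelTag<X> tag, X value) {
        side.computeIfAbsent(tag.id, k -> new ArrayList<>()).add(value);
    }

    List<O> mainOutput() {
        return main;
    }

    // Returns the values emitted under the tag, or an empty list if there are none.
    @SuppressWarnings("unchecked")
    <X> List<X> getSideOutput(ModelTag<X> tag) {
        return (List<X>) (List<?>) side.getOrDefault(tag.id, new ArrayList<>());
    }
}
```

A SideOutputModel<Double> can, for example, carry temperatures on its main output while a ModelTag<String> side output carries warning messages and a ModelTag<Integer> side output carries counts.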
Test code
Scenario: readings with a temperature >= 30 go to the high-temperature stream; all other readings go to the low-temperature stream.
- Java code
package processfunction;

import apitest.beans.SensorReading;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

/**
 * @author : Ashiamd email: ashiamd@foxmail.com
 * @date : 2021/2/3 2:07 AM
 */
public class ProcessTest3_SideOuptCase {
    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Set the parallelism to 1
        env.setParallelism(1);

        // Read data from a local socket
        DataStream<String> inputStream = env.socketTextStream("localhost", 7777);

        // Convert each line to a SensorReading
        DataStream<SensorReading> dataStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SensorReading(fields[0], Long.parseLong(fields[1]), Double.parseDouble(fields[2]));
        });

        // Define an OutputTag that identifies the low-temperature side output.
        // An OutputTag must always be an anonymous inner class
        // so that Flink can derive a TypeInformation for the generic type parameter.
        OutputTag<SensorReading> lowTempTag = new OutputTag<SensorReading>("lowTemp"){};

        // Test ProcessFunction: split the stream by emitting to a side output
        SingleOutputStreamOperator<SensorReading> highTempStream = dataStream.process(new ProcessFunction<SensorReading, SensorReading>() {
            @Override
            public void processElement(SensorReading value, Context ctx, Collector<SensorReading> out) throws Exception {
                // Readings at or above 30 ℃ go to the main (high-temperature) output;
                // all others go to the low-temperature side output
                if (value.getTemperature() >= 30) {
                    out.collect(value);
                } else {
                    ctx.output(lowTempTag, value);
                }
            }
        });

        highTempStream.print("high-temp");
        highTempStream.getSideOutput(lowTempTag).print("low-temp");

        env.execute();
    }
}
- Start the local socket (nc -lk 7777, as before)
- Input
sensor_1,1547718199,35.8
sensor_6,1547718201,15.4
sensor_7,1547718202,6.7
sensor_10,1547718205,38.1
- Output
high-temp> SensorReading{id='sensor_1', timestamp=1547718199, temperature=35.8}
low-temp> SensorReading{id='sensor_6', timestamp=1547718201, temperature=15.4}
low-temp> SensorReading{id='sensor_7', timestamp=1547718202, temperature=6.7}
high-temp> SensorReading{id='sensor_10', timestamp=1547718205, temperature=38.1}
9.4 CoProcessFunction
- For operators with two input streams, the DataStream API provides the low-level CoProcessFunction. It has one method per input stream: processElement1() and processElement2().
- As in ProcessFunction, both methods receive a Context object, which gives access to the timestamp of the current element or timer, the TimerService, and side outputs.
- CoProcessFunction also provides the onTimer() callback.
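The two-method shape can be sketched in plain Java. The example below is a hypothetical model, not Flink code: SwitchableForwarder is an illustrative name, and the scenario (a control stream that switches forwarding of a data stream on and off per key) is an assumption chosen to show how the two callbacks can cooperate through shared state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a two-input function with shared per-key state.
// Hypothetical; not Flink code.
class SwitchableForwarder {
    // Per-key switch, written by the first input and read by the second.
    private final Map<String, Boolean> forwardingEnabled = new HashMap<>();
    private final List<String> output = new ArrayList<>();

    // First input (control stream): turns forwarding on or off for one key.
    void processElement1(String key, boolean enabled) {
        forwardingEnabled.put(key, enabled);
    }

    // Second input (data stream): forwards the reading only if the key's switch is on.
    void processElement2(String key, double temperature) {
        if (forwardingEnabled.getOrDefault(key, false)) {
            output.add(key + "," + temperature);
        }
    }

    List<String> output() {
        return output;
    }
}
```

Readings that arrive before their key is switched on, or after it is switched off, are dropped; only readings received while the switch is on reach the output.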