ProcessFunction API (low-level API)

Keywords: flink

The transformation operators covered so far cannot access an event's timestamp or the current watermark, which some applications depend on. For example, a map operator implemented with MapFunction cannot read the timestamp (event time) of the element it is processing.

For this reason, the DataStream API provides a family of low-level operators, the process functions. They can access timestamps and watermarks, register timers, and emit specific events such as timeout events. Process functions are used to build event-driven applications and to implement custom business logic that the window functions and transformation operators covered earlier cannot express. For example, Flink SQL is implemented using process functions.

Flink provides eight process functions:

  • ProcessFunction
  • KeyedProcessFunction
  • CoProcessFunction
  • ProcessJoinFunction
  • BroadcastProcessFunction
  • KeyedBroadcastProcessFunction
  • ProcessWindowFunction
  • ProcessAllWindowFunction

9.1 KeyedProcessFunction

This is one of the most commonly used process functions. As its name suggests, it operates on a KeyedStream.

KeyedProcessFunction processes each element of the stream and emits zero, one, or more elements. All process functions implement the RichFunction interface, so they have methods such as open(), close(), and getRuntimeContext(). KeyedProcessFunction<K, I, O> additionally provides two methods:

  • processElement(I value, Context ctx, Collector<O> out) is called for every element in the stream; results are emitted through the Collector. The Context gives access to the element's timestamp, the element's key, and the TimerService. The Context can also emit records to other streams (side outputs).
  • onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) is a callback that is invoked when a previously registered timer fires. The timestamp parameter is the time the timer was set for, and the Collector emits the results. OnTimerContext offers the same services as the Context of processElement, plus the time domain of the timer that fired (event time or processing time).

Test code

Register a timer that prints a message 5 seconds (processing time) after each element arrives.

package processfunction;

import apitest.beans.SensorReading;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/**
 * @author : Ashiamd email: ashiamd@foxmail.com
 * @date : 2021/2/3 12:30 AM
 */
public class ProcessTest1_KeyedProcessFunction {
  public static void main(String[] args) throws Exception{
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);

    // socket text stream
    DataStream<String> inputStream = env.socketTextStream("localhost", 7777);

    // Convert to SensorReading type
    DataStream<SensorReading> dataStream = inputStream.map(line -> {
      String[] fields = line.split(",");
      return new SensorReading(fields[0], Long.parseLong(fields[1]), Double.parseDouble(fields[2]));
    });

    // Test KeyedProcessFunction, group first, and then customize processing
    dataStream.keyBy("id")
      .process( new MyProcess() )
      .print();

    env.execute();
  }

  // Implement custom processing functions
  public static class MyProcess extends KeyedProcessFunction<Tuple, SensorReading, Integer> {
    ValueState<Long> tsTimerState;

    @Override
    public void open(Configuration parameters) throws Exception {
      tsTimerState =  getRuntimeContext().getState(new ValueStateDescriptor<Long>("ts-timer", Long.class));
    }

    @Override
    public void processElement(SensorReading value, Context ctx, Collector<Integer> out) throws Exception {
      out.collect(value.getId().length());

      // context
      // Timestamp of the element currently being processed or timestamp of a firing timer.
      ctx.timestamp();
      // Get key of the element being processed.
      ctx.getCurrentKey();
      //            ctx.output();
      ctx.timerService().currentProcessingTime();
      ctx.timerService().currentWatermark();
      // Register a processing-time timer that fires 5 seconds from now
      long timerTs = ctx.timerService().currentProcessingTime() + 5000L;
      ctx.timerService().registerProcessingTimeTimer(timerTs);
      // Remember the timer's timestamp so the same timer can be deleted later
      tsTimerState.update(timerTs);
      //            ctx.timerService().registerEventTimeTimer((value.getTimestamp() + 10) * 1000L);
      // Deletes the timer triggered at the specified time
      //            ctx.timerService().deleteProcessingTimeTimer(tsTimerState.value());
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Integer> out) throws Exception {
      System.out.println(timestamp + " Timer trigger");
      ctx.getCurrentKey();
      //            ctx.output();
      ctx.timeDomain();
    }

    @Override
    public void close() throws Exception {
      tsTimerState.clear();
    }
  }
}

9.2 TimerService and timers

The TimerService objects held by Context and OnTimerContext have the following methods:

  • long currentProcessingTime()   Returns the current processing time

  • long currentWatermark()   Returns the timestamp of the current watermark

  • void registerProcessingTimeTimer(long timestamp)   Registers a processing-time timer for the current key. The timer fires when the processing time reaches the given timestamp.

  • void registerEventTimeTimer(long timestamp)   Registers an event-time timer for the current key. The timer fires, and the callback is executed, when the watermark is greater than or equal to the given timestamp.

  • void deleteProcessingTimeTimer(long timestamp)   Deletes a previously registered processing-time timer. If no timer with this timestamp exists, the call has no effect.

  • void deleteEventTimeTimer(long timestamp)   Deletes a previously registered event-time timer. If no timer with this timestamp exists, the call has no effect.

When a timer fires, the callback function onTimer() is executed. Note that timers can only be used on keyed streams.
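
The firing rule for event-time timers can be made concrete with a small plain-Java model. This is a sketch, not Flink code: ToyTimerService and its advanceWatermark() method are hypothetical names introduced here, and the key is passed explicitly because the model has no keyed context. It models two behaviors described above: a timer fires once the watermark is greater than or equal to its timestamp, and deleting a timer that does not exist has no effect. It also assumes that registering the same key and timestamp twice yields a single timer, which is how Flink deduplicates timers per key and timestamp.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of per-key event-time timers; hypothetical, NOT Flink's TimerService. */
class ToyTimerService {
  // timestamp -> keys that have a timer at that timestamp (sets remove duplicates)
  private final TreeMap<Long, TreeSet<String>> timers = new TreeMap<>();
  private long watermark = Long.MIN_VALUE;

  long currentWatermark() {
    return watermark;
  }

  // Registering the same (key, timestamp) pair twice keeps a single timer
  void registerEventTimeTimer(String key, long timestamp) {
    timers.computeIfAbsent(timestamp, t -> new TreeSet<>()).add(key);
  }

  // Deleting a timer that was never registered is a no-op
  void deleteEventTimeTimer(String key, long timestamp) {
    TreeSet<String> keys = timers.get(timestamp);
    if (keys != null) {
      keys.remove(key);
      if (keys.isEmpty()) {
        timers.remove(timestamp);
      }
    }
  }

  // Advances the watermark and returns "key@timestamp" for every timer that fires
  List<String> advanceWatermark(long newWatermark) {
    watermark = Math.max(watermark, newWatermark);
    List<String> fired = new ArrayList<>();
    // headMap(watermark, true) is a view of all timers with timestamp <= watermark
    NavigableMap<Long, TreeSet<String>> due = timers.headMap(watermark, true);
    for (Map.Entry<Long, TreeSet<String>> e : due.entrySet()) {
      for (String key : e.getValue()) {
        fired.add(key + "@" + e.getKey());
      }
    }
    due.clear(); // clearing the view removes the fired timers from 'timers'
    return fired;
  }
}

class ToyTimerDemo {
  public static void main(String[] args) {
    ToyTimerService ts = new ToyTimerService();
    ts.registerEventTimeTimer("sensor_1", 1000L);
    ts.registerEventTimeTimer("sensor_1", 1000L); // duplicate, ignored
    ts.registerEventTimeTimer("sensor_6", 2000L);
    ts.deleteEventTimeTimer("sensor_7", 3000L);   // no such timer, no effect
    System.out.println(ts.advanceWatermark(999L));  // nothing is due yet
    System.out.println(ts.advanceWatermark(1000L)); // sensor_1's timer fires once
    System.out.println(ts.advanceWatermark(5000L)); // sensor_6's timer fires
  }
}
```

In a real job the runtime advances the watermark and calls onTimer() itself; the explicit advanceWatermark() call exists only so the rule can be tried out without a cluster.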

Test code

The following example shows how KeyedProcessFunction operates on a KeyedStream.

Requirement: monitor the readings of temperature sensors and raise an alarm if a sensor's temperature rises continuously for 10 seconds (processing time).

  • Java code

    package processfunction;
    
    import apitest.beans.SensorReading;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;
    
    /**
     * @author : Ashiamd email: ashiamd@foxmail.com
     * @date : 2021/2/3 1:02 AM
     */
    public class ProcessTest2_ApplicationCase {
    
      public static void main(String[] args) throws Exception {
        // Create execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Set the parallelism to 1
        env.setParallelism(1);
        // Get data from socket
        DataStream<String> inputStream = env.socketTextStream("localhost", 7777);
        // Convert data to SensorReading type
        DataStream<SensorReading> sensorReadingStream = inputStream.map(line -> {
          String[] fields = line.split(",");
          return new SensorReading(fields[0], Long.parseLong(fields[1]), Double.parseDouble(fields[2]));
        });
        // If there is a continuous temperature rise within 10s, an alarm will be given
        sensorReadingStream.keyBy(SensorReading::getId)
          .process(new TempConsIncreWarning(Time.seconds(10).toMilliseconds()))
          .print();
        env.execute();
      }
    
      // If there is a continuous temperature rise within 10s, an alarm will be given
      public static class TempConsIncreWarning extends KeyedProcessFunction<String, SensorReading, String> {
    
        public TempConsIncreWarning(Long interval) {
          this.interval = interval;
        }
    
        // Time interval of alarm (alarm if the temperature continues to rise within the interval)
        private Long interval;
    
        // Last temperature value
        private ValueState<Double> lastTemperature;
        // Trigger time of the last timer (alarm time)
        private ValueState<Long> recentTimerTimeStamp;
    
        @Override
        public void open(Configuration parameters) throws Exception {
          lastTemperature = getRuntimeContext().getState(new ValueStateDescriptor<Double>("lastTemperature", Double.class));
          recentTimerTimeStamp = getRuntimeContext().getState(new ValueStateDescriptor<Long>("recentTimerTimeStamp", Long.class));
        }
    
        @Override
        public void close() throws Exception {
          lastTemperature.clear();
          recentTimerTimeStamp.clear();
        }
    
        @Override
        public void processElement(SensorReading value, Context ctx, Collector<String> out) throws Exception {
          // Current temperature value
          double curTemp = value.getTemperature();
          // Last temperature (if not, set to current temperature)
          double lastTemp = lastTemperature.value() != null ? lastTemperature.value() : curTemp;
          // Timer status value (timestamp)
          Long timerTimestamp = recentTimerTimeStamp.value();
    
          // If current temperature > last temperature and no alarm timer is set
          if (curTemp > lastTemp && null == timerTimestamp) {
            long warningTimestamp = ctx.timerService().currentProcessingTime() + interval;
            ctx.timerService().registerProcessingTimeTimer(warningTimestamp);
            recentTimerTimeStamp.update(warningTimestamp);
          }
          // If the temperature did not rise (it fell or stayed the same) and an alarm timer is set, delete the timer
          else if (curTemp <= lastTemp && timerTimestamp != null) {
            ctx.timerService().deleteProcessingTimeTimer(timerTimestamp);
            recentTimerTimeStamp.clear();
          }
          // Update saved temperature values
          lastTemperature.update(curTemp);
        }
    
        // Timer task
        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
          // The alarm is triggered and the timer status value is cleared
          out.collect("sensor " + ctx.getCurrentKey() + " Temperature value continuous " + interval + " ms rise");
          recentTimerTimeStamp.clear();
        }
      }
    }
  • Start the local socket and input data

    nc -lk 7777
    • input

      sensor_1,1547718199,35.8
      sensor_1,1547718199,34.1
      sensor_1,1547718199,34.2
      sensor_1,1547718199,35.1
      sensor_6,1547718201,15.4
      sensor_7,1547718202,6.7
      sensor_10,1547718205,38.1
      sensor_10,1547718205,39  
      sensor_6,1547718201,18  
      sensor_7,1547718202,9.1
    • output

      sensor sensor_1 Temperature value continuous 10000 ms rise
      sensor sensor_10 Temperature value continuous 10000 ms rise
      sensor sensor_6 Temperature value continuous 10000 ms rise
      sensor sensor_7 Temperature value continuous 10000 ms rise
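
To see why all four sensors raise an alarm, the state transitions of TempConsIncreWarning can be replayed in plain Java. The following is a sketch, not Flink code: RiseAlarmModel, onReading(), advanceTo(), and replaySample() are hypothetical names, two maps stand in for the keyed ValueState, and the replay assumes that one input line is typed per second (the source does not say how fast the lines were entered).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the alarm logic above; hypothetical, NOT Flink code. */
class RiseAlarmModel {
  private final long interval;
  private final Map<String, Double> lastTemp = new HashMap<>(); // models lastTemperature
  private final Map<String, Long> timerAt = new HashMap<>();    // models recentTimerTimeStamp

  RiseAlarmModel(long interval) {
    this.interval = interval;
  }

  // Mirrors processElement(); 'now' stands in for currentProcessingTime()
  void onReading(String id, double temp, long now) {
    double last = lastTemp.getOrDefault(id, temp);
    Long timer = timerAt.get(id);
    if (temp > last && timer == null) {
      timerAt.put(id, now + interval); // first rise: start the countdown
    } else if (temp <= last && timer != null) {
      timerAt.remove(id);              // no rise: cancel the countdown
    }
    lastTemp.put(id, temp);
  }

  // Mirrors onTimer(): returns the ids whose timers are due, in firing order
  List<String> advanceTo(long now) {
    List<Map.Entry<String, Long>> due = new ArrayList<>();
    for (Map.Entry<String, Long> e : timerAt.entrySet()) {
      if (e.getValue() <= now) {
        due.add(Map.entry(e.getKey(), e.getValue()));
      }
    }
    due.sort(Map.Entry.comparingByValue()); // earliest timer first
    List<String> alarms = new ArrayList<>();
    for (Map.Entry<String, Long> e : due) {
      alarms.add(e.getKey());
      timerAt.remove(e.getKey());
    }
    return alarms;
  }

  // Replays the sample input, assuming one line is typed per second
  static List<String> replaySample() {
    String[] ids = {"sensor_1", "sensor_1", "sensor_1", "sensor_1", "sensor_6",
                    "sensor_7", "sensor_10", "sensor_10", "sensor_6", "sensor_7"};
    double[] temps = {35.8, 34.1, 34.2, 35.1, 15.4, 6.7, 38.1, 39, 18, 9.1};
    RiseAlarmModel model = new RiseAlarmModel(10_000L);
    for (int i = 0; i < ids.length; i++) {
      model.onReading(ids[i], temps[i], i * 1_000L);
    }
    return model.advanceTo(60_000L);
  }

  public static void main(String[] args) {
    System.out.println(replaySample());
  }
}
```

Under these assumptions the replay yields sensor_1, sensor_10, sensor_6, sensor_7, the same order as the output above. Each sensor's timer is registered at its first rise and is never cancelled, because no later reading for that sensor is lower or equal. Note that a single rise followed by no further data is enough to raise the alarm, since only a non-rising reading deletes the timer.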

9.3 Side output

  • An element can belong to multiple windows. It is sent to the side output stream only when no window can accept it any more, that is, after every window containing it has closed.
  • In short, if an element ends up in the side output stream, every window containing it has already closed because its allowed lateness has expired, so the late element can only go to the side output stream.
  • Most operators of the DataStream API have a single output, that is, one stream of one data type. The exception is the split operator, which divides a stream into multiple streams, but these all have the same data type.

  • The side output feature of process functions can produce multiple streams, and the data types of these streams can differ.

  • A side output is identified by an OutputTag<X> object, where X is the data type of the output stream.

  • A process function emits an event to one or more side outputs through the Context object.

Test code

Scenario: readings with a temperature above 30 go to the high-temperature stream; all other readings go to the low-temperature stream.

  • Java code

    package processfunction;
    
    import apitest.beans.SensorReading;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;
    
    /**
     * @author : Ashiamd email: ashiamd@foxmail.com
     * @date : 2021/2/3 2:07 AM
     */
    public class ProcessTest3_SideOuptCase {
      public static void main(String[] args) throws Exception {
        // Create execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Set parallelism = 1
        env.setParallelism(1);
        // Read data from local socket
        DataStream<String> inputStream = env.socketTextStream("localhost", 7777);
        // Convert to SensorReading type
        DataStream<SensorReading> dataStream = inputStream.map(line -> {
          String[] fields = line.split(",");
          return new SensorReading(fields[0], Long.parseLong(fields[1]), Double.parseDouble(fields[2]));
        });
    
        // Define an OutputTag that identifies the low-temperature side output stream
        // An OutputTag must always be an anonymous inner class
        // so that Flink can derive a TypeInformation for the generic type parameter.
        OutputTag<SensorReading> lowTempTag = new OutputTag<SensorReading>("lowTemp"){};
    
        // Use a ProcessFunction with a side output to split the stream
        SingleOutputStreamOperator<SensorReading> highTempStream = dataStream.process(new ProcessFunction<SensorReading, SensorReading>() {
          @Override
          public void processElement(SensorReading value, Context ctx, Collector<SensorReading> out) throws Exception {
            // Readings above 30 ℃ go to the main (high-temperature) stream; all others go to the low-temperature side output
            if (value.getTemperature() > 30) {
              out.collect(value);
            } else {
              ctx.output(lowTempTag, value);
            }
          }
        });
    
        highTempStream.print("high-temp");
        highTempStream.getSideOutput(lowTempTag).print("low-temp");
    
        env.execute();
      }
    }
  • Start the local socket and input data

    • input

      sensor_1,1547718199,35.8
      sensor_6,1547718201,15.4
      sensor_7,1547718202,6.7
      sensor_10,1547718205,38.1
    • output

      high-temp> SensorReading{id='sensor_1', timestamp=1547718199, temperature=35.8}
      low-temp> SensorReading{id='sensor_6', timestamp=1547718201, temperature=15.4}
      low-temp> SensorReading{id='sensor_7', timestamp=1547718202, temperature=6.7}
      high-temp> SensorReading{id='sensor_10', timestamp=1547718205, temperature=38.1}

9.4 CoProcessFunction

  • For operations on two input streams, the DataStream API provides the low-level CoProcessFunction. It offers one method per input stream: processElement1() and processElement2().

  • As with ProcessFunction, both methods are called with a Context object, which gives access to the element or timer timestamp, the TimerService, and side outputs.

  • CoProcessFunction also provides the onTimer() callback function.
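
The division of labor between processElement1(), processElement2(), and onTimer() can be sketched with a plain-Java model. The scenario is an assumption chosen for illustration and does not come from this article: the first input carries sensor readings, the second carries control events that enable forwarding for one sensor for a limited time, and a timer switches forwarding off again. ToyReadingFilter and its method parameters are hypothetical; this is not Flink's CoProcessFunction, where the shared map would be keyed state and the timer would be registered through the TimerService.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy two-input function; hypothetical, NOT Flink's CoProcessFunction. */
class ToyReadingFilter {
  // State shared by both inputs: sensor id -> time at which forwarding ends
  private final Map<String, Long> forwardUntil = new HashMap<>();
  // Stands in for the Collector
  final List<String> out = new ArrayList<>();

  // Input 1: a sensor reading, forwarded only while a switch is active for its id
  void processElement1(String sensorId, double temperature) {
    if (forwardUntil.containsKey(sensorId)) {
      out.add(sensorId + "," + temperature);
    }
  }

  // Input 2: a control event that enables forwarding for durationMs from 'now'
  void processElement2(String sensorId, long durationMs, long now) {
    // A real function would also register a timer for this timestamp here
    forwardUntil.merge(sensorId, now + durationMs, Math::max);
  }

  // Timer callback: switch forwarding off once the stored end time is reached
  void onTimer(String sensorId, long timestamp) {
    Long until = forwardUntil.get(sensorId);
    if (until != null && until <= timestamp) {
      forwardUntil.remove(sensorId);
    }
  }
}

class ToyReadingFilterDemo {
  public static void main(String[] args) {
    ToyReadingFilter filter = new ToyReadingFilter();
    filter.processElement1("sensor_1", 35.8);            // dropped: no switch yet
    filter.processElement2("sensor_1", 10_000L, 1_000L); // forward until t = 11000
    filter.processElement1("sensor_1", 34.1);            // forwarded
    filter.processElement1("sensor_6", 15.4);            // dropped: different sensor
    filter.onTimer("sensor_1", 11_000L);                 // switch expires
    filter.processElement1("sensor_1", 35.1);            // dropped again
    System.out.println(filter.out);
  }
}
```

The point of the sketch is that both inputs read and write the same state, which is what makes a two-input function more than two independent single-input functions.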

Posted by byronbailey on Mon, 01 Nov 2021 22:39:01 -0700