ProcessFunction: the lowest level API of Flink


The operators and functions introduced earlier can perform some time-based operations, but they cannot read the operator's current Processing Time or Watermark. They are simple to call but relatively limited. To read the Watermark in a data stream, or to schedule logic to run at a specific point in time, you need the ProcessFunction family. These functions are the lowest-level API in Flink and give finer-grained control over the data stream. Flink SQL is implemented on top of them, and they are also used in business scenarios that need highly customized logic.

At present, this family mainly includes KeyedProcessFunction, ProcessFunction, CoProcessFunction, KeyedCoProcessFunction, ProcessJoinFunction and ProcessWindowFunction. Each has its own emphasis, but they share two core capabilities:

  • State: we can access and update Keyed State in these functions.

  • Timer: we can set a timer, much like setting an alarm clock, and so build more complex business logic along the time dimension.

For an introduction to state, please refer to my article "Flink state management details". Here we focus on the other features of ProcessFunction. All the code in this article has been uploaded to my GitHub: https://github.com/luweizheng/flick-tutorials

How to use Timer

A Timer can be thought of as an alarm clock: we first register a future point in time with it. When that time arrives, the alarm "rings" and Flink executes a callback function that carries out some business logic. We take KeyedProcessFunction as an example to show how to register and use a Timer.

ProcessFunction has two important methods, processElement and onTimer. The Java signature of processElement in the source code is as follows:

// Process one element of the data stream
public abstract void processElement(I value, Context ctx, Collector<O> out) throws Exception;

The processElement method processes one element of the data stream and emits results through Collector<O>. The Context parameter is what sets it apart from FlatMapFunction and other common functions: through Context, developers can read the element's timestamp, access the TimerService and register timers.

The other method is onTimer:

// Callback invoked when a registered timer fires
public void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception {}

This is a callback: when the registered time is reached, Flink calls onTimer to run some business logic. Its OnTimerContext parameter extends the Context described above and is used in almost the same way.

The main logic of using Timer is:

  1. Register a future timestamp t with Context in the processElement method. The semantics of this timestamp can be Processing Time or Event Time, which can be selected according to business requirements.
  2. Implement the logic to run at that time in the onTimer method. When time t is reached, Flink calls onTimer automatically.

From Context we can get a TimerService, an interface for reading timestamps and managing timers. Timers are registered with Context.timerService.registerProcessingTimeTimer or Context.timerService.registerEventTimeTimer; we only need to pass in a timestamp. Previously registered timers can be deleted with Context.timerService.deleteProcessingTimeTimer and Context.timerService.deleteEventTimeTimer. The current time can be read with Context.timerService.currentProcessingTime and Context.timerService.currentWatermark. As the names suggest, the methods come in pairs, one for each of the two time semantics.
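
To make these calls concrete, here is a minimal sketch of a KeyedProcessFunction that reads both clocks and registers a processing-time timer. The class name ReminderFunction, the String key and element types, and the delayMs parameter are assumptions made for illustration and do not come from the original project.

```scala
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Hypothetical example: for each element, schedule a reminder `delayMs`
// later in processing time.
class ReminderFunction(delayMs: Long)
  extends KeyedProcessFunction[String, String, String] {

  override def processElement(
      value: String,
      ctx: KeyedProcessFunction[String, String, String]#Context,
      out: Collector[String]): Unit = {
    val timers = ctx.timerService()

    // The two notions of "now" that TimerService exposes
    val processingTime: Long = timers.currentProcessingTime()
    val watermark: Long = timers.currentWatermark()

    // Register a processing-time timer; onTimer runs once it is reached.
    // registerEventTimeTimer and the delete methods take a timestamp the same way.
    val fireAt = processingTime + delayMs
    timers.registerProcessingTimeTimer(fireAt)

    out.collect(s"$value: timer set for $fireAt, watermark is $watermark")
  }

  override def onTimer(
      timestamp: Long,
      ctx: KeyedProcessFunction[String, String, String]#OnTimerContext,
      out: Collector[String]): Unit = {
    // timestamp is the value that was passed to registerProcessingTimeTimer
    out.collect(s"key ${ctx.getCurrentKey}: timer fired at $timestamp")
  }
}
```

It would be applied like any other KeyedProcessFunction, for example stream.keyBy(x => x).process(new ReminderFunction(5000)).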

Note that timers can only be registered on a KeyedStream. Each key can have several timers with different timestamps, but only one timer per timestamp: registering the same timestamp again for the same key has no additional effect. To use a Timer on a non-keyed DataStream, you can map all elements to a single dummy key, but then all data flows into one operator subtask.
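
The dummy-key workaround could look like the sketch below. The object and method names are hypothetical; the only Flink call involved is keyBy with a key-selector function.

```scala
import org.apache.flink.streaming.api.scala._

object ConstantKey {
  // Map every element to the same key (0). The result is a KeyedStream, so
  // timers can be registered, but all elements go to one parallel subtask.
  def keyByConstant[T](stream: DataStream[T]): KeyedStream[T, Int] =
    stream.keyBy((_: T) => 0)
}
```

Because every element shares one key, this only makes sense for low-volume streams or when a single global timer is really what the logic needs.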

Once again we use the stock trading scenario to show how to use a Timer. A stock trade record contains the stock symbol, a timestamp, the price and the trading volume. We want to detect whether a stock's price keeps rising for 10 seconds and, if so, emit an alert.

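import java.text.SimpleDateFormat

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.api.scala.typeutils.Types
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class StockPrice(symbol: String, ts: Long, price: Double, volume: Int)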

class IncreaseAlertFunction(intervalMills: Long)
extends KeyedProcessFunction[String, StockPrice, String] {

  // State: the last trade price of this stock
  lazy val lastPrice: ValueState[Double] =
    getRuntimeContext.getState(
      new ValueStateDescriptor[Double]("lastPrice", Types.of[Double])
    )

  // State: the timestamp of the timer registered for this stock
  lazy val currentTimer: ValueState[Long] =
    getRuntimeContext.getState(
      new ValueStateDescriptor[Long]("timer", Types.of[Long])
    )

  override def processElement(stock: StockPrice,
                              context: KeyedProcessFunction[String, StockPrice, String]#Context,
                              out: Collector[String]): Unit = {

    // Read the lastPrice state; it reads as 0.0 the first time a key is seen
    val prevPrice = lastPrice.value()
    // Update lastPrice
    lastPrice.update(stock.price)
    val curTimerTimestamp = currentTimer.value()
    if (prevPrice == 0.0) {
      // First element for this key: nothing to compare against, do nothing
    } else if (stock.price < prevPrice) {
      // The price dropped: delete the pending timer and clear the timer state
      context.timerService().deleteEventTimeTimer(curTimerTimestamp)
      currentTimer.clear()
    } else if (stock.price >= prevPrice && curTimerTimestamp == 0) {
      // The price did not drop and no timer is pending
      // (a curTimerTimestamp of 0 means the timer state is empty)
      // New timer = event time of the current element + interval
      val timerTs = context.timestamp() + intervalMills
      context.timerService().registerEventTimeTimer(timerTs)
      // Remember the timer so that later elements can check and delete it
      currentTimer.update(timerTs)
    }
  }

  override def onTimer(ts: Long,
                       ctx: KeyedProcessFunction[String, StockPrice, String]#OnTimerContext,
                       out: Collector[String]): Unit = {

    val formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")

    out.collect("time: " + formatter.format(ts) + ", symbol: '" + ctx.getCurrentKey +
                "' monotonically increased for " + intervalMills + " milliseconds.")
    // Clear the currentTimer state so that a new timer can be registered
  }
}

In the main logic, apply the KeyedProcessFunction with the process operator:

val inputStream: DataStream[StockPrice] = ...
val warnings = inputStream
      .keyBy(stock => stock.symbol)
      // Call the process function
      .process(new IncreaseAlertFunction(10000))

Timers are saved in checkpoints together with the other state. If timers were registered with Processing Time semantics, their timestamps may already have passed by the time the job is restored; those callbacks are then invoked immediately on restart.
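
The effect of a restart on processing-time timers can be illustrated with a toy model. This is plain Scala with no Flink dependency and is not how Flink stores timers internally; the object and method names are made up for the illustration.

```scala
object RestoredTimers {
  // Split checkpointed processing-time timers into those already due at
  // restore time (they fire immediately) and those that are still pending.
  def split(timers: Seq[Long], restoreTimeMs: Long): (Seq[Long], Seq[Long]) =
    timers.sorted.partition(_ <= restoreTimeMs)
}
```

For timers checkpointed at 1000, 2000 and 5000, a restore at time 3000 yields 1000 and 2000 as overdue and 5000 as still pending. Logic in onTimer should therefore tolerate being called later than the timestamp it was registered for.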

SideOutput

Another feature of ProcessFunction is the side output: part of the data can be sent to a separate stream whose data type may differ from that of the main output. The side stream is identified by an OutputTag[T]. Inside a ProcessFunction, records can be routed to it as follows:

class IncreaseAlertFunction(intervalMills: Long) extends KeyedProcessFunction[String, StockPrice, String] {

  // Define the OutputTag once; StockPrice is the data type of this side output stream
  val highVolumeOutput: OutputTag[StockPrice] = OutputTag[StockPrice]("high-volume-trade")

  override def processElement(stock: StockPrice,
                              context: KeyedProcessFunction[String, StockPrice, String]#Context,
                              out: Collector[String]): Unit = {

    // Other business logic

    if (stock.volume > 1000) {
      // Route high-volume trades to the side output
      context.output(highVolumeOutput, stock)
    }
  }
}

In the main logic, the side output is retrieved from the stream returned by the process operator:

// Collect the side output; the tag must have the same id and type as in the function
val outputTag: OutputTag[StockPrice] = OutputTag[StockPrice]("high-volume-trade")
// mainStream is the DataStream returned by .process(...)
val sideOutputStream: DataStream[StockPrice] = mainStream.getSideOutput(outputTag)

In this example the main output type of the KeyedProcessFunction is String, while the side output type is StockPrice, so the two can differ.
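
Side outputs are not limited to keyed streams. As a self-contained sketch, the following uses a plain ProcessFunction and returns both the main and the side stream from one helper. The Trade type, the object name and the threshold parameter are assumptions for illustration only.

```scala
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Hypothetical record type used only in this sketch
case class Trade(symbol: String, volume: Int)

object SideOutputSketch {
  // One shared tag, used both when emitting and when reading the side output
  val highVolumeTag: OutputTag[Trade] = OutputTag[Trade]("high-volume-trade")

  class SplitByVolume(threshold: Int) extends ProcessFunction[Trade, String] {
    override def processElement(
        trade: Trade,
        ctx: ProcessFunction[Trade, String]#Context,
        out: Collector[String]): Unit = {
      // Main output: the symbol of every trade
      out.collect(trade.symbol)
      // Side output: the full record of high-volume trades
      if (trade.volume > threshold) ctx.output(highVolumeTag, trade)
    }
  }

  // Returns (main stream of symbols, side stream of high-volume trades)
  def split(trades: DataStream[Trade]): (DataStream[String], DataStream[Trade]) = {
    val main = trades.process(new SplitByVolume(1000))
    (main, main.getSideOutput(highVolumeTag))
  }
}
```

Sharing a single tag value avoids the id or type mismatches that can occur when the tag is declared twice.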

Implementing a Join with ProcessFunction

If you want to join two data streams at a finer granularity, you can use CoProcessFunction or KeyedCoProcessFunction. Both provide processElement1 and processElement2 methods, which process the elements of the first and second stream respectively. The two input types and the output type may all differ. Although the data comes from two streams, both methods can access the same state, so a join can be implemented along these lines:

  • Create one or more states that can be accessed by both data flows. Take state a as an example.
  • The processElement1 method processes the first data flow, updating state a.
  • The processElement2 method processes the second data flow and generates the corresponding output based on the data in state a.

We join the stock price stream with a stream of media evaluations. Suppose there is a media evaluation stream that carries positive or negative sentiment for a stock. Both streams flow into a KeyedCoProcessFunction. The processElement2 method handles incoming media data and writes the evaluation into the state mediaState. The processElement1 method handles incoming stock trades, reads mediaState and emits a new stream of enriched records. The two methods each handle one stream, share one state, and communicate through it.

In the main logic, we connect the two streams, key both by stock symbol, and then apply the process operator:

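val stockStream: DataStream[StockPrice] = ...
val mediaStream: DataStream[Media] = ...
val joinStream: DataStream[StockPrice] = stockStream.connect(mediaStream)
      // Key both streams by stock symbol (assumes Media also has a symbol field)
      .keyBy(stock => stock.symbol, media => media.symbol)
      // Call the process function
      .process(new JoinStockMediaProcessFunction())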

The specific implementation of KeyedCoProcessFunction:

class JoinStockMediaProcessFunction extends KeyedCoProcessFunction[String, StockPrice, Media, StockPrice] {

  // State: the latest media evaluation for the current stock
  private var mediaState: ValueState[String] = _

  override def open(parameters: Configuration): Unit = {

    // Get the state handle from RuntimeContext
    mediaState = getRuntimeContext.getState(
      new ValueStateDescriptor[String]("mediaStatusState", classOf[String]))

  }

  override def processElement1(stock: StockPrice,
                               context: KeyedCoProcessFunction[String, StockPrice, Media, StockPrice]#Context,
                               collector: Collector[StockPrice]): Unit = {

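    val mediaStatus = mediaState.value()
    // Emit only if a media evaluation has already been seen for this stock.
    // Note: copy(mediaStatus = ...) requires StockPrice to declare a mediaStatus field
    if (null != mediaStatus) {
      val newStock = stock.copy(mediaStatus = mediaStatus)
      collector.collect(newStock)
    }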

  }

  override def processElement2(media: Media,
                               context: KeyedCoProcessFunction[String, StockPrice, Media, StockPrice]#Context,
                               collector: Collector[StockPrice]): Unit = {
    // The second stream updates the mediaState
    mediaState.update(media.status)
  }

}

This example is relatively simple and does not use a Timer. In real business scenarios, a Timer is typically used to clear expired state. Sample joining for machine learning in many Internet apps relies on this pattern: the features served by the backend are generated in real time, while the user's behavior in the app is generated only after the user interacts. The two belong to different data streams. Following the logic above, the two streams can be joined to produce the next round of training samples sooner. The intermediate data of both streams is kept in state, and to keep the state from growing without bound, a Timer is needed to clear entries that have expired.
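
A join that expires its state could be sketched as follows. This is one possible design under stated assumptions, not the author's implementation: the Feature, Action and Sample types, the class name and the ttlMs parameter are hypothetical, and processing-time timers are used for simplicity.

```scala
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction
import org.apache.flink.util.Collector

// Hypothetical record types, not taken from the original project
case class Feature(requestId: String, vector: String)
case class Action(requestId: String, clicked: Boolean)
case class Sample(vector: String, clicked: Boolean)

class ExpiringJoinFunction(ttlMs: Long)
  extends KeyedCoProcessFunction[String, Feature, Action, Sample] {

  private var featureState: ValueState[Feature] = _
  // Timestamp of the pending cleanup timer, null if there is none
  private var cleanupTimer: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    featureState = getRuntimeContext.getState(
      new ValueStateDescriptor[Feature]("feature", classOf[Feature]))
    cleanupTimer = getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Long]("cleanupTimer", classOf[java.lang.Long]))
  }

  override def processElement1(
      feature: Feature,
      ctx: KeyedCoProcessFunction[String, Feature, Action, Sample]#Context,
      out: Collector[Sample]): Unit = {
    featureState.update(feature)
    // Replace any pending cleanup timer so the newest feature gets a full TTL
    val pending = cleanupTimer.value()
    if (pending != null) ctx.timerService().deleteProcessingTimeTimer(pending)
    val expireAt = ctx.timerService().currentProcessingTime() + ttlMs
    ctx.timerService().registerProcessingTimeTimer(expireAt)
    cleanupTimer.update(expireAt)
  }

  override def processElement2(
      action: Action,
      ctx: KeyedCoProcessFunction[String, Feature, Action, Sample]#Context,
      out: Collector[Sample]): Unit = {
    val feature = featureState.value()
    // Join only if the feature is still in state, i.e. it has not expired
    if (feature != null) out.collect(Sample(feature.vector, action.clicked))
  }

  override def onTimer(
      timestamp: Long,
      ctx: KeyedCoProcessFunction[String, Feature, Action, Sample]#OnTimerContext,
      out: Collector[Sample]): Unit = {
    // The TTL elapsed without a newer feature: drop the state for this key
    featureState.clear()
    cleanupTimer.clear()
  }
}
```

Tracking the timer timestamp in state mirrors the currentTimer pattern from the stock example: it lets a newer element cancel the old timer, so an earlier timer cannot clear data that arrived later.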

Note that when using Event Time, both input streams must have watermarks assigned. If Event Time and Watermark are set on only one of the streams, event-time timers will not work in CoProcessFunction and KeyedCoProcessFunction: the operator takes the minimum of its inputs' watermarks as its own, so its event-time clock never advances and it cannot determine at what time it should process data.
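
An operator with two inputs tracks one watermark per input and uses the smaller of the two as its own event-time clock. The toy model below is plain Scala with hypothetical names, not Flink's internal code, and shows why a missing watermark on one input blocks event-time timers.

```scala
object TwoInputWatermark {
  // A stream that never emits a watermark stays at this initial value
  val NoWatermark: Long = Long.MinValue

  // The operator's event-time clock is the minimum of its inputs' watermarks
  def combined(watermark1: Long, watermark2: Long): Long =
    math.min(watermark1, watermark2)

  // An event-time timer fires once the operator watermark reaches its timestamp
  def fires(timerTs: Long, watermark1: Long, watermark2: Long): Boolean =
    combined(watermark1, watermark2) >= timerTs
}
```

With watermarks 5000 and 7000, a timer registered for 4000 fires. With 5000 on one input and no watermark on the other, the combined value stays at Long.MinValue and the same timer never fires.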


Posted by eddjc on Wed, 05 Feb 2020 02:21:22 -0800