Detailed explanation of watermarks in Flink event time

Keywords: Apache Scala Java network

window operation on data

In a window operation on streaming data, a window closes once event time passes the end of the window. In a real production environment, however, network delays and other causes can make data arrive late, so the stream ends up out of order. As shown in the figure below, the number in each circle represents the event time. If you apply a 5-second tumbling window to this data, then as soon as the record for second 6 arrives, the system treats the 0-5 second window as closed, and the late record for second 4 is lost.
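The loss described above can be reproduced without Flink. The following is only a minimal sketch in plain Scala, not Flink's implementation: the names TumblingWindowSketch, windowStart and droppedEvents are made up for illustration, the arrival order is hypothetical, and event times are assumed to be non-negative seconds.

```scala
object TumblingWindowSketch {
  // Start of the tumbling window that contains `ts` (seconds, assumed non-negative).
  def windowStart(ts: Long, size: Long): Long = ts - (ts % size)

  // Replays event times in arrival order and returns those that arrive
  // after a later event has already pushed event time past their window.
  def droppedEvents(arrivals: Seq[Long], size: Long): Seq[Long] = {
    var maxSeen = Long.MinValue
    val dropped = Seq.newBuilder[Long]
    for (ts <- arrivals) {
      val windowEnd = windowStart(ts, size) + size
      // With no tolerance for lateness, the window is closed once
      // the largest event time seen so far has reached its end.
      if (maxSeen >= windowEnd) dropped += ts
      maxSeen = math.max(maxSeen, ts)
    }
    dropped.result()
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical arrival order: the 4-second record shows up after the 6-second one.
    println(droppedEvents(Seq(1L, 2L, 3L, 6L, 4L, 7L), 5L))
  }
}
```

In this replay only the 4-second record is reported as dropped, because the 6-second record had already moved event time past the end of the 0-5 window.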

In this case, besides using allowed lateness and side outputs, you can also use watermarks. A watermark lets you handle out-of-order data and defines the point in time at which a window closes.

What is a watermark

Literally, a watermark is a water level line, like the scale that marks the water level in real life; in Flink's semantics, a watermark can likewise be understood as a mark on the time scale. Take a simple example. A certain route starts at 10 a.m. every day, with a flight every half hour. Each departure time can be understood as a watermark, and the passengers as events. If all passengers arrive before take-off, they board normally and reach their destination. This corresponds to the watermark over ordered events: as shown in the figure below, events 1 to 7 all arrive before their own watermark. If a passenger is late, the plane leaves without waiting. This corresponds to the watermark over out-of-order events, like events 4 and 5 in the figure below. If events 4 and 5 are not handled, that data is lost.

Ordered time line

Disordered time water level

watermark test

package com.stanley.flink

/**
  * An order case class
  * @param timestamp Event time, in seconds
  * @param category Commodity category
  * @param price Price
  */
case class Order(timestamp:Long,category:String,price:Double)
package com.stanley.flink

import org.apache.flink.api.java.tuple.{Tuple, Tuple1}
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object WatermarkTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //Set to eventTime
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    //Parallelism set to 1
    env.setParallelism(1)
    //Create text flow
    val inputStream:DataStream[String] = env.socketTextStream("node1",9999)
    //Convert each "timestamp,category,price" line to an Order
    val dataStream:DataStream[Order] = inputStream.map(str=>{
      val order = str.split(",")
      Order(order(0).toLong,order(1),order(2).toDouble)})
    val outputStream:DataStream[Order] = dataStream
      //Assign timestamps and watermarks; the maximum out-of-orderness is set to 0
      .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[Order](Time.seconds(0)) {
        //Event time is given in seconds, Flink expects milliseconds
        override def extractTimestamp(t: Order): Long = t.timestamp*1000L
      }).keyBy("category").timeWindow(Time.seconds(5)).apply(new MyWindowFunction)
    //Print the window results to the console and start the job
    outputStream.print()
    env.execute()
  }
}

//Custom WindowFunction that sums the prices per category and emits an Order stamped with the window's maximum timestamp
class MyWindowFunction() extends WindowFunction[Order,Order,Tuple,TimeWindow]{
  override def apply(key: Tuple, window: TimeWindow, input: Iterable[Order], out: Collector[Order]): Unit = {
    val timestamp = window.maxTimestamp()
    var sum:Double = 0
    for (elem <- input) {
      sum = elem.price+sum
    }
    //keyBy("category") yields a Java Tuple key; unwrap the category from it
    val category:String = key.asInstanceOf[Tuple1[String]].f0
    out.collect(new Order(timestamp,category,sum))
  }
}

Enter the following four records. When the event with event time 390 arrives, the window closes: the events with timestamps 385-389 are grouped by category, aggregated, and printed to the console, while the 390 event belongs to the next window. At this point the watermark is 390. If you now enter events earlier than 390, they are lost unless they are handled in some other way.

Only the 390 food event is aggregated; the 389 record is lost

How to process late data through watermark

The watermark assigner takes a time argument that delays the watermark. From the source code, we can see that the watermark is the largest event time seen so far minus the maximum out-of-orderness.
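The rule can be sketched in a few lines of plain Scala. This is an illustration of the idea rather than Flink's own class: BoundedWatermarkSketch and WatermarkTracker are hypothetical names, and timestamps are in milliseconds, as in the job above.

```scala
object BoundedWatermarkSketch {
  // Tracks the largest event time seen so far and derives the watermark from it.
  final class WatermarkTracker(maxOutOfOrdernessMs: Long) {
    // Start high enough that the subtraction below cannot underflow.
    private var currentMaxTimestamp: Long = Long.MinValue + maxOutOfOrdernessMs

    // Watermark = largest event time seen so far minus the allowed out-of-orderness.
    def currentWatermark: Long = currentMaxTimestamp - maxOutOfOrdernessMs

    // Registers one event and returns the watermark after it.
    def onEvent(timestampMs: Long): Long = {
      currentMaxTimestamp = math.max(currentMaxTimestamp, timestampMs)
      currentWatermark
    }
  }

  def main(args: Array[String]): Unit = {
    val tracker = new WatermarkTracker(1000L) // 1 second of allowed out-of-orderness
    // A late event (387 s) does not move the watermark backwards.
    for (seconds <- Seq(385L, 389L, 390L, 387L, 391L)) {
      val watermark = tracker.onEvent(seconds * 1000L)
      println(s"event=${seconds}s watermark=${watermark}ms")
    }
  }
}
```

Because the watermark follows the maximum timestamp rather than the latest one, an event that arrives late never lowers it.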

Modify the maximum out-of-orderness to delay the watermark.

Input the same data again. This time the console prints no result, because the maximum out-of-orderness is now 1 second, so the watermark is only 389.

When the event time reaches 391, the watermark advances to 390 and the 385-389 time window closes.
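The arithmetic behind this can be checked with a small sketch. It assumes that an event-time window fires once the watermark reaches the window's maximum timestamp (its end minus 1 millisecond), which is how Flink's default event-time trigger behaves as far as I know; WindowFiringSketch and its functions are hypothetical names, and all values are in milliseconds.

```scala
object WindowFiringSketch {
  // Largest timestamp that still belongs to the window [start, start + size).
  def maxTimestamp(windowStartMs: Long, sizeMs: Long): Long = windowStartMs + sizeMs - 1

  // Watermark for a given largest event time and allowed out-of-orderness.
  def watermark(maxEventMs: Long, delayMs: Long): Long = maxEventMs - delayMs

  // Assumed firing rule: the watermark has reached the window's max timestamp.
  def fires(windowStartMs: Long, sizeMs: Long, watermarkMs: Long): Boolean =
    watermarkMs >= maxTimestamp(windowStartMs, sizeMs)

  def main(args: Array[String]): Unit = {
    val start = 385000L // the 385-389 second window
    val size = 5000L
    // Delay 0: the 390 s event closes the window.
    println(fires(start, size, watermark(390000L, 0L)))
    // Delay 1 s: the 390 s event only yields watermark 389 s, the window stays open.
    println(fires(start, size, watermark(390000L, 1000L)))
    // Delay 1 s: the 391 s event yields watermark 390 s, the window closes.
    println(fires(start, size, watermark(391000L, 1000L)))
  }
}
```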

How the downstream watermark is determined when multiple tasks run in parallel

When several upstream tasks run in parallel, the downstream watermark is the lowest of the upstream watermarks.

Set the parallelism of the code to 2

Input the same data again. The watermark of slot2 has reached 390, while the watermark of slot1 is still 389, so the downstream watermark stays at 389 and no data is output.

When a 392 event is then entered and is distributed round-robin to slot1, the watermark of slot1 becomes 391 while the watermark of slot2 is 390. The downstream watermark becomes 390, which closes the window and outputs the result data.
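The way the downstream task combines the two slots can be sketched as follows. This is a simplified model in plain Scala, not Flink's internal code: MinWatermarkSketch and DownstreamOperator are hypothetical names, and the watermark values are in seconds to match the numbers above.

```scala
object MinWatermarkSketch {
  // Keeps the latest watermark of each upstream channel and exposes their minimum.
  final class DownstreamOperator(numChannels: Int) {
    private val channelWatermarks = Array.fill(numChannels)(Long.MinValue)
    private var current = Long.MinValue

    // Records a watermark from one channel and returns the downstream watermark.
    def onWatermark(channel: Int, watermark: Long): Long = {
      channelWatermarks(channel) = math.max(channelWatermarks(channel), watermark)
      // The downstream watermark is the minimum over all channels and never decreases.
      current = math.max(current, channelWatermarks.min)
      current
    }
  }

  def main(args: Array[String]): Unit = {
    val op = new DownstreamOperator(2)
    op.onWatermark(0, 389L)          // slot1
    println(op.onWatermark(1, 390L)) // slot2 is ahead, downstream follows slot1
    println(op.onWatermark(0, 391L)) // slot1 advances, slot2 is now the minimum
  }
}
```

The slower slot always decides, which is why raising the parallelism delays the window output until every slot has seen a late enough event.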

Posted by plex303 on Thu, 25 Jun 2020 22:46:08 -0700