Apache Flink Talk Series (12) - Time Interval (Time-windowed) JOIN

What This Article Covers

The JOIN operator is one of the core operators in data processing. Earlier in this series we covered the unbounded two-stream JOIN in Apache Flink Talk Series (09) - JOIN Operator, the JOIN of a single stream with a UDTF in Apache Flink Talk Series (10) - JOIN LATERAL, and the JOIN of a single stream with a versioned table in Apache Flink Talk Series (11) - Temporal Table JOIN. This article introduces one more JOIN operation, the Time Interval (Time-windowed) JOIN, which joins records on unbounded data streams only within a bounded time range. It is referred to as Interval JOIN in the rest of the article.

Practical Problem

The previous articles introduced the various JOINs that Flink supports. Consider whether any of them can satisfy the following query requirement:

Suppose there is an order table Orders(orderId, productName, orderTime) and a payment table Payment(orderId, payType, payTime), and we want to find the orders that were paid within one hour of being placed.

Traditional Database Solutions

In a traditional database this requirement is easy to satisfy. The SQL query is as follows:

SELECT 
  o.orderId,
  o.productName,
  p.payType,
  o.orderTime,
  payTime
FROM
  Orders AS o JOIN Payment AS p ON 
  o.orderId = p.orderId AND p.payTime >= orderTime AND p.payTime < orderTime + 3600 -- seconds

This query fully satisfies the requirement. How do we do the same in Apache Flink?

Apache Flink Solution

Unbounded Two-Stream JOIN

The first approach that comes to mind is the unbounded two-stream JOIN introduced in Apache Flink Talk Series (09) - JOIN Operator, with the following SQL statement:

 SELECT 
    o.orderId,
    o.productName,
    p.payType,
    o.orderTime,
    payTime 
  FROM
    Orders AS o JOIN Payment AS p ON 
    o.orderId = p.orderId AND p.payTime >= orderTime AND p.payTime < TIMESTAMPADD(SECOND, 3600, orderTime)

The unbounded two-stream JOIN solves the problem. So what does this example have to do with the Interval JOIN described in this article?

Performance issues

Although the unbounded JOIN solves the problem, a closer look at the requirement shows that neither order information nor payment information needs to be stored for long. For example, an order placed at 2018-12-27 14:22:22 only needs to be kept for one hour, because an order that has not been paid within an hour is invalid. Payment information does not need to be kept for long either: a payment for the order placed at 2018-12-27 14:22:22 that arrives after 2018-12-27 15:22:22 can never match, so it does not need to be saved in State. The unbounded two-stream JOIN, however, keeps the data of both streams in State regardless of these time bounds.

For this requirement, that implementation carries an unnecessary performance and storage cost. A new kind of JOIN, the Interval JOIN, is therefore needed: one that cleans up State so that the query can be served with high performance.
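The difference in State size can be made concrete with a small sketch. It is plain Scala, not Flink code: the Order type, the function names and the use of epoch milliseconds are assumptions made for illustration only.

```scala
object StateRetentionSketch {
  // Hypothetical record type for illustration; times are epoch milliseconds.
  final case class Order(orderId: String, orderTime: Long)

  val OneHourMs: Long = 3600L * 1000L

  // Unbounded JOIN: no time bound is known, so every order seen so far stays in state.
  def unboundedState(seen: Seq[Order]): Seq[Order] = seen

  // Interval JOIN with payTime in [orderTime, orderTime + 1 hour]:
  // once the payment-side watermark has passed orderTime + 1 hour,
  // no future payment can match, so the order can be dropped.
  def intervalState(seen: Seq[Order], paymentWatermark: Long): Seq[Order] =
    seen.filter(o => o.orderTime + OneHourMs >= paymentWatermark)
}
```

With three orders placed at 0, 30 minutes and 2 hours, and a payment watermark at 2 hours, the unbounded variant still holds all three orders, while the interval variant only needs to hold the last one.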

Functional Extensions

Currently, an unbounded two-stream JOIN cannot be followed by an event-time Window Aggregate. That is, the following statement is not supported in Apache Flink:

 SELECT COUNT(*) FROM (
  SELECT 
   ...,
   payTime
   FROM Orders AS o JOIN Payment AS p ON 
    o.orderId = p.orderId 
  ) GROUP BY TUMBLE(payTime, INTERVAL '15' MINUTE)

This is because the unbounded two-stream JOIN cannot guarantee that the payTime values it emits are greater than the current WaterMark (WaterMark is covered in a separate article of this series). An Interval JOIN, in contrast, can be followed by an event-time Window Aggregate.

Interval JOIN

To meet the requirement above while addressing both the performance issue and the functional extension, Apache Flink began developing the Time-windowed Join, which is the Interval JOIN described in this article, in release 1.4. The following sections describe the syntax, semantics and implementation of Interval JOIN in detail.

What is Interval JOIN

Interval JOIN is a bounded JOIN, in contrast to the unbounded two-stream JOIN: each record of one stream is joined with the records of the other stream that fall within a given time range. It corresponds to the Time-windowed JOIN in the official Apache Flink documentation (the name used up to release-1.7).

Interval JOIN Syntax

SELECT ... FROM t1 JOIN t2  ON t1.key = t2.key AND TIMEBOUND_EXPRESSION

TIMEBOUND_EXPRESSION can be written in the following ways:

L.time between LowerBound(R.time) and UpperBound(R.time)
R.time between LowerBound(L.time) and UpperBound(L.time)
A comparison expression on the time attributes (L.time/R.time)
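The three forms describe the same kind of condition from different sides. The following plain-Scala sketch shows this for the common case where the bounds are fixed offsets from the other side's time. The function names and the before/after parameters are hypothetical and exist only for this illustration.

```scala
object TimeBoundSketch {
  // Form 1: L.time BETWEEN R.time - before AND R.time + after
  def leftBetween(l: Long, r: Long, before: Long, after: Long): Boolean =
    l >= r - before && l <= r + after

  // Form 2: the same condition written from the right side,
  // R.time BETWEEN L.time - after AND L.time + before
  def rightBetween(l: Long, r: Long, before: Long, after: Long): Boolean =
    r >= l - after && r <= l + before

  // Form 3: the same condition as plain comparisons on the two time attributes
  def comparison(l: Long, r: Long, before: Long, after: Long): Boolean =
    l - r >= -before && l - r <= after
}
```

Note that the offsets swap roles when the condition is rewritten from the other side: a left record that may be up to `after` later than the right record means the right record may be up to `after` earlier than the left one.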

Interval JOIN Semantics

The semantics of an Interval JOIN is that each record is joined only with the records of the other stream that fall within its interval. Take again the order table Orders(orderId, productName, orderTime) and the payment table Payment(orderId, payType, payTime), and suppose we want to find the orders that are paid within one hour after being placed. The SQL query is as follows:

SELECT 
  o.orderId,
  o.productName,
  p.payType,
  o.orderTime,
  cast(payTime as timestamp) as payTime
FROM
  Orders AS o JOIN Payment AS p ON 
  o.orderId = p.orderId AND 
  p.payTime BETWEEN orderTime AND 
  orderTime + INTERVAL '1' HOUR

Orders data

orderId	productName	orderTime
001	iphone	2018-12-26 04:53:22.0
002	mac	2018-12-26 04:53:23.0
003	book	2018-12-26 04:53:24.0
004	cup	2018-12-26 04:53:38.0

Payment data

orderId	payType	payTime
001	alipay	2018-12-26 05:51:41.0
002	card	2018-12-26 05:53:22.0
003	card	2018-12-26 05:53:30.0
004	alipay	2018-12-26 05:53:31.0

The expected result is that order 003 does not appear in the result table, because its payment time 2018-12-26 05:53:30.0 is more than one hour after its order time 2018-12-26 04:53:24.0.
The expected result is therefore as follows:

orderId	productName	payType	orderTime	payTime
001	iphone	alipay	2018-12-26 04:53:22.0	2018-12-26 05:51:41.0
002	mac	card	2018-12-26 04:53:23.0	2018-12-26 05:53:22.0
004	cup	alipay	2018-12-26 04:53:38.0	2018-12-26 05:53:31.0

Order 003 is thus treated as invalid, and the inventory can be updated so that the item goes back on sale.
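The expected result can be checked by hand with a small batch-style sketch over the sample data. This is plain Scala that evaluates the join condition directly. It is not how Flink executes the query, and the case classes and function names are assumptions made for illustration.

```scala
import java.time.{Duration, LocalDateTime}

object IntervalJoinSemantics {
  final case class Order(orderId: String, productName: String, orderTime: LocalDateTime)
  final case class Payment(orderId: String, payType: String, payTime: LocalDateTime)

  // Sample data from the Orders and Payment tables above
  val orders: Seq[Order] = Seq(
    Order("001", "iphone", LocalDateTime.parse("2018-12-26T04:53:22")),
    Order("002", "mac", LocalDateTime.parse("2018-12-26T04:53:23")),
    Order("003", "book", LocalDateTime.parse("2018-12-26T04:53:24")),
    Order("004", "cup", LocalDateTime.parse("2018-12-26T04:53:38"))
  )
  val payments: Seq[Payment] = Seq(
    Payment("001", "alipay", LocalDateTime.parse("2018-12-26T05:51:41")),
    Payment("002", "card", LocalDateTime.parse("2018-12-26T05:53:22")),
    Payment("003", "card", LocalDateTime.parse("2018-12-26T05:53:30")),
    Payment("004", "alipay", LocalDateTime.parse("2018-12-26T05:53:31"))
  )

  // payTime BETWEEN orderTime AND orderTime + INTERVAL '1' HOUR (both ends inclusive)
  def withinOneHour(o: Order, p: Payment): Boolean =
    !p.payTime.isBefore(o.orderTime) &&
      !p.payTime.isAfter(o.orderTime.plus(Duration.ofHours(1)))

  // Equality on orderId plus the time bound, evaluated over all pairs
  def join(os: Seq[Order], ps: Seq[Payment]): Seq[(Order, Payment)] =
    for {
      o <- os
      p <- ps
      if o.orderId == p.orderId && withinOneHour(o, p)
    } yield (o, p)

  def main(args: Array[String]): Unit =
    join(orders, payments).foreach { case (o, p) =>
      println(s"${o.orderId}\t${o.productName}\t${p.payType}\t${o.orderTime}\t${p.payTime}")
    }
}
```

Order 003 is paid 3606 seconds after it was placed, so it fails the one-hour bound, while orders 001, 002 and 004 pass it.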

Next we illustrate the semantics of Interval JOIN visually. To do so we change the sample requirement slightly: an order can also be paid in advance (reasonable or not, this is only to explain the semantics), that is, a payment made within one hour before or after the order time is valid. The SQL statement is as follows:

SELECT
  ...
FROM
  Orders AS o JOIN Payment AS p ON
  o.orderId = p.orderId AND
  p.payTime BETWEEN orderTime - INTERVAL '1' HOUR AND
  orderTime + INTERVAL '1' HOUR

The semantics of this query are illustrated as follows:

There are several key points in the figure above:

Join interval - for example, an order with order time 3 is joined with payments whose payment time lies in the interval [2,4].
WaterMark - for example, if the latest Orders record in the figure has time 3 and the latest Payment record has time 5, the WaterMark is the smaller of the two minus UpperBound, that is, Min(3,5)-1 = 2.
Expired data - for performance and storage reasons expired data is cleared. For example, when the WaterMark is 2, data earlier than 2 has expired and can be removed.
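The arithmetic behind these three points can be written down as a minimal sketch. Times are the abstract units used in the figure (one unit corresponds to one hour in the query), and the function names are hypothetical, chosen only for this illustration.

```scala
object IntervalJoinArithmetic {
  // Interval of payment times that an order with the given time can join with
  def joinRange(orderTime: Long, lowerBound: Long, upperBound: Long): (Long, Long) =
    (orderTime - lowerBound, orderTime + upperBound)

  // WaterMark as described above: the smaller of the two latest times minus UpperBound
  def watermark(lastOrderTime: Long, lastPaymentTime: Long, upperBound: Long): Long =
    math.min(lastOrderTime, lastPaymentTime) - upperBound

  // Data strictly earlier than the WaterMark has expired and can be cleared
  def isExpired(recordTime: Long, wm: Long): Boolean = recordTime < wm
}
```

With both bounds equal to 1, joinRange(3, 1, 1) gives the interval (2, 4) and watermark(3, 5, 1) gives 2, matching the figures quoted above.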

Interval JOIN Implementation Principles

Like the two-stream JOIN, an Interval JOIN has to store the data of both its left and right inputs, so the underlying implementation still uses State for storage. Because data flows in continuously in stream computing, the JOIN is computed incrementally: each arriving record triggers a JOIN computation. The internal logic is illustrated with a concrete example as follows:

A brief explanation of the processing logic for each record is as follows:

The actual internal logic is more complex than this description, but the outline above is enough to understand the principle.
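The incremental idea can be sketched in a few lines of plain Scala. This is a deliberately simplified model, not Flink's operator: the Event type, the IntervalJoiner class and its methods are assumptions made for illustration, and two in-memory collections stand in for Flink's State.

```scala
object IncrementalIntervalJoinSketch {
  // Hypothetical record type; time is an abstract event-time value.
  final case class Event(key: String, time: Long)

  // Join condition: right.time BETWEEN left.time - lower AND left.time + upper
  final class IntervalJoiner(lower: Long, upper: Long) {
    // Simplified stand-ins for State: one buffer per input side
    private var leftState: Vector[Event] = Vector.empty
    private var rightState: Vector[Event] = Vector.empty

    private def matches(l: Event, r: Event): Boolean =
      l.key == r.key && r.time >= l.time - lower && r.time <= l.time + upper

    // A left record probes the buffered right records, then is buffered itself
    def onLeft(l: Event): Seq[(Event, Event)] = {
      val out = rightState.filter(r => matches(l, r)).map(r => (l, r))
      leftState = leftState :+ l
      out
    }

    // A right record probes the buffered left records, then is buffered itself
    def onRight(r: Event): Seq[(Event, Event)] = {
      val out = leftState.filter(l => matches(l, r)).map(l => (l, r))
      rightState = rightState :+ r
      out
    }

    // Drop records that cannot match any record arriving at or after the watermark
    def onWatermark(wm: Long): Unit = {
      leftState = leftState.filterNot(l => l.time + upper < wm)
      rightState = rightState.filterNot(r => r.time + lower < wm)
    }

    def stateSize: Int = leftState.size + rightState.size
  }
}
```

Each arriving record is joined against what is already buffered on the other side, so results are produced incrementally, and the watermark bounds how much has to stay buffered.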

Sample Code

The complete code for the order and payment example is shared below (based on flink-1.7.0):

import java.sql.Timestamp

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row

import scala.collection.mutable

object SimpleTimeIntervalJoin {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val tEnv = TableEnvironment.getTableEnvironment(env)
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    // Construct order data
    val ordersData = new mutable.MutableList[(String, String, Timestamp)]
    ordersData.+=(("001", "iphone", new Timestamp(1545800002000L)))
    ordersData.+=(("002", "mac", new Timestamp(1545800003000L)))
    ordersData.+=(("003", "book", new Timestamp(1545800004000L)))
    ordersData.+=(("004", "cup", new Timestamp(1545800018000L)))

    // Construct payment data
    val paymentData = new mutable.MutableList[(String, String, Timestamp)]
    paymentData.+=(("001", "alipay", new Timestamp(1545803501000L)))
    paymentData.+=(("002", "card", new Timestamp(1545803602000L)))
    paymentData.+=(("003", "card", new Timestamp(1545803610000L)))
    paymentData.+=(("004", "alipay", new Timestamp(1545803611000L)))
    val orders = env
      .fromCollection(ordersData)
      .assignTimestampsAndWatermarks(new TimestampExtractor[String, String]())
      .toTable(tEnv, 'orderId, 'productName, 'orderTime.rowtime)
    val payments = env
      .fromCollection(paymentData)
      .assignTimestampsAndWatermarks(new TimestampExtractor[String, String]())
      .toTable(tEnv, 'orderId, 'payType, 'payTime.rowtime)

    // Register both tables so they can be referenced from SQL
    tEnv.registerTable("Orders", orders)
    tEnv.registerTable("Payment", payments)

    val sqlQuery =
      """
        |SELECT
        |  o.orderId,
        |  o.productName,
        |  p.payType,
        |  o.orderTime,
        |  cast(payTime as timestamp) as payTime
        |FROM
        |  Orders AS o JOIN Payment AS p ON o.orderId = p.orderId AND
        | p.payTime BETWEEN orderTime AND orderTime + INTERVAL '1' HOUR
        |""".stripMargin
    tEnv.registerTable("IntervalJoinResult", tEnv.sqlQuery(sqlQuery))

    val result = tEnv.scan("IntervalJoinResult").toAppendStream[Row]
    result.print()
    env.execute()
  }

}

class TimestampExtractor[T1, T2]
  extends BoundedOutOfOrdernessTimestampExtractor[(T1, T2, Timestamp)](Time.seconds(10)) {
  override def extractTimestamp(element: (T1, T2, Timestamp)): Long = {
    element._3.getTime
  }
}

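The output should contain the three rows of the expected result table in the semantics section above (orders 001, 002 and 004). Order 003 is filtered out.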

Summary

Starting from a real business requirement, this article showed that the same requirement can be implemented with either the unbounded two-stream JOIN or the Time Interval JOIN. The Time Interval JOIN performs better than the unbounded two-stream JOIN, and its result can be fed into an event-time Window Aggregate. The article then introduced the syntax, semantics and implementation principle of Interval JOIN, and finally shared the complete sample code for orders and payments. I hope it gives you a concrete understanding of the Apache Flink Time Interval JOIN!

Author: Golden Bamboo

Source: Ali Yunqi Community

Source link: https://yq.aliyun.com/articles/686809

Posted by quickstopman on Thu, 02 May 2019 11:30:37 -0700
