The latest Flink tutorial series in 2021: Flink advanced API

Day 04: Flink advanced API

Today's goals

  • Flink's four cornerstones
  • Flink Window operations
  • Flink Time
  • Flink Watermark mechanism
  • Flink state management: keyed state and operator state

Flink's four cornerstones

  • Checkpoint: provides distributed consistency, protects against data loss and enables fault recovery. A checkpoint stores the global state of the job and is persisted, for example, in the HDFS distributed file system
  • State: divided into managed state and raw state. In terms of data structure there are ValueState, ListState, MapState and BroadcastState
  • Time: event time (EventTime), ingestion time (IngestionTime) and processing time (ProcessingTime)
  • Window: time windows and count windows (TimeWindow, CountWindow), plus session windows (SessionWindow)

Window operation

  • Why are windows needed? Streaming data is dynamic and unbounded, so a window is used to delimit a range and turn the unbounded data into bounded, static data that can be computed on.

Window classification

  • Time: windows divided by time
    • The window length is a unit of time such as a day, an hour or a minute
    • Tumbling windows and sliding windows are used most often
    • Tumbling window: the slide interval equals the window size, so windows do not overlap
    • Sliding window: the slide interval is smaller than the window size, so consecutive windows overlap
    • Session window
  • Count: windows divided by number of elements
    • Tumbling count window
    • Sliding count window
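
To make the difference between the two time-based window types concrete, here is a minimal sketch in plain Java (not Flink code) of how a timestamp maps to windows. The class and method names are made up for illustration, and it assumes non-negative timestamps in milliseconds with windows aligned to 0.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only: shows which window(s) a timestamp falls into.
final class WindowAssignSketch {

    /** Start of the tumbling window of length `size` that contains `ts`. */
    static long tumblingStart(long ts, long size) {
        return ts - (ts % size);
    }

    /** Starts of every sliding window (length `size`, step `slide`) that contains `ts`. */
    static List<Long> slidingStarts(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - (ts % slide);          // latest window that already began
        for (long start = lastStart; start > ts - size; start -= slide) {
            starts.add(start);                       // window [start, start + size) contains ts
        }
        return starts;
    }

    public static void main(String[] args) {
        // Tumbling, size 5 s: second 12 belongs to exactly one window, [10 s, 15 s)
        System.out.println(tumblingStart(12_000, 5_000));
        // Sliding, size 10 s, slide 5 s: second 12 belongs to [10 s, 20 s) and [5 s, 15 s)
        System.out.println(slidingStarts(12_000, 10_000, 5_000));
    }
}
```

With a tumbling window every element lands in exactly one window, while with a sliding window of size 10 s and slide 5 s every element is counted in two windows.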

How to use

Window cases

Time window requirements

  • Every 5 seconds, count the vehicles that passed the traffic light at each intersection in the last 5 seconds (time-based tumbling window)
  • Every 5 seconds, count the vehicles that passed the traffic light at each intersection in the last 10 seconds (time-based sliding window)
package cn.itcast.flink.basestone;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

/**
 * Author itcast
 * Date 2021/6/18 15:00
 * Development steps
 * 1. Convert each input string such as "9,3" into a CartInfo
 * 2. Group by sensorId (keyBy)
 * 3. Apply a tumbling or sliding window and aggregate
 * 4. Print the output
 * 5. Execute the job
 */
public class WindowDemo01 {
    public static void main(String[] args) throws Exception {
        //1.env create flow execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        //2. Read socket data source
        DataStreamSource<String> source = env.socketTextStream("192.168.88.161", 9999);
        //3. Convert 9,3 to CartInfo(9,3)
        DataStream<CartInfo> mapDS = source.map(new MapFunction<String, CartInfo>() {
            @Override
            public CartInfo map(String value) throws Exception {
                String[] kv = value.split(",");
                return new CartInfo(kv[0], Integer.parseInt(kv[1]));
            }
        });
        //4. Key by sensorId, then sum the count within each window
        //Demand 1: every 5 seconds, count the vehicles that passed each traffic light in the last 5 seconds
        //(tumbling processing-time window of 5 seconds)
        SingleOutputStreamOperator<CartInfo> result1 = mapDS.keyBy(t -> t.sensorId)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .sum("count");
        //Demand 2: every 5 seconds, count the vehicles that passed each traffic light in the last 10 seconds
        //(sliding processing-time window: size 10 seconds, slide 5 seconds)
        SingleOutputStreamOperator<CartInfo> result2 = mapDS.keyBy(t -> t.sensorId)
                .window(SlidingProcessingTimeWindows.of(Time.seconds(10),Time.seconds(5)))
                .sum("count");
        //5. Printout
        //result1.print();
        result2.print();
        //6.execute
        env.execute();
    }

    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class CartInfo {
        private String sensorId;//Traffic light id
        private Integer count;//Number of vehicles that passed the traffic light
    }
}

Counting window requirements

  • Demand 1: for each intersection, sum the cars reported in its last 5 messages, and emit a result every time the same key has appeared 5 times (count-based tumbling window)
  • Demand 2: for each intersection, sum the cars reported in its last 5 messages, and emit a result every time the same key has appeared 3 times (count-based sliding window)
package cn.itcast.flink.basestone;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

/**
 * Author itcast
 * Date 2021/6/18 15:46
 * Desc Count window demo: tumbling and sliding count windows keyed by sensorId
 */
public class CountWindowDemo01 {
    public static void main(String[] args) throws Exception {
        //1.env create flow execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        //2. Read socket data source
        DataStreamSource<String> source = env.socketTextStream("192.168.88.161", 9999);
        //3. Convert 9,3 to CartInfo(9,3)
        DataStream<WindowDemo01.CartInfo> mapDS = source.map(new MapFunction<String, WindowDemo01.CartInfo>() {
            @Override
            public WindowDemo01.CartInfo map(String value) throws Exception {
                String[] kv = value.split(",");
                return new WindowDemo01.CartInfo(kv[0], Integer.parseInt(kv[1]));
            }
        });
        //Demand 1: sum the cars in the last 5 messages of each intersection, firing every 5 messages per key (count-based tumbling window)
        //countWindow(long size)
        SingleOutputStreamOperator<WindowDemo01.CartInfo> result1 = mapDS.keyBy(t -> t.getSensorId())
                .countWindow(5)
                .sum("count");
        //Demand 2: sum the cars in the last 5 messages of each intersection, firing every 3 messages per key (count-based sliding window)
        //countWindow(long size, long slide)
        SingleOutputStreamOperator<WindowDemo01.CartInfo> result2 = mapDS.keyBy(t -> t.getSensorId())
                .countWindow(5, 3)
                .sum("count");

        //Printout
        //result1.print();
        result2.print();
        //execute
        env.execute();
    }
    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class CartInfo {
        private String sensorId;//Signal lamp id
        private Integer count;//Number of vehicles passing the signal lamp
    }
}
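
The behaviour of countWindow(5, 3) is easy to misread, so here is a small plain-Java model of it (not Flink code; all names are invented for this sketch). It assumes the usual count-window semantics: per key, a result is emitted every `slide` messages and covers at most the last `size` messages, which means the first result may cover fewer than `size` messages.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

// Illustrative model of countWindow(size, slide) followed by sum("count").
final class CountWindowSketch {
    private final int size;   // how many recent messages a result covers
    private final int slide;  // how many new messages trigger a result
    private final Map<String, Deque<Integer>> recent = new HashMap<>();
    private final Map<String, Integer> sinceLastFire = new HashMap<>();

    CountWindowSketch(int size, int slide) {
        this.size = size;
        this.slide = slide;
    }

    /** Feeds one (key, count) message; returns the window sum when the window fires for this key. */
    OptionalInt add(String key, int count) {
        Deque<Integer> buffer = recent.computeIfAbsent(key, k -> new ArrayDeque<>());
        buffer.addLast(count);
        if (buffer.size() > size) {
            buffer.removeFirst();               // keep only the last `size` messages
        }
        int seen = sinceLastFire.merge(key, 1, Integer::sum);
        if (seen < slide) {
            return OptionalInt.empty();         // not enough new messages yet
        }
        sinceLastFire.put(key, 0);
        int sum = 0;
        for (int c : buffer) {
            sum += c;
        }
        return OptionalInt.of(sum);
    }

    public static void main(String[] args) {
        CountWindowSketch sliding = new CountWindowSketch(5, 3);
        int[] counts = {1, 2, 3, 4, 5, 6};
        for (int c : counts) {
            // Fires after the 3rd message (1+2+3) and after the 6th (2+3+4+5+6)
            sliding.add("9", c).ifPresent(sum -> System.out.println("9 -> " + sum));
        }
    }
}
```

Passing the same value for size and slide, as in new CountWindowSketch(5, 5), models the tumbling case countWindow(5).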

Flink - Time and watermark

Time

Watermark mechanism

  • It mainly solves the problem of late and out-of-order data
  • Watermark (a timestamp) = maximum event time seen so far - maximum allowed delay
  • Window trigger condition: the window computation is triggered once the watermark >= the end time of the window
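
The two formulas above can be checked with a few lines of plain Java. This is only a sketch of the arithmetic as stated in this article (it is not Flink code, and the names are invented); it assumes non-negative millisecond timestamps.

```java
// Illustrative model: watermark = max event time - max delay; fire when watermark >= window end.
final class WatermarkTriggerSketch {

    static long watermark(long maxEventTime, long maxDelay) {
        return maxEventTime - maxDelay;
    }

    static boolean fires(long watermark, long windowEnd) {
        return watermark >= windowEnd;
    }

    /** Index of the first arriving event that makes the window fire, or -1 if none does. */
    static int firstFiringIndex(long[] eventTimes, long windowEnd, long maxDelay) {
        long maxEventTime = Long.MIN_VALUE;
        for (int i = 0; i < eventTimes.length; i++) {
            maxEventTime = Math.max(maxEventTime, eventTimes[i]);  // out-of-order events never lower it
            if (fires(watermark(maxEventTime, maxDelay), windowEnd)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Arrival order; the event at 3000 arrives after the one at 4000 (out of order)
        long[] arrivals = {1_000, 4_000, 3_000, 6_000, 8_000};
        // Window [0, 5000) with a 3 s allowed delay fires only when an event with time >= 8000 arrives
        System.out.println(firstFiringIndex(arrivals, 5_000, 3_000));
    }
}
```

Because the window [0 s, 5 s) is held open until an event with time 8 s or later arrives, the out-of-order event at 3 s is still included in the result.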

Requirement

There is order data in the format: (order ID, user ID, timestamp / event time, order amount)

Every 5 seconds, compute the total order amount of each user over the last 5 seconds (time-based tumbling window).

A watermark is added to handle late and out-of-order data (by up to 3 seconds).

package cn.itcast.flink.basestone;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;
import java.util.Random;
import java.util.UUID;

/**
 * Author itcast
 * Date 2021/6/18 16:54
 * Desc Event-time tumbling window with a watermark that tolerates 3 seconds of out-of-order data
 */
public class WatermarkDemo01 {
    public static void main(String[] args) throws Exception {
        //1.env
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //Older versions defaulted to ProcessingTime and needed the call below; the new version uses EventTime by default
        //env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        //2.Source: emit Order objects (orderId:String, userId:Integer, money:Integer, eventTime:Long)
        DataStreamSource<Order> source = env.addSource(new SourceFunction<Order>() {
            //volatile because cancel() is called from a different thread than run()
            volatile boolean flag = true;
            Random rm = new Random();

            @Override
            public void run(SourceContext<Order> ctx) throws Exception {
                while (flag) {
                    ctx.collect(new Order(
                            UUID.randomUUID().toString(),
                            rm.nextInt(3),
                            rm.nextInt(101),
                            //Simulate out-of-order data: event time = current time minus a random delay of 0 to 4 seconds
                            System.currentTimeMillis() - rm.nextInt(5) * 1000
                    ));
                    Thread.sleep(1000);
                }
            }

            @Override
            public void cancel() {
                flag = false;
            }
        });

        //3.Transformation
        //Assign timestamps and watermarks: allow up to 3 seconds of out-of-order data and
        //tell Flink which field of the record holds the event time.
        //Watermark = current maximum event time - maximum allowed delay (out-of-order time)
        DataStream<Order> result = source.assignTimestampsAndWatermarks(
                WatermarkStrategy.<Order>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                        .withTimestampAssigner((element, recordTimestamp) -> element.eventTime)
        )
                //From this point on the stream carries watermarks, so event-time windows can be computed
                //Every 5 seconds, compute the total order amount of each user over the last 5 seconds (time-based tumbling window)
                .keyBy(t -> t.userId)
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .sum("money");
        //4.Sink
        result.print();
        //5.execute
        env.execute();
    }
    //Create order class
    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    public static class Order{
        private String orderId;
        private Integer userId;
        private Integer money;
        private Long eventTime;
    }
}
  • Implementing the watermark mechanism yourself by overriding the interface (custom watermark generator)
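
As a rough idea of what a hand-written generator does, the sketch below models the logic in plain Java so that it runs without Flink on the classpath. In Flink itself the interface to implement is WatermarkGenerator<T>, whose onEvent and onPeriodicEmit methods receive a WatermarkOutput; the SimpleWatermarkGenerator interface here is a simplified stand-in invented for this example, and it returns the watermark value directly.

```java
// Simplified stand-in for Flink's WatermarkGenerator<T> (hypothetical, for illustration only).
interface SimpleWatermarkGenerator<T> {
    /** Called for every event, with the timestamp extracted from it. */
    void onEvent(T event, long eventTimestamp);

    /** Called periodically; returns the watermark to emit. */
    long onPeriodicEmit();
}

// Bounded out-of-orderness: watermark = max event time seen so far - max allowed delay.
final class BoundedDelayGenerator<T> implements SimpleWatermarkGenerator<T> {
    private final long maxDelayMillis;
    private long maxTimestamp = Long.MIN_VALUE;   // no event seen yet

    BoundedDelayGenerator(long maxDelayMillis) {
        this.maxDelayMillis = maxDelayMillis;
    }

    @Override
    public void onEvent(T event, long eventTimestamp) {
        // A late event has a smaller timestamp and must not move the watermark backwards
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    @Override
    public long onPeriodicEmit() {
        if (maxTimestamp == Long.MIN_VALUE) {
            return Long.MIN_VALUE;                // avoid underflow before the first event
        }
        return maxTimestamp - maxDelayMillis;
    }

    public static void main(String[] args) {
        BoundedDelayGenerator<String> generator = new BoundedDelayGenerator<>(3_000);
        generator.onEvent("order-1", 4_000);
        generator.onEvent("order-2", 2_000);      // out of order, ignored for the maximum
        System.out.println(generator.onPeriodicEmit());
    }
}
```

The generator only tracks the largest timestamp seen, which is why an out-of-order event such as order-2 leaves the watermark unchanged.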

Flink state management

  • State is the intermediate result kept per key or per operator
  • Flink state is divided into two types: managed state and raw state
  • Managed state is divided into two types:
    1. Keyed state: state bound to a key. Supported data structures: ValueState, ListState, MapState
    2. Operator state: state bound to an operator subtask. Supported data structures: ListState, BroadcastState
  • Raw state is seen by Flink only as a byte array
Flink keyed state case

Flink operator state case

              //[The beginning of this listing is missing from the source; the fragment resumes mid-statement]
              IndexOfThisSubtask();
              System.out.println("index: "+idx+" offset:"+offset);
              Thread.sleep(1000);
              //Throw an exception whenever the offset is a multiple of 5 to simulate a program failure
              if (offset % 5 == 0) {
                  System.out.println("there is an error in the current program...");
                  throw new Exception("program BUG...");
              }
          }
      }

      //Override the cancel method
      @Override
      public void cancel() {
          flag = false;
      }

      //Override the snapshotState method: clear offsetState, then add the latest offset
      @Override
      public void snapshotState(FunctionSnapshotContext context) throws Exception {
          offsetState.clear();
          offsetState.add(offset);
      }
  }

}

Posted by valentin on Tue, 07 Dec 2021 00:18:57 -0800