Flink Hands-on Tutorial for Beginners: How to Compute Real-Time Hot Items

In the previous introductory tutorial we built a basic Flink program. This article walks you step by step through a more sophisticated Flink application: real-time hot items. Before starting, we recommend working through the previous article first, because this one continues with the my-flink-project project skeleton described there.

Through this article you will learn:

  • How to process data based on EventTime and how to specify Watermarks
  • How to use Flink's flexible Window API
  • When and how to use State
  • How to implement a TopN function with ProcessFunction

Introduction to the practical case

The requirement "real-time hot items" can be translated into a need programmers understand better: every five minutes, output the top N items with the most clicks in the past hour. Breaking this requirement down, we need to:

  • Extract the business timestamp and tell the Flink framework to window by business time
  • Filter out the click behavior data
  • Aggregate over a sliding window of size one hour that slides every 5 minutes
  • Aggregate within each window and output the top N most-clicked items of each window

Data preparation

Here we have prepared a Taobao user behavior data set (from the Alibaba Cloud Tianchi public data sets, with special thanks). This data set contains all the behaviors (clicks, purchases, add-to-cart, favorites) of a random sample of one million Taobao users on one day. The data set is organized similarly to MovieLens-20M: each row represents one user behavior, consisting of user ID, item ID, item category ID, behavior type, and timestamp, separated by commas. The columns are described in detail below:

Column              Description
User ID             Integer, encrypted user ID
Item ID             Integer, encrypted item ID
Item category ID    Integer, encrypted ID of the category the item belongs to
Behavior type       String, enumeration type, one of ('pv', 'buy', 'cart', 'fav')
Timestamp           Timestamp at which the behavior occurred, in seconds
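
Each row is therefore a comma-separated record of the following form (placeholders, not actual data):

<userId>,<itemId>,<categoryId>,<behavior>,<timestamp>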

You can download the data set into the project's resources directory with the following commands:

$ cd my-flink-project/src/main/resources
$ curl https://raw.githubusercontent.com/wuchong/my-flink-project/master/src/main/resources/UserBehavior.csv > UserBehavior.csv

It does not matter that curl is used to download the data here; you can just as well use wget or download it directly via the link. The key is to save the data file into the project's resources directory so that the application can access it.
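
For example, the equivalent wget command:

$ wget -O UserBehavior.csv https://raw.githubusercontent.com/wuchong/my-flink-project/master/src/main/resources/UserBehavior.csv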

Programming

Create the HotItems.java file under src/main/java/myflink:

package myflink;

public class HotItems {

  public static void main(String[] args) throws Exception {

  }
}

Starting from the skeleton above, we will fill in the code step by step. The first step is, as before, to create a StreamExecutionEnvironment, which we add to the main function:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// So that the results print to the console without interleaving, we set the global parallelism to 1; changing the parallelism has no effect on the correctness of the results.
env.setParallelism(1);

Creating a simulated data source

In the data preparation section we downloaded the test data set locally. Since it is a CSV file, we will use PojoCsvInputFormat to create a simulated data source.

Note: a streaming application should really be a continuously running program consuming an unbounded data source. In this tutorial, however, we use a file to simulate a real data source, to spare ourselves the tedious construction of one; this does not affect any of the concepts covered below. It is also a common way to verify the correctness of a Flink application locally.

We first create a POJO class UserBehavior (Flink treats a class with a public no-argument constructor whose fields are all public, or accessible via getters and setters, as a POJO type), so that subsequent processing is strongly typed and more convenient.

/** User behavior data structure. */
public static class UserBehavior {
  public long userId;         // User ID
  public long itemId;         // Item ID
  public int categoryId;      // Item category ID
  public String behavior;     // User behavior type, one of ("pv", "buy", "cart", "fav")
  public long timestamp;      // Timestamp at which the behavior occurred, in seconds
}

Next we create a PojoCsvInputFormat, which reads the CSV file and converts each line into an instance of the specified POJO type (UserBehavior in our case).

// Local file path of UserBehavior.csv
URL fileUrl = HotItems.class.getClassLoader().getResource("UserBehavior.csv");
Path filePath = Path.fromLocalFile(new File(fileUrl.toURI()));
// Extract the type information of UserBehavior; it is a PojoTypeInfo
PojoTypeInfo<UserBehavior> pojoType = (PojoTypeInfo<UserBehavior>) TypeExtractor.createTypeInfo(UserBehavior.class);
// The order of fields extracted by Java reflection is not deterministic, so the order of the fields in the file must be specified explicitly
String[] fieldOrder = new String[]{"userId", "itemId", "categoryId", "behavior", "timestamp"};
// Create the PojoCsvInputFormat
PojoCsvInputFormat<UserBehavior> csvInput = new PojoCsvInputFormat<>(filePath, pojoType, fieldOrder);

Next we create the input source with PojoCsvInputFormat.

DataStream<UserBehavior> dataSource = env.createInput(csvInput, pojoType);

This creates a UserBehavior-type DataStream.

EventTime and Watermark

When we say "count the clicks in the past hour", what does "one hour" mean here? In Flink it can mean either Processing Time or Event Time; which one is up to the user.

  • Processing Time: the time at which an event is processed, determined by the system clock of the machine.
  • Event Time: the time at which the event occurred, usually the timestamp carried in the data itself.

In this case we need to count clicks per hour of business time, so we have to process based on EventTime. How do we make Flink work in terms of the business time we want? There are two main steps.

The first is telling Flink that we now process in EventTime mode. Flink defaults to Processing Time, so we have to set this explicitly:

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

The second is specifying how to obtain the business time and how to generate Watermarks. A Watermark is the mechanism used to track the progress of event time; it can be understood as the clock of the EventTime world, indicating up to what time data has currently been processed. Since the data in our source has already been cleaned and contains no out-of-order records, i.e. the event timestamps are monotonically increasing, the business time of each record can serve directly as the Watermark. Here we use an AscendingTimestampExtractor to extract the timestamps and generate Watermarks.

Note: real business scenarios are usually out of order, in which case BoundedOutOfOrdernessTimestampExtractor is commonly used; a sketch of it follows the code below.

DataStream<UserBehavior> timedData = dataSource
    .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<UserBehavior>() {
      @Override
      public long extractAscendingTimestamp(UserBehavior userBehavior) {
        // The raw data is in seconds; convert to milliseconds
        return userBehavior.timestamp * 1000;
      }
    });

This gives us a data stream with timestamps attached, on which we can perform window operations.
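
For reference, a minimal sketch of the out-of-order variant mentioned in the note above; the 5-second bound is an assumption chosen only for illustration:

DataStream<UserBehavior> timedData = dataSource
    .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<UserBehavior>(Time.seconds(5)) {
          @Override
          public long extractTimestamp(UserBehavior userBehavior) {
            // The Watermark trails the largest timestamp seen so far by 5 seconds
            return userBehavior.timestamp * 1000;
          }
        });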

Filter out click events

Before starting the window operation, recall the requirement: "every five minutes, output the top N items with the most clicks in the past hour". The raw data contains click, add-to-cart, purchase, and favorite events, but we only need to count clicks, so we first use a FilterFunction to filter the data down to the click behavior.

DataStream<UserBehavior> pvData = timedData
    .filter(new FilterFunction<UserBehavior>() {
      @Override
      public boolean filter(UserBehavior userBehavior) throws Exception {
        // Keep only the click ("pv") behavior data
        return userBehavior.behavior.equals("pv");
      }
    });
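
With Java 8 lambdas, an equivalent and more compact way to write the same filter:

DataStream<UserBehavior> pvData = timedData
    .filter(userBehavior -> "pv".equals(userBehavior.behavior));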

Counting clicks with windows

Since we count every item's clicks over the past hour once every five minutes, the window size is one hour and it slides forward every five minutes. That is, we compute the item clicks of the windows [09:00, 10:00), [09:05, 10:05), [09:10, 10:10), and so on. This is a typical sliding window (Sliding Window) requirement.

DataStream<ItemViewCount> windowedData = pvData
    .keyBy("itemId")
    .timeWindow(Time.minutes(60), Time.minutes(5))
    .aggregate(new CountAgg(), new WindowResultFunction());

We use .keyBy("itemId") to group by item and .timeWindow(Time size, Time slide) to create a sliding window for each item (a 1-hour window sliding every 5 minutes). We then use .aggregate(AggregateFunction af, WindowFunction wf) to perform incremental aggregation: the AggregateFunction aggregates records as they arrive, which reduces the state that has to be stored. This is much more efficient than .apply(WindowFunction wf), which buffers all records of the window and only computes when the window fires; a sketch of that variant follows below for contrast. The first parameter of the aggregate() method performs the incremental aggregation; the second combines the aggregated result with window metadata for output.
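
For contrast only, a minimal sketch of the non-incremental .apply() variant, which keeps every UserBehavior of the window in state and counts them all when the window fires:

DataStream<ItemViewCount> windowedData = pvData
    .keyBy("itemId")
    .timeWindow(Time.minutes(60), Time.minutes(5))
    .apply(new WindowFunction<UserBehavior, ItemViewCount, Tuple, TimeWindow>() {
      @Override
      public void apply(Tuple key, TimeWindow window,
                        Iterable<UserBehavior> input,
                        Collector<ItemViewCount> out) throws Exception {
        long count = 0;
        for (UserBehavior ignored : input) {
          count++;  // every buffered record of the window is touched here
        }
        out.collect(ItemViewCount.of(((Tuple1<Long>) key).f0, window.getEnd(), count));
      }
    });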

CountAgg here implements the AggregateFunction interface. It counts the number of records in the window: the accumulator is incremented by one for each record encountered.

/** An aggregate function implementing COUNT: adds one for each record. */
public static class CountAgg implements AggregateFunction<UserBehavior, Long, Long> {

  @Override
  public Long createAccumulator() {
    return 0L;
  }

  @Override
  public Long add(UserBehavior userBehavior, Long acc) {
    return acc + 1;
  }

  @Override
  public Long getResult(Long acc) {
    return acc;
  }

  @Override
  public Long merge(Long acc1, Long acc2) {
    return acc1 + acc2;
  }
}

The second parameter of .aggregate(AggregateFunction af, WindowFunction wf), the WindowFunction, combines the aggregated result of each key's window with additional information for output. The WindowResultFunction we implement here wraps the key (the item ID), the window, and the click count into an ItemViewCount for output.

/** Output window results */
public static class WindowResultFunction implements WindowFunction<Long, ItemViewCount, Tuple, TimeWindow> {

  @Override
  public void apply(
      Tuple key,  // The primary key of the window, itemId
      TimeWindow window,  // window
      Iterable<Long> aggregateResult, // The result of the aggregation function, the count value
      Collector<ItemViewCount> collector  // The output type is ItemViewCount
  ) throws Exception {
    Long itemId = ((Tuple1<Long>) key).f0;
    Long count = aggregateResult.iterator().next();
    collector.collect(ItemViewCount.of(itemId, window.getEnd(), count));
  }
}

/** Commodity clicks (output type of window operation) */
public static class ItemViewCount {
  public long itemId;     // Commodity ID
  public long windowEnd;  // Window end timestamp
  public long viewCount;  // Number of clicks on the item

  public static ItemViewCount of(long itemId, long windowEnd, long viewCount) {
    ItemViewCount result = new ItemViewCount();
    result.itemId = itemId;
    result.windowEnd = windowEnd;
    result.viewCount = viewCount;
    return result;
  }
}

Now we have a data stream of each item's click count in each window.

TopN: computing the hottest items

To rank the hottest items within each window, we need to group by window again; here we keyBy() on the windowEnd field of ItemViewCount. We then implement a custom TopN function TopNHotItems with ProcessFunction to compute the top 3 items by click count, formatting the ranking result into a string for later output.

DataStream<String> topItems = windowedData
    .keyBy("windowEnd")
    .process(new TopNHotItems(3));  // Goods for Top 3 Click-throughs

ProcessFunction is a low-level API that Flink provides for more advanced functionality. What we mainly need from it here is its timer facility (supporting both EventTime and Processing Time). In this case we use timers to determine when the click counts of all items for a given window have been collected, which works because the progress of the Watermark is global.

In the processElement method, whenever we receive a record (an ItemViewCount), we register an EventTime timer for windowEnd + 1 (the Flink framework automatically ignores repeated registrations for the same timestamp). When the windowEnd + 1 timer fires, it means the Watermark for windowEnd + 1 has been received, i.e. the statistics of all item windows ending at windowEnd have been collected. In onTimer() we then sort all the collected items by click count, take the top N, and format the ranking information into a string for output.

Here we also use ListState<ItemViewCount> to store every ItemViewCount we receive, which guarantees that the state is not lost and remains consistent in case of failure. ListState is Flink's state API resembling the Java List interface; it integrates with the framework's checkpoint mechanism and automatically provides exactly-once guarantees.

/** Finds the top N most-clicked items in a window; the key is the window end timestamp, and the output is the TopN result string. */
public static class TopNHotItems extends KeyedProcessFunction<Tuple, ItemViewCount, String> {

  private final int topSize;

  public TopNHotItems(int topSize) {
    this.topSize = topSize;
  }

  // State holding the items and their click counts; it collects the data of a window until the TopN calculation is triggered.
  private ListState<ItemViewCount> itemState;

  @Override
  public void open(Configuration parameters) throws Exception {
    super.open(parameters);
    // Register the state
    ListStateDescriptor<ItemViewCount> itemsStateDesc = new ListStateDescriptor<>(
        "itemState-state",
        ItemViewCount.class);
    itemState = getRuntimeContext().getListState(itemsStateDesc);
  }

  @Override
  public void processElement(
      ItemViewCount input,
      Context context,
      Collector<String> collector) throws Exception {

    // Every piece of data is saved in state.
    itemState.add(input);
    // Register an EventTime timer for windowEnd + 1; when it fires, all item data belonging to the window ending at windowEnd has been collected
    context.timerService().registerEventTimeTimer(input.windowEnd + 1);
  }

  @Override
  public void onTimer(
      long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
    // Get clicks on all items received
    List<ItemViewCount> allItems = new ArrayList<>();
    for (ItemViewCount item : itemState.get()) {
      allItems.add(item);
    }
    // Clean up data in state ahead of time and release space
    itemState.clear();
    // Sort the clicks from big to small
    allItems.sort(new Comparator<ItemViewCount>() {
      @Override
      public int compare(ItemViewCount o1, ItemViewCount o2) {
        // Descending by click count; Long.compare avoids the overflow risk of casting a long difference to int
        return Long.compare(o2.viewCount, o1.viewCount);
      }
    });
    // Format the ranking information into a String for easy printing
    StringBuilder result = new StringBuilder();
    result.append("====================================\n");
    result.append("time: ").append(new Timestamp(timestamp - 1)).append("\n");
    for (int i = 0; i < topSize && i < allItems.size(); i++) {
      ItemViewCount currentItem = allItems.get(i);
      // e.g. No1:  commodity ID=12224  Browsing volume=2413
      result.append("No").append(i + 1).append(":")
            .append("  commodity ID=").append(currentItem.itemId)
            .append("  Browsing volume=").append(currentItem.viewCount)
            .append("\n");
    }
    result.append("====================================\n\n");

    out.collect(result.toString());
  }
}

Print Output

In the last step we print the results to the console and call env.execute to launch the job:

topItems.print();
env.execute("Hot Items Job");

Running the program

Run the main function directly and you will see the hot item rankings printed continuously for each point in time.
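
For each window the output has the following shape (the concrete IDs and counts depend on the data set; the values below are placeholders):

====================================
time: <window end time>
No1:  commodity ID=<itemId>  Browsing volume=<count>
No2:  commodity ID=<itemId>  Browsing volume=<count>
No3:  commodity ID=<itemId>  Browsing volume=<count>
====================================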

Summary

The full code of this article is available on GitHub. By implementing a real-time hot items case, this article has walked through Flink's core concepts and API usage, including EventTime and Watermark, State, the Window API, and a TopN implementation. I hope it deepens your understanding of Flink and helps you solve the problems you encounter in practice.

Visit GitHub for the complete code: https://github.com/wuchong/my-flink-project/blob/master/src/main/java/myflink/HotItems.java
