Flink in Practice: Counting Website PV and UV


PV, UV

  • PV (Page View): the total number of page views, i.e. page clicks.
  • UV (Unique Visitor): the number of distinct users who visited.

Assume the following requirement: at one-minute intervals, compute the UV and PV of the last 5 minutes. It is easy to see that a database could produce the correct results with count and count distinct. However, with large amounts of data, counting in a traditional database or in Hadoop (HBase, ...) is inefficient. If the data arrives incrementally, streaming computation tends to provide higher throughput and lower latency.

Next, we implement this functionality with Flink; along the way, the case illustrates some of Flink's basic concepts. If you are familiar with other stream-processing frameworks, you will see that many of the concepts carry over.

Window

It is easy to see what is needed here: cache 5 minutes of data in memory, slide forward one minute at a time, produce a count, and clean up the expired data.

Flink provides multiple types of windows that can be selected on demand.
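
For this case, a sliding window fits: 5 minutes wide, advancing every minute. The sketch below is not from the original post; `keyedStream` is a placeholder for any keyed stream, and both assigners come from org.apache.flink.streaming.api.windowing.assigners:

    // Tumbling: non-overlapping 5-minute buckets; each event is counted exactly once.
    keyedStream.window(TumblingEventTimeWindows.of(Time.minutes(5)));

    // Sliding: 5-minute buckets advancing every minute; each event falls into
    // 5 overlapping windows -- this matches "count the last 5 minutes, every minute".
    keyedStream.window(SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1)));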

Event Time

Given network latency and out-of-order data, it is not appropriate to do the statistics with Flink's system time. For example, data stamped 14:25 may not arrive in the system until 14:27. If Flink's system time were used directly, it would distort the results for both the 14:20 to 14:25 period and the 14:25 to 14:30 period.

In Flink, there are three time characteristics (see the official documentation for details):

  • Processing time: the time at which an operator processes the data.
  • Event time: the time at which the event actually occurred.
  • Ingestion time: the time at which the event enters Flink.

When counting PV and UV, we need event time, based on the time of the user's visit.

    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Use event time rather than processing time for all time-based operations.
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

Next, we need to tell Flink the record's real event timestamp, as well as the watermark that triggers window evaluation. In Flink, this is done by implementing the interface AssignerWithPeriodicWatermarks.

Since the data may arrive out of order, we choose the built-in BoundedOutOfOrdernessTimestampExtractor (an implementation of AssignerWithPeriodicWatermarks):

      // Accept events that arrive at most 3.5 seconds late; the watermark trails
      // the highest timestamp seen so far by this amount.
      long MAX_EVENT_DELAY = 3500;
      BoundedOutOfOrdernessTimestampExtractor<String> assigner = new BoundedOutOfOrdernessTimestampExtractor<String>(Time.milliseconds(MAX_EVENT_DELAY)) {
            @Override
            public long extractTimestamp(String element) {
                VisitEvent visitEvent = null;
                try {
                    // objectMapper is a pre-built Jackson ObjectMapper field.
                    visitEvent = objectMapper.readValue(element, VisitEvent.class);
                    return visitEvent.getVisitTime();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                // Fall back to the current time if the record cannot be parsed.
                return Instant.now().toEpochMilli();
            }
        };

The code above mostly deals with time handling; the real computation is done through windows, as shown below.

        // Key positions in the Tuple3: field 0 is the url, field 2 is the userId.
        int[] arr = {0, 2};
        FlinkKafkaConsumerBase<String> consumerWithEventTime = myConsumer.assignTimestampsAndWatermarks(assigner);
        // Explicit type information, since generics are erased at compile time (see below).
        TypeInformation<Tuple3<String, VisitEvent, String>> typeInformation = TypeInformation.of(new TypeHint<Tuple3<String, VisitEvent, String>>() {});
        DataStreamSource<String> dataStreamByEventTime = env.addSource(consumerWithEventTime);
        SingleOutputStreamOperator<UrlVisitBy> uvCounter = dataStreamByEventTime
                .map(str -> objectMapper.readValue(str, VisitEvent.class))
                .map(visitEvent -> new Tuple3<>(visitEvent.getVisitUrl(), visitEvent, visitEvent.getVisitUserId()))
                .returns(typeInformation)
                .keyBy(arr)
                // Event-time sliding window, 5 minutes wide, advancing every minute.
                // A processing-time window would ignore the timestamps and watermarks
                // assigned above; no timezone offset is needed at minute granularity.
                .window(SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1)))
                // Keep a window open one extra minute so late events are still counted.
                .allowedLateness(Time.minutes(1))
                .process(new ProcessWindowFunction<Tuple3<String, VisitEvent, String>, UrlVisitBy, Tuple, TimeWindow>() {
                    @Override
                    public void process(Tuple tuple, Context context, Iterable<Tuple3<String, VisitEvent, String>> elements, Collector<UrlVisitBy> out) throws Exception {
                        long count = 0;
                        Tuple2<String, String> tuple2 = null;
                        if (tuple instanceof Tuple2) {
                            // The key has two fields (url, userId), so it arrives as a Tuple2.
                            tuple2 = (Tuple2) tuple;
                        }
                        // Count every visit in this window for the current (url, user) key.
                        for (Tuple3<String, VisitEvent, String> element : elements) {
                            count++;
                        }
                        TimeWindow window = context.window();
                        out.collect(new UrlVisitBy(window.getStart(), window.getEnd(), tuple2.f0, count, tuple2.f1));
                    }
                });
        uvCounter.print();
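
The snippets above assume two POJOs that the original post does not show. A minimal sketch consistent with the getters and the constructor used above (the field names are assumptions):

    // Assumed shape of the JSON event; Jackson and Flink's POJO serializer both
    // need a no-arg constructor plus accessible fields or getters.
    public class VisitEvent {
        public String visitUrl;    // page that was visited
        public String visitUserId; // id of the visiting user
        public long visitTime;     // visit timestamp, epoch milliseconds

        public VisitEvent() {}

        public String getVisitUrl() { return visitUrl; }
        public String getVisitUserId() { return visitUserId; }
        public long getVisitTime() { return visitTime; }
    }

    // Assumed result type: one record per (url, user) key and window.
    public class UrlVisitBy {
        public long windowStart;
        public long windowEnd;
        public String url;
        public long count;
        public String userId;

        public UrlVisitBy() {}

        public UrlVisitBy(long windowStart, long windowEnd, String url, long count, String userId) {
            this.windowStart = windowStart;
            this.windowEnd = windowEnd;
            this.url = url;
            this.count = count;
            this.userId = userId;
        }
    }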

returns(typeInformation) is recommended

Since the default JDK compiler erases generic type information during compilation, Flink cannot obtain enough information at execution time to infer the real type, so you may encounter the error "The generic type parameters of 'XXX' are missing".

Currently, only the Eclipse JDT compiler preserves enough information after compilation, but relying on it would restrict developers to compiling and debugging with Eclipse. Moreover, due to compatibility issues, Eclipse's support for Flink is not very good; IntelliJ IDEA is the officially recommended IDE.

To be free of compiler limitations, Flink instead lets you pass a TypeInformation that tells it what the real type is.
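
For example, the Tuple3 produced by the map above would otherwise lose its type parameters. A minimal sketch reusing names from the pipeline above (`stream` stands for the parsed DataStream of VisitEvent):

    // The anonymous TypeHint subclass keeps the full generic signature in its
    // superclass declaration, so Flink can recover it despite erasure.
    TypeInformation<Tuple3<String, VisitEvent, String>> info =
            TypeInformation.of(new TypeHint<Tuple3<String, VisitEvent, String>>() {});

    stream.map(e -> new Tuple3<>(e.getVisitUrl(), e, e.getVisitUserId()))
          .returns(info);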
