Guiding questions
1. Which operations convert two Flink data streams into a single data stream?
2. What do cogroup, join, and coflatmap each do?
3. How do cogroup, join, and coflatmap differ?
Flink offers four operations for turning two data streams into one: join, cogroup, coflatmap, and union. The first three are compared here; union is covered in the summary.
- Join: outputs only the element pairs that match the join condition within the same window.
- CoGroup: in addition to matched elements, unmatched elements are also delivered. The function is called once per key and window with all elements from both streams, and either side may be empty.
- CoFlatMap: there is no matching condition and no built-in matching; the elements of the two streams are processed separately by two callbacks. With keyed state you can implement the behavior of join and cogroup yourself, which makes it more flexible than either.
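To make the difference between join and cogroup concrete, here is a minimal plain-Java sketch (no Flink dependency) that models what each one emits for the contents of a single window. The class `JoinVsCoGroupSketch`, its methods, and the `{key, value}` array encoding are illustrative assumptions, not Flink API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of ONE window's contents; every name here is hypothetical.
class JoinVsCoGroupSketch {

    // Holds the values seen for one key on each side.
    static class Group {
        final List<String> left = new ArrayList<>();
        final List<String> right = new ArrayList<>();
    }

    // join: emit one result per matching (left, right) pair.
    // Each element is modelled as a {key, value} array.
    static List<String> join(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        for (String[] l : left) {
            for (String[] r : right) {
                if (l[0].equals(r[0])) {
                    out.add(l[0] + ":" + l[1] + "," + r[1]);
                }
            }
        }
        return out; // elements without a partner never appear
    }

    // coGroup: emit one result per key, with all values from both sides.
    static List<String> coGroup(List<String[]> left, List<String[]> right) {
        Map<String, Group> groups = new TreeMap<>(); // sorted only to keep output stable
        for (String[] l : left) {
            groups.computeIfAbsent(l[0], k -> new Group()).left.add(l[1]);
        }
        for (String[] r : right) {
            groups.computeIfAbsent(r[0], k -> new Group()).right.add(r[1]);
        }
        List<String> out = new ArrayList<>();
        // A key is reported even when one of its two lists is empty.
        groups.forEach((key, g) -> out.add(key + ":" + g.left + "," + g.right));
        return out;
    }
}
```

With grades `(a,1), (b,2)` and salaries `(a,10), (c,30)`, `join` yields a result only for key `a`, while `coGroup` yields one result for each of `a`, `b`, and `c`, with an empty list on the side that had no element.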
Example join code:
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

private static DataStream<PositionJoinModel> PositionTestJoin(
        DataStream<ZongShu> grades,
        DataStream<ZongShu> salaries,
        long windowSize) {
    DataStream<PositionJoinModel> apply = grades.join(salaries)
            // Join condition: the key from stream 1 equals the key from stream 2
            .where(new partitionsKeySelector1())
            .equalTo(new partitionsKeySelector1())
            // Window that elements of both streams are assigned to;
            // only elements in the same window are joined
            .window(TumblingProcessingTimeWindows.of(Time.milliseconds(windowSize)))
            .apply(new JoinFunction<ZongShu, ZongShu, PositionJoinModel>() {
                // Receives each matching pair (first, second); assemble the result here
                @Override
                public PositionJoinModel join(ZongShu first, ZongShu second) {
                    return new PositionJoinModel(first.getRoom(), first.getPartitions(),
                            first.getNum(), second.getNum());
                }
            });
    return apply;
}
Example cogroup code:
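import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

private static DataStream<YCSB_LB_RESULT_Model> YCLB_Result_CGroup(
        DataStream<YCSB_LB_Model> grades,
        DataStream<YCSB_LB_Model> salaries,
        long windowSize) {
    DataStream<YCSB_LB_RESULT_Model> apply = grades.coGroup(salaries)
            .where(new YCFB_Result_KeySelector())
            .equalTo(new YCFB_Result_KeySelector())
            .window(TumblingProcessingTimeWindows.of(Time.milliseconds(windowSize)))
            .apply(new CoGroupFunction<YCSB_LB_Model, YCSB_LB_Model, YCSB_LB_RESULT_Model>() {
                // Called once per key and window; either Iterable may be empty
                @Override
                public void coGroup(Iterable<YCSB_LB_Model> first,
                        Iterable<YCSB_LB_Model> second,
                        Collector<YCSB_LB_RESULT_Model> collector) throws Exception {
                    // Local result object, so nothing leaks between calls
                    YCSB_LB_RESULT_Model ylrm = new YCSB_LB_RESULT_Model();
                    // Descriptive fields and the level-1 value come from the first stream
                    for (YCSB_LB_Model s : first) {
                        ylrm.setAsset_id(s.getAsset_id());
                        ylrm.setName(s.getName());
                        ylrm.setIp(s.getIp());
                        ylrm.setRoom(s.getRoom());
                        ylrm.setPartitions(s.getPartitions());
                        ylrm.setBox(s.getBox());
                        ylrm.setLevel_1(s.getNum());
                    }
                    // The level-2 value comes from the second stream
                    for (YCSB_LB_Model s1 : second) {
                        ylrm.setLevel_2(s1.getNum());
                    }
                    // Emitted even when one side had no elements for this key
                    collector.collect(ylrm);
                }
            });
    return apply;
}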
Example coflatmap code:
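import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

// env and rate are assumed to be defined by the surrounding job setup
DataStream<Tuple2<String, Integer>> grades = WindowJoinSampleData.GradeSource.getSource(env, rate);
DataStream<Tuple2<String, Integer>> salaries = WindowJoinSampleData.SalarySource.getSource(env, rate);

// Key both streams by the first tuple field so that they share keyed state
KeyedStream<Tuple2<String, Integer>, Tuple> tuple2TupleKeyedStream = grades.keyBy(0);
KeyedStream<Tuple2<String, Integer>, Tuple> tuple2TupleKeyedStream1 = salaries.keyBy(0);

SingleOutputStreamOperator<Tuple3<String, Integer, Integer>> tuple3SingleOutputStreamOperator =
        tuple2TupleKeyedStream
                .connect(tuple2TupleKeyedStream1)
                .flatMap(new EnrichmentFunction());

public static class EnrichmentFunction extends
        RichCoFlatMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple3<String, Integer, Integer>> {

    // Keyed, managed state: "ride" buffers stream 1, "fare" buffers stream 2
    private ValueState<Tuple2<String, Integer>> rideState;
    private ValueState<Tuple2<String, Integer>> fareState;

    @Override
    public void open(Configuration config) {
        rideState = getRuntimeContext().getState(new ValueStateDescriptor<>("saved ride",
                TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() { })));
        fareState = getRuntimeContext().getState(new ValueStateDescriptor<>("saved fare",
                TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() { })));
    }

    // Element from stream 1: emit if the partner is already buffered, else buffer it
    @Override
    public void flatMap1(Tuple2<String, Integer> ride,
            Collector<Tuple3<String, Integer, Integer>> out) throws Exception {
        Tuple2<String, Integer> fare = fareState.value();
        if (fare != null) {
            fareState.clear();
            out.collect(new Tuple3<>(ride.f0, ride.f1, fare.f1));
        } else {
            rideState.update(ride);
        }
    }

    // Element from stream 2: the mirror image of flatMap1
    @Override
    public void flatMap2(Tuple2<String, Integer> fare,
            Collector<Tuple3<String, Integer, Integer>> out) throws Exception {
        Tuple2<String, Integer> ride = rideState.value();
        if (ride != null) {
            rideState.clear();
            out.collect(new Tuple3<>(ride.f0, ride.f1, fare.f1));
        } else {
            fareState.update(fare);
        }
    }
}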
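The matching logic in the coflatmap example can be traced without a cluster. The following plain-Java sketch (no Flink dependency) replaces the two keyed `ValueState` fields with ordinary maps; `CoFlatMapSketch`, `onGrade`, and `onSalary` are hypothetical names chosen for illustration, and the maps are only a stand-in for Flink's per-key state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the two-state matching pattern; all names are hypothetical.
class CoFlatMapSketch {
    // One buffered value per key and side, standing in for keyed ValueState.
    private final Map<String, Integer> gradeState = new HashMap<>();
    private final Map<String, Integer> salaryState = new HashMap<>();

    // Collected output, formatted as "key,grade,salary".
    final List<String> out = new ArrayList<>();

    // Counterpart of flatMap1: an element of the first stream arrives.
    void onGrade(String key, int grade) {
        Integer salary = salaryState.remove(key); // read and clear in one step
        if (salary != null) {
            out.add(key + "," + grade + "," + salary);
        } else {
            gradeState.put(key, grade);           // wait for the other side
        }
    }

    // Counterpart of flatMap2: an element of the second stream arrives.
    void onSalary(String key, int salary) {
        Integer grade = gradeState.remove(key);
        if (grade != null) {
            out.add(key + "," + grade + "," + salary);
        } else {
            salaryState.put(key, salary);
        }
    }
}
```

Unlike join and cogroup, nothing here depends on a window: an element waits in state until its partner arrives, whichever side comes first. A second element for the same key that arrives before the partner overwrites the buffered one, as `update` does in the Flink version.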
Summary
Although union can merge multiple data streams, it requires all of them to have the same data type. connect offers a similar way to combine two data streams. The differences between connect and union are as follows:
- connect can only combine two data streams, while union can combine more than two.
- The two data streams combined by connect may have different data types, while the data streams combined by union must all have the same type.
- connect turns the two data streams into a ConnectedStreams, whereas union returns an ordinary DataStream. A function applied to ConnectedStreams has a separate processing method for each stream's data, and state can be shared between the two streams.
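The contrast in the list above can be modeled in a few lines of plain Java (no Flink dependency). In this sketch, `union` concatenates any number of same-typed lists, and `LengthFilter` plays the role of a function applied to connected streams: one handler per input type, both touching the same field. All names are hypothetical, and concatenation is a simplification, since in a real job the elements of the inputs arrive interleaved.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of union versus connect; all names are hypothetical.
class UnionVsConnectSketch {

    // union: any number of inputs, but every input has the same element type T.
    @SafeVarargs
    static <T> List<T> union(List<T>... streams) {
        List<T> out = new ArrayList<>();
        for (List<T> s : streams) {
            out.addAll(s);
        }
        return out;
    }

    // connect: exactly two inputs whose types may differ (here int and String).
    // Each input has its own handler, and both handlers share minLength.
    static class LengthFilter {
        private int minLength = 0;                 // state shared by both handlers
        final List<String> out = new ArrayList<>();

        // Handler for the first input: update the shared threshold.
        void onThreshold(int newMinLength) {
            minLength = newMinLength;
        }

        // Handler for the second input: keep words that meet the threshold.
        void onWord(String word) {
            if (word.length() >= minLength) {
                out.add(word);
            }
        }
    }
}
```

Feeding `LengthFilter` the events `"hi"`, threshold `4`, `"hey"`, `"hello"` in that order keeps `"hi"` and `"hello"`: the threshold set through one handler changes what the other handler lets through.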