Flink Stream API - Operators

Keywords: Windows REST network

Operators convert one or more data streams into a new data stream.Programs can combine multiple transformations into complex data flow topologies.

Describes basic transformations, valid physical partitions after applying these transformations, and an understanding of Flink operator links.

DataStream Transformations

Conversion Method

Description and code samples

Map
DataStream → DataStream

Gets an element and generates an element.A map function, for example, doubling the value of an input stream:

DataStream<Integer> dataStream = //...
dataStream.map(new MapFunction<Integer, Integer>() {
    @Override
    public Integer map(Integer value) throws Exception {
        return 2 * value;
    }
});
    

 

FlatMap
DataStream → DataStream

Gets an element and generates zero, one, or more elements.A planar mapping function that splits a sentence into words:

dataStream.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String value, Collector<String> out)
        throws Exception {
        for(String word: value.split(" ")){
            out.collect(word);
        }
    }
});
    

 

Filter
DataStream → DataStream

Calculates a Boolean function for each element, preserving the elements whose function returns true.Filters that filter out zero values:

dataStream.filter(new FilterFunction<Integer>() {
    @Override
    public boolean filter(Integer value) throws Exception {
        return value != 0;
    }
});
    

 

KeyBy
DataStream → KeyedStream

Logically divide the flow into disjoint partitions.All records with the same key are assigned to the same partition.Internally, keyBy() is implemented through a hash partition.There are different ways to specify keys.

This transition returns a KeyedStream, which is required to use the key state.

dataStream.keyBy("someKey") // Key by field "someKey"
dataStream.keyBy(0) // Key by the first element of a Tuple
    

Note: The following types cannot be used as key s:

  1. POJO objects that do not override hashCode rely on Object.hashCode() implementations
  2. Arrays of any type
Reduce
KeyedStream → DataStream

Combines the current element with the last reduction value and returns a new value based on keyBy().

keyedStream.reduce(new ReduceFunction<Integer>() {
    @Override
    public Integer reduce(Integer value1, Integer value2)
    throws Exception {
        return value1 + value2;
    }
});
            

 

Fold
KeyedStream → DataStream

Combines the current element with the last collapsed value and returns a new value based on keyBy().

When applied to a sequence (1,2,3,4,5), the sequence "start-1", "start-1-2", "start-1-2-3",...

DataStream<String> result =
  keyedStream.fold("start", new FoldFunction<Integer, String>() {
    @Override
    public String fold(String current, Integer value) {
        return current + "-" + value;
    }
  });
          

 

Aggregations
KeyedStream → DataStream

Based on keyBy(), min and minBy differ in that Min returns the minimum value, while minBy returns the element with the minimum value in the field (the same is true for max and maxBy).

keyedStream.sum(0);
keyedStream.sum("key");
keyedStream.min(0);
keyedStream.min("key");
keyedStream.max(0);
keyedStream.max("key");
keyedStream.minBy(0);
keyedStream.minBy("key");
keyedStream.maxBy(0);
keyedStream.maxBy("key");

 

Window
KeyedStream → WindowedStream

Windows can be defined on a partitioned KeyedStreams.Windows groups data in each key based on certain characteristics, such as data that has arrived in the last five seconds.Window Details//TODO

dataStream.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(5))); // Last 5 seconds of data
    

 

WindowAll
DataStream → AllWindowedStream

It can be defined on a regular data stream.Windows groups all stream events based on some characteristics, such as data arrived in the last five seconds.

Note: In most cases, this is a non-parallel conversion.All records will be collected in one task of the windowAll operator.

dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5))); // Last 5 seconds of data
  

 

Window Apply
WindowedStream → DataStream
AllWindowedStream → DataStream

Apply a generic function to the entire window.Below is a function that manually summarizes window elements.

Note: If you are using the windowAll conversion, you need to use the AllWindowFunction.

windowedStream.apply (new WindowFunction<Tuple2<String,Integer>, Integer, Tuple, Window>() {
    public void apply (Tuple tuple,
            Window window,
            Iterable<Tuple2<String, Integer>> values,
            Collector<Integer> out) throws Exception {
        int sum = 0;
        for (value t: values) {
            sum += t.f1;
        }
        out.collect (new Integer(sum));
    }
});

// applying an AllWindowFunction on non-keyed window stream
allWindowedStream.apply (new AllWindowFunction<Tuple2<String,Integer>, Integer, Window>() {
    public void apply (Window window,
            Iterable<Tuple2<String, Integer>> values,
            Collector<Integer> out) throws Exception {
        int sum = 0;
        for (value t: values) {
            sum += t.f1;
        }
        out.collect (new Integer(sum));
    }
});
    

 

Window Reduce
WindowedStream → DataStream

Apply the reduce function to the window and return the reduced value.

windowedStream.reduce (new ReduceFunction<Tuple2<String,Integer>>() {
    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1, Tuple2<String, Integer> value2) throws Exception {
        return new Tuple2<String,Integer>(value1.f0, value1.f1 + value2.f1);
    }
});
    

 

Window Fold
WindowedStream → DataStream

Apply the fold function to the window and return the value after the fold.When an example function is applied to a sequence (1,2,3,4,5), the sequence is collapsed into the string "start 1-2-3-4-5":

windowedStream.fold("start", new FoldFunction<Integer, String>() {
    public String fold(String current, Integer value) {
        return current + "-" + value;
    }
});
    

 

Aggregations on windows
WindowedStream → DataStream

Based on window(), in differs from minBy in that min returns the minimum value, whereas minBy returns the element with the minimum value in the field (the same is true for max and maxBy).

windowedStream.sum(0);
windowedStream.sum("key");
windowedStream.min(0);
windowedStream.min("key");
windowedStream.max(0);
windowedStream.max("key");
windowedStream.minBy(0);
windowedStream.minBy("key");
windowedStream.maxBy(0);
windowedStream.maxBy("key");
    

 

Union
DataStream* → DataStream

Union of two or more data streams to create a new stream containing all elements from all streams.Note: If you combine a data stream with itself, you get each element in the result stream twice.

dataStream.union(otherStream1, otherStream2, ...);

 

Window Join
DataStream,DataStream → DataStream

Connect two data streams on a given key and public window.

dataStream.join(otherStream)
    .where(<key selector>).equalTo(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply (new JoinFunction () {...});
    

 

Interval Join
KeyedStream,KeyedStream → DataStream

Connect two elements e1 and e2 of two key streams with a common key at a given time interval, so that

e1.timestamp + lowerBound <= e2.timestamp <= e1.timestamp + upperBound

// this will join the two streams so that
// key1 == key2 && leftTs - 2 < rightTs < leftTs + 2
keyedStream.intervalJoin(otherKeyedStream)
    .between(Time.milliseconds(-2), Time.milliseconds(2)) // lower and upper bound
    .upperBoundExclusive(true) // optional
    .lowerBoundExclusive(true) // optional
    .process(new IntervalJoinFunction() {...});
    

 

Window CoGroup
DataStream,DataStream → DataStream

Combines two data streams on a given key and a public window.

dataStream.coGroup(otherStream)
    .where(0).equalTo(1)
    .window(TumblingEventTimeWindows.of(Time.seconds(3)))
    .apply (new CoGroupFunction () {...});
    

 

Connect
DataStream,DataStream → ConnectedStreams

Connect two data streams of reserved type.Allows a state-sharing connection between two streams.

DataStream<Integer> someStream = //...
DataStream<String> otherStream = //...

ConnectedStreams<Integer, String> connectedStreams = someStream.connect(otherStream);
    

 

CoMap, CoFlatMap
ConnectedStreams → DataStream

Similar to map and flatMap on connect-based data streams

connectedStreams.map(new CoMapFunction<Integer, String, Boolean>() {
    @Override
    public Boolean map1(Integer value) {
        return true;
    }

    @Override
    public Boolean map2(String value) {
        return false;
    }
});
connectedStreams.flatMap(new CoFlatMapFunction<Integer, String, String>() {

   @Override
   public void flatMap1(Integer value, Collector<String> out) {
       out.collect(value.toString());
   }

   @Override
   public void flatMap2(String value, Collector<String> out) {
       for (String word: value.split(" ")) {
         out.collect(word);
       }
   }
});
    

 

Split
DataStream → SplitStream

Divides a stream into two or more streams according to some criteria.

SplitStream<Integer> split = someDataStream.split(new OutputSelector<Integer>() {
    @Override
    public Iterable<String> select(Integer value) {
        List<String> output = new ArrayList<String>();
        if (value % 2 == 0) {
            output.add("even");
        }
        else {
            output.add("odd");
        }
        return output;
    }
});
                

 

Select
SplitStream → DataStream

Select one or more streams from the shunt.

SplitStream<Integer> split;
DataStream<Integer> even = split.select("even");
DataStream<Integer> odd = split.select("odd");
DataStream<Integer> all = split.select("even","odd");
                

 

Iterate
DataStream → IterativeStream → DataStream

Create a feedback loop in the stream by redirecting the output of one operator to a previous operator.This is particularly useful for defining algorithms that constantly update models.The following code starts from a stream and continues to apply an iteration volume.Elements greater than 0 are sent back to the feedback channel, and the rest are forwarded downstream.Iterations Subsequent Details//TODO

IterativeStream<Long> iteration = initialStream.iterate();
DataStream<Long> iterationBody = iteration.map (/*do something*/);
DataStream<Long> feedback = iterationBody.filter(new FilterFunction<Long>(){
    @Override
    public boolean filter(Long value) throws Exception {
        return value > 0;
    }
});
iteration.closeWith(feedback);
DataStream<Long> output = iterationBody.filter(new FilterFunction<Long>(){
    @Override
    public boolean filter(Long value) throws Exception {
        return value <= 0;
    }
});
                

 

Extract Timestamps
DataStream → DataStream
Extract timestamps from records to use windows that use event time semantics.Event Time Subsequent Details//TODO
Project
DataStream → DataStream

The following transformations can be used for the data flow of tuples:

Select a subset of fields from a tuple

DataStream<Tuple3<Integer, Double, String>> in = // [...]
DataStream<Tuple2<String, Integer>> out = in.project(2,0);

 

Physical partition

Flink also provides low-level control over the converted flow partitions through the following functions, if required.

Method Description and code samples
Custom partitioning
DataStream → DataStream

Select a target task for each element using a user-defined partitioner.

dataStream.partitionCustom(partitioner, "someKey");
dataStream.partitionCustom(partitioner, 0);
            

 

Random partitioning
DataStream → DataStream

Randomly divide elements according to uniform distribution.

dataStream.shuffle();
            

 

Rebalancing (Round-robin partitioning)
DataStream → DataStream

Partition element loops, creating the same load for each partition.This is useful for performance optimization where there is data skew.

dataStream.rebalance();

 

Rescaling
DataStream → DataStream

Divide elements (loops) into subsets of downstream operations.This is useful if you want pipelines, for example, to spread the load from each parallel instance of a source to a subset of several mappers, but you don't want rebalance() to result in a complete rebalance of the data.This will require only local data transfer, not over the network, depending on other configuration values, such as the number of slots in TaskManagers.

The subset of downstream operations to which an upstream operation sends elements depends on the degree of parallelism between the upstream and downstream operations.For example, if an upstream operation has parallelism 2 and a downstream operation has parallelism 6, one upstream operation will assign elements to three downstream operations and the other upstream operation to three other downstream operations.On the other hand, if the parallelism of the downstream operation is 2 and that of the upstream operation is 6, the parallelism of the upstream operation is 3.

One or more downstream operations have a different number of inputs than upstream operations when different degrees of parallelism are not multiples of each other.

dataStream.rescale();
            

 

Broadcasting
DataStream → DataStream

Broadcast elements to each partition.

dataStream.broadcast();

 

Task Links and Resource Groups

chaining means putting them in the same thread for better performance.If possible, the Flink chain operator is used by default (for example, two subsequent mapping transformations).The API can fine-grained control links if needed:

If you want to disable links throughout the job, use StreamExecutionEnvironment.disableOperator Chaining ().For finer-grained control, the following functions can be used.Note that these functions can only be used after a DataStream transformation because they refer to the previous transformation.For example, you can use someStream.map(...). startnewchain(), but someStream.startNewChain() cannot be used.

resource group is a slot in Flink, see Slot//TODO.If needed, operators can be manually isolated in separate slots.

Method Description and code samples
Start new chain

Start with this operator and create a new chain.These two mappers will be linked, and filters will not be linked to the first mapper.

someStream.filter(...).map(...).startNewChain().map(...);

 

Disable chaining
Do not chain the map operator 

 

Do not chain the map operator

Set the slot share group for the operation.Flink will place the operators with the same slot sharing group in the same slot, while keeping the operators without the slot sharing group in the other slots.This can be used to isolate the slots.If all input operators are in the same slot sharing group, the slot sharing group will inherit from the input operator.The default slot sharing group is named'default', and operations can be explicitly placed in the group by calling slotSharingGroup('default').

someStream.filter(...).slotSharingGroup("name");

 

 

Posted by michaelphipps on Tue, 20 Aug 2019 18:09:13 -0700