Flink for Big Data

Keywords: Big Data, Machine Learning, Flink

Preface

In Flink for Big Data (Part I) we introduced Flink's characteristics, architecture, two-phase commit, and data flow. This article introduces Flink's distinctive operators and a WordCount example implemented with Flink.

1, Split and Select operators

The split operator divides a DataStream into two or more DataStreams according to some characteristic of the elements.

The select operator obtains one or more DataStreams from a SplitStream.

The code is as follows:

//Tag each element according to its ch (channel) field
val splitStream: SplitStream[Startuplog] = startuplogDstream.split {
  startuplog =>
    var flags: List[String] = null
    if (startuplog.ch == "appstore") {
      flags = List("apple", "usa")
    } else if (startuplog.ch == "huawei") {
      flags = List("android", "china")
    } else {
      flags = List("android", "other")
    }
    flags
} //Divide the data into multiple streams according to the tags attached to each element

//Select the tagged sub-streams from the SplitStream for subsequent processing
val appleStream: DataStream[Startuplog] = splitStream.select("apple", "china")
val otherStream: DataStream[Startuplog] = splitStream.select("other")

2, Connect and CoMap operators

The connect operator joins two data streams while preserving their types. After the two streams are connected, they are merely placed in the same ConnectedStreams; their data and element types remain unchanged, and the two streams stay independent of each other.

The CoMap and CoFlatMap operators act on ConnectedStreams. They work like map and flatMap, except that the map/flatMap logic is applied to each stream in the ConnectedStreams separately.

Note: on a ConnectedStreams, map/flatMap must specify how each of the streams is processed, i.e. a separate function for each stream, and both functions must have the same return type, which is also the element type of the resulting stream. A plain map/flatMap takes only one function because it operates on a single data stream.
The code is as follows:

val conStream: ConnectedStreams[Startuplog, Startuplog] = appleStream.connect(otherStream)
val allStream: DataStream[String] = conStream.map(
  //One function per stream; both must return the same type (String here)
  (startuplog1: Startuplog) => startuplog1.ch,
  (startuplog2: Startuplog) => startuplog2.ch
)
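
The text above also mentions CoFlatMap. As a minimal sketch under the same assumptions (the conStream built above and the Startuplog type with a ch field), coFlatMap works the same way, except each per-stream function may emit zero or more elements, and both functions must still return the same element type:

//Sketch only: coFlatMap on ConnectedStreams, one function per stream
val flatStream: DataStream[String] = conStream.flatMap(
  (startuplog1: Startuplog) => List(startuplog1.ch, "apple"), //may emit several elements
  (startuplog2: Startuplog) => List(startuplog2.ch)           //or just one
)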

3, Union operator

The union operator merges two or more DataStreams into a new DataStream that contains all of their elements. Note: if you union a DataStream with itself, each element will appear twice in the resulting DataStream.

The code is as follows:

val unionStream:DataStream[Startuplog] = appleStream.union(otherStream)

When merging data streams, union merges directly but requires all input streams to have the same element type, while connect places the two streams side by side in a ConnectedStreams so that type conversion can be done afterwards (for example with CoMap). In addition, connect can only combine two data streams at a time, while union can merge more than two.
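
To make the difference concrete, here is a small sketch; huaweiStream is a hypothetical third stream of the same Startuplog type, and allStream is the DataStream[String] produced in the CoMap example above:

//union accepts any number of streams, but they must all share the same element type
val mergedStream: DataStream[Startuplog] = appleStream.union(otherStream, huaweiStream)
//connect pairs exactly two streams, whose element types may differ (Startuplog and String here)
val mixedStream: ConnectedStreams[Startuplog, String] = appleStream.connect(allStream)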

4, WordCount case

4.1 Offline data (batch)

The code is as follows:

import org.apache.flink.api.scala._

// Create the batch execution environment
val env = ExecutionEnvironment.getExecutionEnvironment

// Read the input text file as a DataSet of lines
val textDataSet: DataSet[String] = env.readTextFile("D:\\data\\1.txt")

// Split lines into words, map each to (word, 1), group by the word and sum the counts
val aggSet: AggregateDataSet[(String, Int)] = textDataSet.flatMap(_.split(" ")).map((_, 1)).groupBy(0).sum(1)

aggSet.print()

4.2 Online data (streaming)

The code is as follows:

import org.apache.flink.streaming.api.scala._

// Create the streaming execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

// Read lines from a socket text stream (host "hadoop1", port 7777)
val dataStream: DataStream[String] = env.socketTextStream("hadoop1", 7777)

// Split lines into words, map each to (word, 1), key by the word and keep a running sum
val aggStream: DataStream[(String, Int)] = dataStream.flatMap(_.split(" ")).map((_, 1)).keyBy(0).sum(1)

aggStream.print()

env.execute() //A streaming job only starts when execute() is called

Summary

In Flink for Big Data (Part I) we introduced Flink's characteristics, architecture, two-phase commit, and data flow. This article introduced Flink's distinctive operators and a WordCount example implemented with Flink. If anything is missing or could be improved, I hope you will point it out so we can make progress together.
