Hello everyone, I'm Later. I share bits and pieces of what I learn and do at work, hoping some of these articles can help you. All articles are published on my official account "Later X big data". Thank you for your support and recognition.
Another week with nothing new. I went back to Yuncheng last weekend to see a dentist and spent most of the time on the road, which was exhausting. Back to the point: I covered getting started with Flink in the previous article.
Today I'm going to talk about the stream processing API. All the code in this article is Scala.
Let's go back to the WordCount code from last time. A Flink program looks like a regular program that transforms data streams. Each program consists of the same basic parts:
- Get the execution environment
- Load / create the initial data
- Specify transformations on this data
- Specify where to write the results
- Trigger program execution
Get execution environment
To process data, you first have to get an execution environment. StreamExecutionEnvironment is the basis of every Flink program, so let's get one. There are three static methods:
- getExecutionEnvironment()
- createLocalEnvironment()
- createRemoteEnvironment(host: String, port: Int, jarFiles: String*)
```scala
// Get the context environment
val contextEnv = StreamExecutionEnvironment.getExecutionEnvironment
// Get a local environment
val localEnv = StreamExecutionEnvironment.createLocalEnvironment(1)
// Get a remote (cluster) environment
val remoteEnv = StreamExecutionEnvironment.createRemoteEnvironment("bigdata101", 3456, 2, "/ce.jar")
```
Generally speaking, we only need the first one, getExecutionEnvironment(), because it does the right thing based on the context: it decides what kind of environment to return according to how the program is run. If you run from an IDE, it returns a local execution environment; if you run on a cluster, it returns the cluster execution environment.
Predefined data stream sources
OK, once we have the environment, the next step is the data source. Flink supports multiple data sources. First, let's look at several predefined stream sources.
- File based
- readTextFile(path) - reads the text file line by line using TextInputFormat and returns each line as a string; the file is read once.
- readFile(fileInputFormat, path) - reads the file with the given file input format; the file is read once.
In fact, both of these methods end up calling readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo).
Let's look at the source code. Stepping into the first, simpler method, you can see that both methods eventually land in readFile(fileInputFormat, path, watchType, interval, pathFilter), just with default values for the extra parameters.
So these parameters can also be specified explicitly. This method isn't used very often, so I'll only introduce it briefly; if you need it, try the parameters yourself. A minimal sketch follows.
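Here is a minimal sketch of the monitored variant, assuming a local path input/word.txt purely for illustration; it re-scans the path every 10 seconds instead of reading it once:

```scala
import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// PROCESS_CONTINUOUSLY re-reads the path every `interval` milliseconds;
// PROCESS_ONCE reads it once and stops. The path is only a placeholder.
val fileDS: DataStream[String] = env.readFile(
  new TextInputFormat(new Path("input/word.txt")),
  "input/word.txt",
  FileProcessingMode.PROCESS_CONTINUOUSLY,
  10000L
)
fileDS.print("file")
```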
- Socket based
socketTextStream - reads from a socket. Elements can be separated by a delimiter.
I covered sockets before in "Finally understand why the TCP protocol is reliable: Computer Fundamentals (6), the transport layer". To recap:
Socket socket = {IP address: port number}, example: 192.168.1.99:3456
The code is used as follows:
val wordDS: DataStream[String] = contextEnv.socketTextStream("bigdata101",3456)
Sockets are abstract and exist only to represent TCP connections.
- Collection based
- fromCollection(Seq) - creates a data stream from a collection (a java.util.Collection in the Java API). All elements in the collection must have the same type.
- fromCollection(Iterator) - creates a data stream from an iterator. The class specifies the data type of the elements returned by the iterator.
- fromElements(elements: _*) - creates a data stream from the given sequence of objects. All objects must have the same type.
- fromParallelCollection(SplittableIterator) - creates a data stream from an iterator, in parallel. The class specifies the data type of the elements returned by the iterator.
- generateSequence(from, to) - generates the sequence of numbers in the given interval, in parallel.
These predefined sources aren't used much in practice, almost never in production, so just try them yourself.
Note that fromCollection(Seq) and friends create data streams from collections, so if you're programming in Scala you need to bring in the implicit conversions:
import org.apache.flink.streaming.api.scala._
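With that import in place, here is a quick sketch of the collection-based sources, just to show their shapes (the sample data is made up):

```scala
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Build small test streams from in-memory data
val seqDS: DataStream[(String, Int)] = env.fromCollection(Seq(("a", 1), ("b", 2)))
val elemDS: DataStream[String] = env.fromElements("flink", "spark", "kafka")
val numDS: DataStream[Long] = env.generateSequence(1, 10)

seqDS.print("seq")
```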
Connector data sources
You may have noticed that the methods above almost all read from a fixed data source and are fine for self-testing, but they can't be used in production. So let's look at the serious data sources.
The officially supported sources and sinks are as follows:
- Apache Kafka (source / sink)
- Apache Cassandra (sink)
- Amazon Kinesis Streams (source / sink)
- Elasticsearch (sink)
- Hadoop FileSystem (sink)
- RabbitMQ (source / sink)
- Apache NiFi (source / sink)
- Twitter Streaming API (source)
- Google PubSub (source / sink)
Of these, three are commonly used day to day: Kafka as the data source, and Elasticsearch and the Hadoop file system (HDFS) as sinks. Let's talk about the Kafka source first. I won't cover Kafka installation and deployment here; Google it. Let's post the code and analyze it.
Kafka Source
- Add the Kafka connector dependency in pom.xml
```xml
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
    <version>1.7.2</version>
</dependency>
```
This involves the question of versions: adjust it according to your own setup. But note: as of Flink 1.7, the universal Kafka connector is considered to be in BETA state and may not be as stable as the 0.11 connector, so I suggest using flink-connector-kafka-0.11_2.11.
- The test code:
```scala
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
import org.apache.flink.streaming.api.scala._

/**
 * @description: ${kafka Source Test}
 * @author: Liu Jun Jun
 * @create: 2020-06-10 10:56
 **/
object kafkaSource {
  def main(args: Array[String]): Unit = {
    // Get the execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val properties = new Properties()
    // Configure the Kafka connector
    properties.setProperty("bootstrap.servers", "bigdata101:9092")
    properties.setProperty("group.id", "test")
    // Set the deserializers
    properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    // Set the offset reset strategy; possible values: earliest, latest, none
    properties.setProperty("auto.offset.reset", "latest")
    // earliest: if a partition has a committed offset, consume from it; otherwise consume from the beginning
    // latest: if a partition has a committed offset, consume from it; otherwise consume only newly produced data
    // none: if every partition has a committed offset, consume from those offsets; if any partition has none, throw an exception

    val kafkaDS: DataStream[String] = env.addSource(
      new FlinkKafkaConsumer011[String](
        "test1",
        new SimpleStringSchema(),
        properties
      )
    )

    kafkaDS.print("Test:")
    env.execute("kafkaSource")
  }
}
```
In this example, pay attention to the difference between the three offset reset strategies:
- If there is a committed offset, both earliest and latest consume from the committed offset.
- If there is no committed offset, earliest means consuming from the beginning, while latest means consuming only the latest, i.e. newly produced, data.
- none: if every partition of the topic has a committed offset, consumption starts after those offsets; if even one partition has no committed offset, an exception is thrown.
The Kafka deserialization schema should be configured according to your actual needs, for example a JSON schema instead of plain strings, as sketched below.
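For instance, if the topic holds JSON records, a sketch like the following (reusing env and properties from the example above) swaps SimpleStringSchema for the connector's JSON schema and produces the kind of ObjectNode stream that the MapFunction example in the functions section below consumes. Treat the class and package names here as assumptions to verify against your Flink version.

```scala
// Hedged sketch: JSONKeyValueDeserializationSchema ships with the Kafka connector and yields
// Jackson ObjectNode records; `true` asks it to include topic/partition/offset metadata.
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.ObjectNode
import org.apache.flink.streaming.util.serialization.JSONKeyValueDeserializationSchema

val jsonDS: DataStream[ObjectNode] = env.addSource(
  new FlinkKafkaConsumer011[ObjectNode](
    "test1",
    new JSONKeyValueDeserializationSchema(true),
    properties
  )
)
jsonDS.print("json")
```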
The above is just a simple example of using Kafka as a data source. As for common topics such as checkpoints and exactly-once consumption, I'll cover them separately later; this time we're just walking through the basic API.
Transform operator
After the source comes the transform step. There are many such operators and the official website covers them completely, so I'll just link it here: introduction to the conversion operators on the Flink official website.
The ones we use most often are the following; you could say they are almost the same as the corresponding Spark operators. Let's go through them briefly:
- map - transforms each element, producing a new data stream; DataStream → DataStream
```scala
// Map each word to (word, 1)
wordDS.map((_, 1))
```
- flatMap - flattens; DataStream → DataStream
```scala
// Take a line of text and split it into words by spaces
dataStream.flatMap(_.split(" "))
```
- filter - filters; DataStream → DataStream
```scala
// Keep only the numbers whose remainder when divided by 2 is 0
dataStream.filter(_ % 2 == 0)
```
- keyBy - groups; DataStream → KeyedStream
```scala
// For word count, group by word. The 0 is the tuple field index:
// the (word, 1) stream is keyed by word, which is field 0
wordDS.map((_, 1)).keyBy(0)
```
- reduce - aggregates; KeyedStream → DataStream (see the combined sketch right after this list)
```scala
// Aggregate the (word, count) pairs after keyBy, merging the current element
// with the previous aggregation result to implement word count
wordDS.map((_, 1)).keyBy(0).reduce { (s1, s2) =>
  (s1._1, s1._2 + s2._2)
}
```
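Putting the operators above together, here is a minimal sketch of a streaming word count over the socket stream wordDS from earlier:

```scala
// flatMap splits lines into words, map builds (word, 1) pairs,
// keyBy groups by the word, reduce keeps a running count per word
val wordCountDS: DataStream[(String, Int)] = wordDS
  .flatMap(_.split(" "))
  .map((_, 1))
  .keyBy(0)
  .reduce((s1, s2) => (s1._1, s1._2 + s2._2))

wordCountDS.print("wc")
```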
As for operators like the two-stream join and windows, I'll cover them in the next article. In this one the goal is just to get familiar with the common, simple APIs.
Functions
Besides the simple transformation operators above, in Flink we often run into requirements those operators can't meet on their own, so we need to define our own UDF functions.
About UDF, UDTF, UDAF
UDF: user-defined function, one row in, one row out
UDAF: user-defined aggregate function, many rows in, one row out
UDTF: user-defined table-generating function, one row in, multiple rows out
UDF function class
In fact, our common operators such as map and filter all expose corresponding function interfaces that we can implement ourselves.
For example, map:
```scala
val StuDS: DataStream[Stu] = kafkaDS.map(
  // We can implement MapFunction ourselves to do the type conversion
  new MapFunction[ObjectNode, Stu]() {
    override def map(value: ObjectNode): Stu = {
      JSON.parseObject(value.get("value").toString, classOf[Stu])
    }
  }
)
```
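Filter works the same way. A small sketch of a custom FilterFunction, assuming the Stu class above has an age field:

```scala
import org.apache.flink.api.common.functions.FilterFunction

// Keep only the records whose age is greater than 18; Stu.age is an assumption for this sketch
val adultDS: DataStream[Stu] = StuDS.filter(new FilterFunction[Stu] {
  override def filter(value: Stu): Boolean = value.age > 18
})
```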
Rich Functions
Besides the functions above, rich functions are also used a lot. Every Flink function class has a Rich version. The difference from a normal function is that a rich function can access the context of the running environment and has lifecycle methods such as open, close, getRuntimeContext and setRuntimeContext, so it can implement more complex features such as accumulators and counters.
Let's implement an accumulator
An accumulator is a simple construct with an add operation and a final accumulated result that is available after the job has finished. The simplest accumulator is a counter: you increment it with the Accumulator.add(V value) method. At the end of the job, Flink merges all the partial results and sends them to the client.
```scala
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.common.accumulators.IntCounter
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

/**
 * @description: ${description}
 * @author: Liu Jun Jun
 * @create: 2020-06-12 17:55
 **/
object AccumulatorTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.streaming.api.scala._

    val dataDS = env
      .readTextFile("input/word.txt")
      //.socketTextStream("bigdata101", 3456)

    val resultDS: DataStream[String] = dataDS.map(new RichMapFunction[String, String] {

      // Step 1: define the accumulator
      private val numLines = new IntCounter

      override def open(parameters: Configuration): Unit = {
        super.open(parameters)
        // Step 2: register the accumulator
        getRuntimeContext.addAccumulator("num-lines", this.numLines)
      }

      override def map(value: String): String = {
        this.numLines.add(1)
        value
      }

      override def close(): Unit = super.close()
    })

    resultDS.print("Word input")

    val jobExecutionResult = env.execute("Word statistics")
    // Print the accumulator result after the job finishes
    println(jobExecutionResult.getAccumulatorResult("num-lines"))
  }
}
```
Note: in this example I used a bounded stream, because the accumulator value can only be printed once the program has finished; it can also be seen directly in the Flink UI.
So how do we print the accumulator value at any time? For that we would need a custom accumulator.
And I haven't finished writing that custom accumulator yet...
Sink
Once we have finished processing the data with Flink, we need to put the result data into the appropriate storage, i.e. the sink, so that reports can later query it through interface calls.
So where can the data go?
- ES
- Redis
- HBase
- MySQL
- Kafka
ES sink
As for an introduction to ES, I've also published an article before, though it's only entry level. If you need it, the link is: Introduction to the latest edition of ES
Come on, let's post the code. See the comments:
```scala
import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.datastream.DataStream
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
import org.apache.http.HttpHost
import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.Requests

/**
 * @description: ${description}
 * @author: Liu Jun Jun
 * @create: 2020-06-01 11:44
 **/
object flink2ES {
  def main(args: Array[String]): Unit = {
    // 1. Get the execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // 2. Set the parallelism to 2
    env.setParallelism(2)

    // 3. Configure the Kafka source: topic, brokers, consumer group, deserializers and offset reset
    val topic = "ctm_student"
    val properties = new java.util.Properties()
    properties.setProperty("bootstrap.servers", "bigdata101:9092")
    properties.setProperty("group.id", "consumer-group")
    properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    properties.setProperty("auto.offset.reset", "latest")

    // Get data from Kafka
    val kafkaDS: DataStream[String] = env.addSource(
      new FlinkKafkaConsumer011[String](
        topic,
        new SimpleStringSchema(),
        properties)
    )

    // Configure the ES connection
    val httpHosts = new java.util.ArrayList[HttpHost]()
    httpHosts.add(new HttpHost("bigdata101", 9200))

    // Create the ES sink builder; the data is written inside the ElasticsearchSinkFunction
    val esSinkBuilder = new ElasticsearchSink.Builder[String](
      httpHosts,
      new ElasticsearchSinkFunction[String] {
        def createIndexRequest(element: String): IndexRequest = {
          val json = new java.util.HashMap[String, String]
          json.put("data", element)
          return Requests.indexRequest()
            .index("ws")
            .`type`("readingData")
            .source(json)
        }

        // Override process to handle each input element. runtimeContext is the runtime context,
        // requestIndexer is the object used to issue index operations
        override def process(t: String, runtimeContext: RuntimeContext, requestIndexer: RequestIndexer): Unit = {
          // add() accepts index, delete and update action requests
          requestIndexer.add(createIndexRequest(t))
          println("saved successfully")
        }
      })

    // When testing, this line is required. Otherwise Kafka produces a few records but nothing
    // shows up in ES: by default a bulk flush happens only every 5000 records. This relates to the
    // index refresh behaviour of ES; the corresponding settings are listed below.
    esSinkBuilder.setBulkFlushMaxActions(1)

    // Send the data to ES
    kafkaDS.addSink(esSinkBuilder.build())

    // Trigger execution
    env.execute()
  }
}
```
In the code above, we get the data from Kafka and write it straight to ES, but we only simulated flushing after every single index request. In daily production you certainly wouldn't flush on every request, that would put far too much pressure on ES, so there is a set of bulk submission settings (they correspond to setter methods on the sink builder; see the sketch after this list):
- bulk.flush.max.actions: the maximum number of actions to buffer before flushing.
- bulk.flush.max.size.mb: the maximum size (in megabytes) of data to buffer before flushing.
- bulk.flush.interval.ms: the flush interval, regardless of the number or size of buffered actions.
For the current version of ES, retrying temporarily failed requests can also be configured:
- bulk.flush.backoff.enable: whether to retry with a backoff delay when one or more actions of a flush fail for temporary reasons with an EsRejectedExecutionException.
- bulk.flush.backoff.type: the type of backoff delay, either CONSTANT or EXPONENTIAL.
- bulk.flush.backoff.delay: the amount of delay. For constant backoff this is simply the delay between retries; for exponential backoff this is the initial base delay.
- bulk.flush.backoff.retries: the number of backoff retries to attempt.
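These settings map to setter methods on the ElasticsearchSink.Builder from the example above. A minimal sketch (the values are purely illustrative, not recommendations; verify the method names against your connector version):

```scala
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkBase

// Illustrative values only; tune them for your own cluster and throughput
esSinkBuilder.setBulkFlushMaxActions(1000)    // bulk.flush.max.actions
esSinkBuilder.setBulkFlushMaxSizeMb(5)        // bulk.flush.max.size.mb
esSinkBuilder.setBulkFlushInterval(2000L)     // bulk.flush.interval.ms
esSinkBuilder.setBulkFlushBackoff(true)       // bulk.flush.backoff.enable
esSinkBuilder.setBulkFlushBackoffType(ElasticsearchSinkBase.FlushBackoffType.EXPONENTIAL)
esSinkBuilder.setBulkFlushBackoffDelay(1000L) // bulk.flush.backoff.delay
esSinkBuilder.setBulkFlushBackoffRetries(3)   // bulk.flush.backoff.retries
```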
Redis Sink
You should all be familiar with Redis. Let's simulate saving the processed data into Redis; here I'm simulating a single Redis node.
Note: because we need to access the remote Redis from the IDE, two places in Redis's redis.conf configuration file need to be changed:
- Comment out bind 127.0.0.1, otherwise only the machine Redis is installed on can connect and other machines cannot access it
- Turn off protection mode
protected-mode no
Then you can start Redis. Here are a few simple commands for working with the service:
- Start the Redis service: redis-server /usr/local/redis/redis.conf
- Enter Redis:
Connect: redis-cli
Specify an IP: redis-cli -h master102
Fix the Chinese display problem: redis-cli -h master --raw
When multiple Redis instances are running, specify the port to connect: redis-cli -p <port number>
- Shut down the Redis service:
1) Single instance
If no client has connected yet, you can simply run redis-cli shutdown
If you are already inside the client, just run shutdown
2) Multiple instances
Shut down by specifying the port: redis-cli -p <port number> shutdown
```scala
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.datastream.DataStream
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}

/**
 * @description: ${Connect to a single-node redis test}
 * @author: Liu Jun Jun
 * @create: 2020-06-12 11:23
 **/
object flink2Redis {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(2)

    // Kafka source configuration
    val topic = "test1"
    val properties = new java.util.Properties()
    properties.setProperty("bootstrap.servers", "bigdata101:9092")
    properties.setProperty("group.id", "consumer-group")
    properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    properties.setProperty("auto.offset.reset", "latest")

    // Get data from Kafka
    val kafkaDS: DataStream[String] = env.addSource(
      new FlinkKafkaConsumer011[String](
        topic,
        new SimpleStringSchema(),
        properties)
    )

    kafkaDS.print("data:")

    // Redis connection configuration
    val conf = new FlinkJedisPoolConfig.Builder().setHost("bigdata103").setPort(6379).build()

    kafkaDS.addSink(new RedisSink[String](conf, new RedisMapper[String] {
      // Write with HSET into the hash named "sensor"
      override def getCommandDescription: RedisCommandDescription = {
        new RedisCommandDescription(RedisCommand.HSET, "sensor")
      }

      // The hash field comes from the first comma-separated column
      override def getKeyFromData(t: String): String = {
        t.split(",")(0)
      }

      // The hash value comes from the second comma-separated column
      override def getValueFromData(t: String): String = {
        t.split(",")(1)
      }
    }))

    env.execute()
  }
}
```
HBase Sink
Sometimes you need to write the data processed by Flink into HBase. Let's try it.
```scala
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

/**
 * @description: ${flink to Hbase}
 * @author: Liu Jun Jun
 * @create: 2020-05-29 17:53
 **/
object flink2Hbase {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.streaming.api.scala._

    val stuDS = env
      .socketTextStream("bigdata101", 3456)
      .map(s => {
        // The student case class only has name and age
        val stu: Array[String] = s.split(",")
        student(stu(0), stu(1).toInt)
      })

    // This example is deliberately simple: no processing, the data is written straight to HBase.
    // HBaseSink is a class we write ourselves, shown below.
    val hBaseSink: HBaseSink = new HBaseSink("WordCount", "info1")
    stuDS.addSink(hBaseSink)

    env.execute("app")
  }
}

case class student(name: String, age: Int)

/**
 * @description: ${Encapsulate the Hbase connection}
 * @author: Liu Jun Jun
 * @create: 2020-06-01 14:41
 **/
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}
import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants, TableName}
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.util.Bytes

class HBaseSink(tableName: String, family: String) extends RichSinkFunction[student] {

  var conn: Connection = _

  // Create the connection
  override def open(parameters: Configuration): Unit = {
    conn = HbaseFactoryUtil.getConn()
  }

  // Invoked for every element
  override def invoke(value: student): Unit = {
    val t: Table = conn.getTable(TableName.valueOf(tableName))
    val put: Put = new Put(Bytes.toBytes(value.age))
    put.addColumn(Bytes.toBytes(family), Bytes.toBytes("name"), Bytes.toBytes(value.name))
    put.addColumn(Bytes.toBytes(family), Bytes.toBytes("age"), Bytes.toBytes(value.age))
    t.put(put)
    t.close()
  }

  override def close(): Unit = {
  }
}
```
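The HbaseFactoryUtil used in open() isn't shown above. For completeness, here is a hedged sketch of what such a helper might look like; the ZooKeeper quorum and port are placeholders, and a single lazily created, shared connection is just one common choice:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

// Hypothetical helper: lazily creates and caches a single HBase connection.
// The quorum and client port values below are assumptions, not part of the original article.
object HbaseFactoryUtil {

  @volatile private var conn: Connection = _

  def getConn(): Connection = {
    if (conn == null || conn.isClosed) {
      this.synchronized {
        if (conn == null || conn.isClosed) {
          val conf = HBaseConfiguration.create()
          conf.set("hbase.zookeeper.quorum", "bigdata101,bigdata102,bigdata103")
          conf.set("hbase.zookeeper.property.clientPort", "2181")
          conn = ConnectionFactory.createConnection(conf)
        }
      }
    }
    conn
  }
}
```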
Well, that pretty much wraps up Flink's basic stream processing APIs. We haven't covered the key points of Flink yet, such as windows, exactly-once semantics, state programming and time semantics, so the next Flink article will focus on those.