Big Data Learning: Flink

Keywords: Spark, Apache, JVM, Scala

Contents

 

1: Introduction

2: Why Flink

3: Which industries need it

4: Features of Flink

5: Differences from Spark Streaming

6: Getting started

7: Flink configuration description

8: Environment

9: Runtime components

1: Introduction

Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.

 

2: Why Flink

Streaming data more faithfully reflects the way we live: data is produced continuously, as events happen.

Traditional data architectures are built on finite (bounded) data sets.

Flink aims for low latency, high throughput, accurate results, and good fault tolerance.

 

3: Which industries need it

E-commerce and marketing: data reports, ad serving, business-process monitoring.

Internet of Things (IoT): real-time collection and display of sensor data, real-time alerting; also the transportation industry.

Telecommunications: traffic allocation across base stations.

Banking and finance: real-time settlement and notification push, real-time monitoring of abnormal behavior.

 

4: Features of Flink

Event-driven.

A stream-based world view (a batch is treated as a bounded stream).

Layered APIs.

Support for event-time semantics and handling of late events (see the sketch after this list).

Exactly-once state-consistency guarantees.

Low latency.

Connectors for many common storage systems.

High availability and dynamic scaling.
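
Event time and watermarks deserve a quick illustration. Below is a minimal sketch using the same Flink 1.7.2 Scala API as the examples in section 6; the SensorReading case class, the "id,timestamp,temperature" input format, and port 44444 are made up for this example:

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class SensorReading(id: String, timestamp: Long, temperature: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Use event time (the timestamp carried in each record) instead of processing time
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val readings: DataStream[SensorReading] = env
  .socketTextStream("localhost", 44444)
  .map { line =>
    val fields = line.split(",")
    SensorReading(fields(0), fields(1).toLong, fields(2).toDouble)
  }
  // Watermarks that tolerate events arriving up to 5 seconds out of order
  .assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(5)) {
      override def extractTimestamp(r: SensorReading): Long = r.timestamp // milliseconds
    })

// Maximum temperature per sensor in 10-second event-time windows
readings
  .keyBy(_.id)
  .timeWindow(Time.seconds(10))
  .maxBy("temperature")
  .print()

env.execute("event time sketch")

The extractor emits watermarks that lag 5 seconds behind the highest timestamp seen so far, so each 10-second window only fires once events up to 5 seconds late have had a chance to arrive.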

 

5: Differences from Spark Streaming

1. Flink is true stream processing, while Spark Streaming splits the data into micro-batches; this is why Flink can achieve lower latency.

2. The data models differ. Spark uses the RDD model; a Spark Streaming DStream is in effect a collection of small-batch RDDs. Flink's data model is based on streams of data and sequences of events.

3. The runtime architectures differ. Spark is batch computing: the DAG is divided into stages, and a stage can only start once the previous stage has finished. Flink uses a standard streaming execution model: as soon as one node finishes processing an event, the event can be sent straight on to the next node for processing.

 

6: Getting started

Add the dependencies:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_2.11</artifactId>
    <version>1.7.2</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>1.7.2</version>
</dependency>

The batch-processing code with Flink is as follows:

import org.apache.flink.api.scala._

// Batch execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

val inputPath = "C:\\hnn\\Project\\Self\\spark\\in\\hello.txt"

// Read the file contents
val inputDataSet: DataSet[String] = env.readTextFile(inputPath)

val words = inputDataSet
  .flatMap(_.split(" ")) // split each line into words
  .map((_, 1))           // wrap each word in a (word, 1) tuple
  .groupBy(0)            // group by the word (tuple field 0)
  .sum(1)                // sum the counts (tuple field 1)

words.print()
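
As a quick sanity check: if hello.txt contained the single line "hello world hello" (a made-up input), the printed result would be (hello,2) and (world,1), in no particular order.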

The stream-processing code with Flink is as follows (start a local nc (netcat) service first to simulate a data stream):

import org.apache.flink.streaming.api.scala._

// Streaming execution environment
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

// Receive a socket text stream
val dataStream: DataStream[String] = env.socketTextStream("localhost", 44444)

val result: DataStream[(String, Int)] = dataStream
  .flatMap(_.split(" "))
  .filter(_.contains("a")) // keep only the words containing "a"
  .map((_, 1))             // wrap each word in a (word, 1) tuple
  .keyBy(0)                // partition by the word (the streaming analogue of groupBy)
  .sum(1)                  // running count per word

result.print()

// Unlike the batch API, nothing runs until execute() is called
env.execute()
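
With the job running, typing for example "a cat and a mat" into the nc console would print (a,1), (cat,1), (and,1), (a,2), (mat,1): keyBy(0).sum(1) emits an updated count for every incoming event, and print() typically prefixes each line with the index of the emitting subtask.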

7: Flink configuration description

These settings live in conf/flink-conf.yaml:

# JobManager IP address
jobmanager.rpc.address: localhost

# JobManager RPC port
jobmanager.rpc.port: 6123

# JobManager JVM heap size
jobmanager.heap.size: 1024m

# TaskManager JVM heap size
taskmanager.heap.size: 1024m

# Number of task slots offered by each TaskManager
taskmanager.numberOfTaskSlots: 1

# Default parallelism used when a job does not set its own
parallelism.default: 1
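
The last two settings are related: thanks to slot sharing, a job needs at least as many slots as its largest operator parallelism. A minimal sketch of setting parallelism in code rather than in the config file (the values and port are illustrative):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Override parallelism.default from flink-conf.yaml for this whole job
env.setParallelism(2)

env.socketTextStream("localhost", 44444) // socket sources are non-parallel
  .flatMap(_.split(" "))
  .map((_, 1))
  .setParallelism(4) // per-operator override; the job now needs at least 4 slots,
                     // i.e. 4 TaskManagers when taskmanager.numberOfTaskSlots is 1
  .keyBy(0)
  .sum(1)
  .print()

env.execute("parallelism sketch")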

 

8: Environment

(1) Download Flink; the Scala version of the download must match the Scala version used in your code.

(2) Start Flink: start-cluster.bat (on Windows).

(3) Open Flink's web UI, available by default at localhost:8081.

(4) Start netcat: nc -L -p 44444 (Windows syntax; on Linux the equivalent is nc -lk 44444).

(5) Package the project, upload the jar, fill in the job configuration (entry class, parallelism, program arguments), and submit the task.
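
Step (5) can also be done from the command line with the flink client; a sketch, where the entry class and the jar path are placeholders for your own project:

flink run -c com.example.StreamWordCount target/my-flink-job-1.0.jar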

9: Runtime components

JobManager

The job manager, a process that receives the application to be executed and coordinates the job-submission flow.

TaskManager

The task manager, also a process. Each TaskManager contains a certain number of slots, and the number of slots limits how many tasks the TaskManager can execute. After starting, a TaskManager registers its slots with the ResourceManager; when instructed by the ResourceManager, it offers one or more slots to the JobManager, which then assigns tasks to those slots for execution.

ResourceManager

The resource manager, which manages the slots.

Dispatcher

The dispatcher. It provides an entry point for submitting applications, starts a JobManager for each submitted job, and also serves the web UI.
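
Putting these together, the submission flow is roughly: the client submits the application to the Dispatcher; the Dispatcher starts a JobManager for the job; the JobManager requests slots from the ResourceManager; the ResourceManager instructs TaskManagers to offer their registered slots; and the JobManager deploys the job's tasks into those slots.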

 

 

 
