Flink Practical Tutorial - Advanced: Dimension Table Association

Keywords: flink

Author: Tencent Cloud Stream Computing Oceanus Team

Introduction to Stream Computing Oceanus

Stream Computing Oceanus is a powerful tool for real-time analytics in the big data ecosystem. It is an enterprise-grade real-time big data analytics platform based on Apache Flink, featuring one-stop development, seamless connectivity, sub-second latency, low cost, security, and stability. Stream Computing Oceanus aims to maximize the value of enterprise data and accelerate enterprises' real-time digital transformation.
This article explains in detail how to take MySQL data as a stream, associate it with an HBase dimension table (stream-dimension join), perform a simple aggregation, and write the results to Elasticsearch.


Prerequisites

Create a Stream Computing Oceanus cluster

Go to the Stream Computing Oceanus console [1], click Cluster Management in the left sidebar, then click Create Cluster in the upper left. For details, see the official document on creating an exclusive cluster [2].

Create a MySQL instance

Go to the MySQL console [3] and click Create. For details, see the official document on creating a MySQL instance [4].

Data preparation:

Connect to the instance, create the oceanus_advanced1_student_grade table, and insert data manually.

-- Create table
CREATE TABLE `oceanus_advanced1_student_grade` (
  `name`    varchar(50) NOT NULL DEFAULT '',
  `grade`   int(3)               DEFAULT NULL,
  PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Insert data
INSERT INTO `oceanus_advanced1_student_grade` (`name`, `grade`) VALUES ('Oceanus-1', 85);
INSERT INTO `oceanus_advanced1_student_grade` (`name`, `grade`) VALUES ('Oceanus-2', 95);
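To confirm the rows were inserted, an optional sanity check (shown for illustration only) can be run in the same MySQL session:

-- Optional check: both inserted rows should be returned
SELECT `name`, `grade` FROM `oceanus_advanced1_student_grade`;
-- Expected: ('Oceanus-1', 85) and ('Oceanus-2', 95)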

 

Create an EMR cluster

Log in to the Elastic MapReduce console [5] and choose Cluster List > Create Cluster to create a new cluster. For details, see the document on creating an EMR cluster [6]. When creating the cluster, make sure the HBase component is selected for installation.

Data preparation:

Log in to an EMR cluster node, open the HBase shell, create a new table, and insert data manually.

# Enter the HBase shell
root@yourhostname~# hbase shell

# Create table
create 'oceanus_advanced1_student_info','StuInfo'

# Insert data
put 'oceanus_advanced1_student_info','Oceanus-1','StuInfo:Class','01'
put 'oceanus_advanced1_student_info','Oceanus-1','StuInfo:Age','17'
put 'oceanus_advanced1_student_info','Oceanus-2','StuInfo:Class','01'
put 'oceanus_advanced1_student_info','Oceanus-2','StuInfo:Age','20'
put 'oceanus_advanced1_student_info','Oceanus-3','StuInfo:Class','01'
put 'oceanus_advanced1_student_info','Oceanus-3','StuInfo:Age','18'

Create an Elasticsearch cluster

Go to the Elasticsearch console [7] and click Create in the upper left to create an Elasticsearch instance. For details, see the document on creating an Elasticsearch cluster [8].

Stream Computing Oceanus job

1. Create Source

CREATE TABLE `mysql_cdc_source_table` (
  `name`        STRING,
  `grade`       STRING,
  `proc_time`   AS PROCTIME(),      -- The proc_time field is required by the stream-dimension join below
  PRIMARY KEY (`name`) NOT ENFORCED -- If the table to be synchronized defines a primary key, it must also be defined here
) WITH (
  'connector' = 'mysql-cdc',        -- Fixed value 'mysql-cdc'
  'hostname' = '10.0.0.158',        -- Database IP
  'port' = '3306',                  -- Database access port
  'username' = 'root',              -- Database username (requires SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT, SELECT and RELOAD privileges)
  'password' = 'Tencent123$',       -- Database password
  'database-name' = 'testdb',       -- Database to synchronize
  'table-name' = 'oceanus_advanced1_student_grade'   -- Table to synchronize
);

2. Create HBase dimension table

CREATE TABLE hbase_table (
  rowkey      STRING,
  StuInfo     ROW <Class STRING, Age STRING>,
  PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
  'connector' = 'hbase-1.4',                         -- Flink 1.13 also supports 'hbase-2.2'
  'table-name' = 'oceanus_advanced1_student_info',   -- HBase table name
  'zookeeper.quorum' = '10.0.0.118:2181,10.0.0.119:2181,10.0.0.3:2181'   -- ZooKeeper address of the HBase cluster
);

3. Create Sink

CREATE TABLE elasticsearch6_sink_table (
    `class`       STRING,
    `amount`      BIGINT,
    PRIMARY KEY(`class`) NOT ENFORCED
) WITH (
    'connector' = 'elasticsearch-6',          -- Output to Elasticsearch 6
    'username' = 'elastic',                   -- Optional username
    'password' = 'Tencent123$',               -- Optional password
    'hosts' = 'http://10.0.0.97:9200',        -- Elasticsearch connection address
    'index' = 'oceanus_advanced1',            -- Elasticsearch index name
    'document-type' = '_doc',                 -- Elasticsearch document type
    'sink.bulk-flush.max-actions' = '1',      -- Flush after every record
    'format' = 'json'                         -- Output format; currently only 'json' is supported
);

4. Write business SQL

INSERT INTO elasticsearch6_sink_table
SELECT
  b.StuInfo.Class            AS class,
  COUNT(a.name)              AS amount
FROM mysql_cdc_source_table AS a
JOIN hbase_table FOR SYSTEM_TIME AS OF a.proc_time AS b
-- The FOR SYSTEM_TIME AS OF clause is required here; without it the JOIN still executes,
-- but the dimension table is read in full only once and the results may not match expectations.
ON a.name = b.rowkey
WHERE CAST(a.grade AS INT) >= 90 AND CAST(b.StuInfo.Age AS INT) >= 18
GROUP BY b.StuInfo.Class;
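With the sample data above, only Oceanus-2 (grade 95, age 20) satisfies both filter conditions, so the job should write a single record with class = '01' and amount = 1 to Elasticsearch.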

 

Summary

Currently, the built-in connectors of the Stream Computing Oceanus platform support MySQL, PostgreSQL, Hive, HBase, Redis, and Oracle dimension tables, which can be used without any custom development. See the built-in dimension table reference list [9] for details. This example counts, for each class, the number of students who are at least 18 years old and scored at least 90 points; it has no actual business purpose.
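For reference, a MySQL dimension table can also be declared directly with the Flink JDBC connector instead of HBase. The following is a minimal sketch, assuming a hypothetical dimension table and placeholder connection details not taken from this tutorial:

-- Minimal sketch (assumptions, not part of this tutorial): a MySQL dimension table
-- declared with the Flink JDBC connector, with a lookup cache to reduce database load.
CREATE TABLE mysql_dim_table (
  `name`  STRING,
  `class` STRING,
  PRIMARY KEY (`name`) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://10.0.0.158:3306/testdb',   -- assumed JDBC URL
  'table-name' = 'some_dim_table',                 -- hypothetical dimension table name
  'username' = 'root',
  'password' = 'yourpassword',
  'lookup.cache.max-rows' = '1000',                -- cache at most 1000 looked-up rows
  'lookup.cache.ttl' = '10min'                     -- evict cached rows after 10 minutes
);

Such a table is then joined with FOR SYSTEM_TIME AS OF in the same way as the HBase table above.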

When creating the Elasticsearch Sink table, Flink uses the primary key field class defined in the CREATE TABLE statement to generate the document _id and updates the existing document accordingly (upsert stream). If no primary key is defined, Flink writes in append mode (append stream) and each document gets a randomly generated string as its _id.
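To illustrate the append case, the sketch below declares the same sink without a primary key (a minimal sketch reusing the connector options above, not part of this tutorial). Note that an append-only sink accepts insert-only results, so it would not work with the GROUP BY query used here; it is shown only to contrast the _id behavior:

-- Minimal sketch (illustration only): without PRIMARY KEY the connector writes an
-- append stream, and each document receives a randomly generated _id.
CREATE TABLE elasticsearch6_append_sink_table (
    `class`       STRING,
    `amount`      BIGINT
) WITH (
    'connector' = 'elasticsearch-6',
    'hosts' = 'http://10.0.0.97:9200',
    'index' = 'oceanus_advanced1',
    'document-type' = '_doc',
    'format' = 'json'
);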

 

Reference links

[1] Stream Computing Oceanus console: https://console.cloud.tencent.com/oceanus/overview

[2] Creating an exclusive cluster: https://cloud.tencent.com/document/product/849/48298

[3] MySQL console: https://console.cloud.tencent.com/cdb

[4] Creating a MySQL instance: https://cloud.tencent.com/document/product/236/46433

[5] Elastic MapReduce console: https://console.cloud.tencent.com/emr

[6] Creating an EMR cluster: https://cloud.tencent.com/document/product/589/10981

[7] Elasticsearch console: https://console.cloud.tencent.com/es

[8] Creating an Elasticsearch cluster: https://cloud.tencent.com/document/product/845/19536

[9] Built-in dimension table reference list: https://cloud.tencent.com/document/product/849/48264
