03-PDI(Kettle) import and export CDC

Keywords: etl

03-PDI(Kettle) import and export CDC

Full, incremental and differential backup of data

Reference blog:
https://blog.csdn.net/qq_38097573/article/details/103593150

Far away, far away, there is a cashier.
He has to write down a lot of bills every day.
The old man was cautious all his life. In order to ensure the safety of the account book,
He asked three disciples to help back up the ledger,
So even if the ledger is lost,
You can also continue to use the backed up ledger.
The three disciples have their own strengths and adopt different methods:
Big apprentice ▼
He is kind-hearted, mature and steady.
He copied master's bill again every day. The advantage of this is that every day is a complete account book, and every backup account book can be used directly. The disadvantage is that it takes a lot of time to record every day, and it requires a lot of paper, ink and a cabinet to store the account book.
Second apprentice ▼
He has a bright personality and is a big kid.
She felt that the eldest martial sister's method was too tired and time-consuming. It was better to only copy the new information in the account book every day. In this way, she was always the first to copy every day, which not only saved time, but also saved a lot of paper, ink and other costs for the store. However, she hated master to check the ledger, so she needed to piece together the data recorded each time to form a complete ledger.
Three apprentices ▼
The character is meticulous and meticulous
She learned from the methods of the two elder martial sisters and made a compromise and innovation. Specifically, she copied all the information in the account book every other period of time, and during this period, she only recorded the information that changed every day compared with the "general ledger". For example, she copied a general ledger on Monday, spent 100 yuan on buying vegetables on Tuesday, and recorded "100 yuan on buying vegetables" on Wednesday; It cost 50 yuan to buy meat on Wednesday. On Thursday, it recorded "50 yuan to buy meat" on Wednesday and "100 yuan to buy vegetables" on Tuesday. If master wants to check the accounts, she puts the newly recorded data together with the "general ledger", which is a complete account book. This method is a compromise in terms of bookkeeping speed, paper and ink usage, Bill viewing, etc. Oh, by the way, an obvious deficiency is that you need to type more words to explain!

The three methods of the three apprentices have their own advantages. Later, in order to facilitate memory, the cashier named the three methods: full backup, incremental backup and differential backup.

Time stamp based source data CDC

Experimental principle

Identify the changed data from the timestamp and import only this part of the data. According to CDC_ time_ The last execution time in the log table and the entered current execution time are used to export students incrementally_ Data in CDC table. The output data is stored in XX/student_cdc.xls file. Among them, CDC_ time_ The main function of the log table is to record the last execution time. The data between the current execution time and the last execution time is incremental data. After the pull is successful, the CDC needs to be deleted_ time_ The last execution time in the log table is updated to the current execution time. In this way, the CDC operation can continue.

Experimental steps

  1. Create student_cdc table.

Execute student on the mysql command line_ Cdc.sql script

student_cdc.sql content

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for student_cdc
-- ----------------------------
DROP TABLE IF EXISTS `student_cdc`;
CREATE TABLE `student_cdc`  (
  `Student number` int(11) NOT NULL,
  `full name` varchar(45) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `Gender` varchar(45) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `class` varchar(45) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `Age` varchar(45) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `achievement` int(15) NULL DEFAULT NULL,
  `height` varchar(45) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `mobile phone` varchar(45) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `Insertion time` varchar(45) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `Update time` varchar(45) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`Student number`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;

-- ----------------------------
-- Records of student_cdc
-- ----------------------------
INSERT INTO `student_cdc` VALUES (1, 'Zhang Yi', 'male', '1701', '16', 78, '170', '18946554571', '2018-08-06', '2018-08-06');
INSERT INTO `student_cdc` VALUES (2, 'Li Er', 'male', '1701', '17', 80, '175', '18946554572', '2018-08-06', '2018-08-06');
INSERT INTO `student_cdc` VALUES (3, 'Thankson ', 'male', '1702', '18', 95, '169', '18946554573', '2018-08-06', '2018-08-06');
INSERT INTO `student_cdc` VALUES (4, 'Zhao Ling', 'female', '1702', '19', 86, '180', '18956257895', '2018-08-06', '2018-08-06');
INSERT INTO `student_cdc` VALUES (5, 'Zhang Ming', 'male', '1704', '20', 85, '185', '18946554575', '2018-08-07', '2018-08-07');
INSERT INTO `student_cdc` VALUES (6, 'Zhang San', 'female', '1704', '18', 92, '169', '18946554576', '2018-08-06', '2018-08-07');

SET FOREIGN_KEY_CHECKS = 1;

Execute CDC on the mysql command line_ time_ Log.sql script.

cdc_time_log.sql content

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for cdc_time_log
-- ----------------------------
DROP TABLE IF EXISTS `cdc_time_log`;
CREATE TABLE `cdc_time_log`  (
  `id` int(11) NOT NULL,
  `Last execution time` varchar(45) CHARACTER SET latin1 COLLATE latin1_swedish_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = latin1 COLLATE = latin1_swedish_ci ROW_FORMAT = Dynamic;

-- ----------------------------
-- Records of cdc_time_log
-- ----------------------------
INSERT INTO `cdc_time_log` VALUES (1, '2018-08-04');

SET FOREIGN_KEY_CHECKS = 1;

2. Create and design transformation.

(1) The design conversion is as follows

(2) Naming parameter settings: right click in the blank space of conversion and select conversion settings.

Configure the name of the named parameter (cur_time) and the default value ("2018-08-04") in the named parameter tab "named parameters".


(3) Configuration of "table imput":

Named "CDC log table input"

Establish a database connection and set it in the option of database connection

characterEncoding by utf-8

The SQL query statement is:

SELECT Last execution time as last1,'${cur_time}' as cur1,Last execution time as last2,'${cur_time}' as cur2 FROM cdc_time_log

(4) "table input2" setting
The step is named "student table input"
The Sql statement is:

SELECT Student number, full name, Gender, class, Age, achievement, height, mobile phone, Insertion time, Update time, '${cur_time}' as Import time FROM student_cdc WHERE (Insertion time>=? and Insertion time<=?) OR (Update time>? and Update time<=?)


(5) Set the "Microsoft Excel output" step, and set the file output path to "e:/output/student_cdc".

(6) Insert / update step settings

Since the previous part of insert/update is excel, extra fields will appear after you select get update fields. You can configure the corresponding mapping relationship through edit mapping. Where table field is the field in the current table and stream field is the field in the flow of the previous step.

3. Run

(1) In the pop-up dialog box, set the named parameter cur_ The value of time is "2018-08-06". Click the "start" button to output the name student in the path / home/ubuntu_ Cdc.xls file, CDC_ time_ The contents in the log table will be automatically changed to the entered parameter values.

The output content of excel is:

(2) For the second operation, modify the parameter to "2018-08-07".

The output content of excel is:
cdc_ time_ The log content is updated to: August 7, 2018

Trigger based CDC

Experimental principle

Similar to the CDC operation of timestamp and primary key sequence, the difference is that triggers are used to generate incremental conditions.

Experimental steps

  1. Create the required table.

The sql statement is as follows:
Note: this code is a general SQL statement and does not need to be executed. This statement will be explained step by step below. The reader can execute the statements in step by step

create table studentsync like studentinfo;

-- desc studentsync;
ALTER TABLE studentsync MODIFY createtimestamp datetime;
ALTER TABLE studentsync MODIFY modifytimestamp datetime;
ALTER TABLE studentsync ADD synchronizationtime datetime;


-- Create operation log table
DROP TABLE IF EXISTS cdc_opt_log;
create table cdc_opt_log 
(
	SID int PRIMARY KEY,
	optype char(1),
	opflag char(6)
);

-- establish INSERT trigger
DROP TRIGGER IF EXISTS tri_insert_student;
DELIMITER //
CREATE TRIGGER tri_insert_student AFTER INSERT ON studentinfo FOR EACH ROW 
begin
	IF (SELECT 1 FROM cdc_opt_log WHERE SID=new.ID) THEN 
		UPDATE cdc_opt_log SET optype='I',opflag='Untreated' WHERE SID=new.ID;
	ELSE
		INSERT INTO cdc_opt_log(SID, optype, opflag) VALUES (new.ID, 'I', 'Untreated');
	END IF;
end
//
DELIMITER ;

-- establish UPDATE trigger
DROP TRIGGER IF EXISTS tri_update_student;
DELIMITER //
CREATE TRIGGER tri_update_student AFTER UPDATE ON studentinfo FOR EACH ROW 
begin
	IF (SELECT 1 FROM cdc_opt_log WHERE SID=new.ID) THEN 
		UPDATE cdc_opt_log SET optype='U',opflag='Untreated' WHERE SID=new.ID;
	ELSE
		INSERT INTO cdc_opt_log(SID, optype, opflag) VALUES (new.ID, 'U', 'Untreated');
	END IF;
end
//
DELIMITER ;

-- establish DELETE trigger 
DROP TRIGGER IF EXISTS tri_delete_student;
DELIMITER //
CREATE TRIGGER tri_delete_student AFTER DELETE ON studentinfo FOR EACH ROW 
begin
	IF (SELECT 1 FROM cdc_opt_log WHERE SID=old.ID) THEN 
		UPDATE cdc_opt_log SET optype='D',opflag='Untreated' WHERE SID=old.ID;
	ELSE
		INSERT INTO cdc_opt_log(SID, optype, opflag) VALUES (old.ID, 'D', 'Untreated');
	END IF;
end
//
DELIMITER ;


-- Trigger insertion
-- INSERT INTO studentinfo (ID,Name,Gender,Class,Age,Score,Height,PhoneNumber) VALUES (10,'Shan Zhang','M','1704',18,91,171,'19946554576');

-- Trigger delete
-- DELETE FROM studentinfo WHERE Name='Shan Zhang';

-- Trigger update
-- UPDATE studentinfo SET Score=92 WHERE ID=1;

-- Empty operation log table
-- TRUNCATE cdc_opt_log;
  1. Create INSERT trigger

Create INSERT trigger tri_insert_student

DROP TRIGGER IF EXISTS tri_insert_student;
DELIMITER //
CREATE TRIGGER tri_insert_student AFTER INSERT ON studentinfo FOR EACH ROW 
begin
	IF (SELECT 1 FROM cdc_opt_log WHERE SID=new.ID) THEN 
		UPDATE cdc_opt_log SET optype='I',opflag='Untreated' WHERE SID=new.ID;
	ELSE
		INSERT INTO cdc_opt_log(SID, optype, opflag) VALUES (new.ID, 'I', 'Untreated');
	END IF;
end
//
DELIMITER ;

After a record is inserted into the studentinfo table, the trigger will execute to the cdc_opt_log to update or insert a record.
This paragraph is explained as follows:

DROP TRIGGER IF EXISTS tri_insert_student;

Indicates a delete trigger

DELIMITER / / means to modify the DELIMITER to / / to avoid encountering semicolons in MySQL; Execute immediately

CREATE TRIGGER tri_insert_student AFTER INSERT ON studentinfo FOR EACH ROW indicates that the trigger created is tri_insert_student ,
AFTER after AFTER indicates execution AFTER insertion. BEFORE is optional,
The following INSERT indicates that the trigger executes when inserting,
ON studentinfo means that the trigger is defined in a table,
FOR EACH ROW means that each row will be triggered.

begin
	IF (SELECT 1 FROM cdc_opt_log WHERE SID=new.ID) THEN 
		UPDATE cdc_opt_log SET optype='I',opflag='Untreated' WHERE SID=new.ID;
	ELSE
		INSERT INTO cdc_opt_log(SID, optype, opflag) VALUES (new.ID, 'I', 'Untreated');
	END IF;
end

The statement representing the trigger, where new is the supported keyword, represents the modified row, and old represents the row before modification.

DELIMITER ; Indicates that the delimiter is modified back to the default;.

  1. Create UPDATE trigger
    Create INSERT trigger tri_update_student
DROP TRIGGER IF EXISTS tri_update_student;
DELIMITER //
CREATE TRIGGER tri_update_student AFTER UPDATE ON studentinfo FOR EACH ROW 
begin
	IF (SELECT 1 FROM cdc_opt_log WHERE SID=new.ID) THEN 
		UPDATE cdc_opt_log SET optype='U',opflag='Untreated' WHERE SID=new.ID;
	ELSE
		INSERT INTO cdc_opt_log(SID, optype, opflag) VALUES (new.ID, 'U', 'Untreated');
	END IF;
end
//
DELIMITER ;

This statement is basically similar to the insert trigger, and the description will not be repeated

  1. Create DELETE trigger

Create INSERT trigger tri_delete_student

DROP TRIGGER IF EXISTS tri_delete_student;
DELIMITER //
CREATE TRIGGER tri_delete_student AFTER DELETE ON studentinfo FOR EACH ROW 
begin
	IF (SELECT 1 FROM cdc_opt_log WHERE SID=old.ID) THEN 
		UPDATE cdc_opt_log SET optype='D',opflag='Untreated' WHERE SID=old.ID;
	ELSE
		INSERT INTO cdc_opt_log(SID, optype, opflag) VALUES (old.ID, 'D', 'Untreated');
	END IF;
end
//
DELIMITER ;
  1. Conversion design

  1. Specific steps

(1)Table Input : GetCDCOptlog

   SELECT
    optype,
    'Completed' as res,
    SID
   FROM cdc_opt_log
   WHERE opflag='Untreated'

(2)Switch Case : IsDeleteOperation

(3)Table Input: GetStudentCDC
sql statement:

   SELECT
     ID
   , Name
   , Gender
   , Class
   , Age
   , Score
   , Height
   , PhoneNumber
   , createtimestamp
   , modifytimestamp
   , ? as optype
   , ? as res
   , CURRENT_TIMESTAMP as loadtimestamp
   FROM studentinfo
   WHERE ID=?

(4)Insert Update: InsertOrUpdateStudentCDCtoSyncTable
This step updates the newly inserted data to studentsync.

(5)Insert Update: UpdateCDCOptLog
This step will the CDC_ opt_ The optflag field in the log table is modified to the res value (completed)

(6)Delete : DeleteStudentFromSyncTable

This step will delete the record with the specified ID in studentsync
(7)Insert / Update : UpdateCDCOptLog2

So far: the conversion is completed. If necessary, please leave a message and I will send the ktr file to the mailbox

Posted by marq on Wed, 10 Nov 2021 23:31:02 -0800