1, Foreword
Recently, we have been using Sqoop + Jenkins to move data between MySQL and Hive.
Sqoop's import command imports MySQL data into Hive, and its export command exports Hive data back to MySQL.
Jenkins handles the scheduling, running the sh scripts on a timer so that the data is synchronized once a day.
Relevant notes are recorded below.
Sqoop feels similar to Kettle; however, Sqoop is said to handle big-data migration better than Kettle.
/* Kettle can direct data wherever specified and can move data between many kinds of stores; Sqoop only transfers data between HDFS and a relational database, and it preserves the data types during the transfer. Because Sqoop calls MapReduce under the hood, its performance on small data volumes is limited: at the hundred-thousand, million, or even ten-million row level, Kettle has the advantage; at the hundred-million row level, Sqoop's advantage is obvious. */
2, Sqoop part
1. Logging in to the Hive database from Linux
In the production environment, Linux logs in to the Hive database with a Kerberos keytab file.
An example command:
#!/bin/bash
kinit -kt /home/admin/keytab/myuser.keytab myuser
beeline -u "jdbc:hive2://xxx.abc.com:10001/;principal=hive/xxx.abc.com@BIGDATA.XXX.COM"
After logging in, most commands are similar to MySQL's. Some common ones are recorded below:
-- Show databases
show databases;
-- Show the tables in the current database
show tables;
-- Switch to database abcDB
use abcDB;
-- Show the columns of the user table
desc user;
-- Show the CREATE TABLE statement for the user table
show create table user;
2. Preparation
Objective: migrate MySQL data to Hive and Hive data to MySQL.
The software installation process is omitted.
(1) First, create a table mytest123 in MySQL:
create table `mytest123` (
  `id` int(11) not null,
  `name` varchar(255) character set utf8 collate utf8_general_ci null default null,
  `create_time` datetime null default null,
  primary key (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci ROW_FORMAT=DYNAMIC;
After creating the table, insert a few rows of data; this is the data that will be migrated to Hive.
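For example, a few made-up test rows (the names and timestamps are illustrative only):
insert into mytest123 (id, name, create_time) values
  (1, 'zhangsan', '2021-10-18 10:00:00'),
  (2, 'lisi',     '2021-10-18 11:30:00'),
  (3, 'wangwu',   '2021-10-19 09:00:00');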
(2) Create a table mytest123 in the Hive database
Log in to the Hive database (for example, with the keytab), then execute the table-creation statement.
If execution fails, it may be a whitespace or newline problem; try writing the statement on one line and running it again.
drop table if exists test.mytest123;
create external table test.mytest123 (
  id int comment 'id',
  name string comment 'full name',
  create_time decimal(8,0) comment 'creation time'
) comment 'Name list (for self-test)'
PARTITIONED BY (import_time string)
STORED AS ORC;
PARTITIONED BY effectively adds an extra column, import_time, marking which partition each row belongs to; filtering on the partition in a select improves search efficiency (see the example query below).
STORED AS specifies the table's storage format.
ORC: data is split into row groups, and each group is stored column by column; it compresses well and is fast to access; it is more efficient than RCFile, of which it is an improved version.
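For example, a query that prunes to a single partition (the date value is illustrative):
-- Hive scans only the files under the import_time=20211018 partition directory,
-- not the whole table
select id, name, create_time from test.mytest123 where import_time = '20211018';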
3. MySQL to Hive
With both tables created, first try importing the MySQL data into Hive.
The core is a sqoop import statement: give it the MySQL connection URL, username, password, and query SQL, plus the Hive-side parameters, i.e. the database name, table name, and the partition key and value (if the table has a partition column), and the data can be migrated.
Note that no Hive connection URL, username, or password needs to be given.
Examples are as follows:
#!/bin/bash
# Receive the incoming date parameter
MY_DATE=$1
MY_SQL="select t1.id as id, t1.name as name, date_format(t1.create_time, '%Y%m%d') as create_time from mytest123 t1 where create_time >= ${MY_DATE} and \$CONDITIONS"
echo $MY_SQL

# mysql
JDBC_MYSQL_URL="jdbc:mysql://10.123.123.123:3306/test"
JDBC_MYSQL_USERNAME="root"
JDBC_MYSQL_PASSWORD="root"

# hive
DATABASE_NAME="test"
TABLE_NAME="mytest123"

# The function that actually runs the import
function execute_sync {
sqoop import -D org.apache.sqoop.splitter.allow_text_splitter=true \
--connect $JDBC_MYSQL_URL \
--username $JDBC_MYSQL_USERNAME \
--password $JDBC_MYSQL_PASSWORD \
--query "$MY_SQL" \
--hcatalog-database $DATABASE_NAME \
--hcatalog-table $TABLE_NAME \
--hive-partition-key "import_time" \
--hive-partition-value "$MY_DATE" \
--null-string '\\N' \
--null-non-string '\\N' \
-m 1
}

# The key is the partition column import_time; the corresponding value is $MY_DATE.
# Each inserted row gets its import_time column set to $MY_DATE.

# Execute
execute_sync
Then run the sh file with a date parameter to synchronize the data from MySQL's test.mytest123 table into Hive's test.mytest123 table:
./mytest.sh 20211018
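A quick way to check the result in Hive afterwards (assuming the same date value):
-- List the partitions that now exist
show partitions test.mytest123;
-- Count the rows that landed in the new partition
select count(*) from test.mytest123 where import_time = '20211018';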
4. Hive to MySQL
Target: synchronize data from Hive's test.mytest123 table into MySQL's test.mytest456 table.
This time the sqoop export statement is used: the target MySQL table is given directly, and query SQL cannot be written.
An example sh script:
#!/bin/bash
# mysql
JDBC_MYSQL_URL="jdbc:mysql://10.123.123.123:3306/test"
JDBC_MYSQL_USERNAME="root"
JDBC_MYSQL_PASSWORD="root"
JDBC_MYSQL_TABLE="mytest456"

# hive
DATABASE_NAME="test"
TABLE_NAME="mytest123"

# The function that actually runs the export
function execute_sync {
sqoop export -D org.apache.sqoop.splitter.allow_text_splitter=true \
--connect $JDBC_MYSQL_URL \
--username $JDBC_MYSQL_USERNAME \
--password $JDBC_MYSQL_PASSWORD \
--table $JDBC_MYSQL_TABLE \
--update-key id \
--update-mode allowinsert \
--hcatalog-database $DATABASE_NAME \
--hcatalog-table $TABLE_NAME \
--columns="id,name,create_time" \
--null-string '\\N' \
--null-non-string '\\N' \
-m 1
}

# Execute
execute_sync
Note:
The MySQL target table is mytest456, in which the create_time field is changed from datetime to varchar.
This is because create_time in Hive is a decimal (for example 20211018); if a datetime column receives it, no error is reported but the value ends up NULL, so it should be received as varchar.
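A sketch of what mytest456 could look like (the column definitions are assumptions inferred from the --columns list above):
create table `mytest456` (
  `id` int(11) not null,
  `name` varchar(255) null default null,
  -- varchar, not datetime, so Hive's decimal(8,0) value (e.g. 20211018) lands intact
  `create_time` varchar(8) null default null,
  primary key (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;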
From MySQL to Hive you can write SQL, so type conversion is easy to handle; from Hive to MySQL with the hcatalog parameters you cannot write SQL, so type conversion is harder.
For Hive-to-MySQL type conversion there is a workaround: query the Hive table data with SQL into a temporary HDFS directory, fetch the files to the local machine, and load them with MySQL's LOAD DATA. This is obviously more troublesome, but SQL handles the table field content very well and is more flexible.
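A minimal sketch of that workaround, with the delimiter and the temporary paths as assumptions:
#!/bin/bash
# 1. In Hive, write the query result to a temporary HDFS directory as delimited text
beeline -u "jdbc:hive2://xxx.abc.com:10001/;principal=hive/xxx.abc.com@BIGDATA.XXX.COM" -e "
insert overwrite directory '/tmp/mytest123_export'
row format delimited fields terminated by ','
select id, name, create_time from test.mytest123 where import_time = '20211018';"

# 2. Fetch the result files to the local disk
hdfs dfs -get /tmp/mytest123_export /home/admin/export

# 3. Load them into MySQL (000000_0 is a typical Hive output file name; local_infile must be enabled)
mysql --local-infile=1 -h 10.123.123.123 -uroot -proot test -e "
load data local infile '/home/admin/export/mytest123_export/000000_0'
into table mytest456 fields terminated by ',' (id, name, create_time);"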
An example launch command:
./mytest2.sh
3, Jenkins part
1. Function
Here, Jenkins is mainly used to execute the sh tasks on a schedule.
The installation steps are omitted; the usage steps are shown below.
2. Usage steps
(1) Log in to Jenkins
(2) Click "new task" on the left
(3.1) Enter a task name, such as mytest, and select "create from an existing task" below to copy an existing task. This method is not demonstrated here.
(3.2) Enter the task name, click "build a multi-configuration project", check "add to current view", and click "OK".
(4) On the General tab, write a description; select "This project is parameterized" -> "Add Parameter" -> "Date Parameter" and configure a date parameter to pass to the sh script, as follows:
Date Parameter
Name: DATE
Date Format: yyyyMMdd
Default Value: LocalDate.now().minusDays(1)
Description: date_param
The configured DATE parameter defaults to the current date minus one day.
Click the question mark to see the details.
(5) Advanced project options: advanced configuration, not set here.
(6) Source code management: when automatically publishing jar/war packages, the git repository path can be configured here; not set here.
(7) Build triggers: select "Build periodically" and configure the schedule, for example 3:00 a.m. every day:
H 3 * * *
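The five fields are minute, hour, day of month, month, and day of week; Jenkins' H token picks a stable hash-based value so that jobs don't all fire at the same instant:
# MINUTE HOUR DOM MONTH DOW
# "H 3 * * *" -> once a day, at a hash-chosen minute within the 3 a.m. hour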
Another commonly used trigger is "Build after other projects are built".
(8) Configuration Matrix: matrix parameters can be configured here; not set here.
(9) The build environment is not set here.
(10) Build: click "Add build step" -> "Execute shell", then enter the shell command, for example:
cd /home/admin/shell
sh mytest.sh $DATE
This passes the $DATE parameter to mytest.sh when it runs.
(11) Post-build actions are not set here.
(12) Click save.
(13) Find the newly created mytest task in the Jenkins task list. Because a schedule is configured, it will run at 3:00 every day with the date parameter set to the current date minus one day.
To run it immediately, open the task, click "Build with Parameters", adjust the DATE parameter for this run, and click "Build"; the task executes right away.
4, Other notes
1. The SQL where clause must include "and \$CONDITIONS" (what Sqoop does with this token is illustrated after these notes).
2. The SQL parameter must be wrapped in double quotes, and when the command continues onto the next line the parameter must be followed by a backslash: --query "$MY_SQL" \
3. To export Hive data stored in Parquet+Snappy format, use the hcatalog parameters. If they are not used, the workaround above also works: query the Hive table data into a temporary HDFS directory, fetch it locally, and load it with MySQL's LOAD DATA (see the sketch in the Hive-to-MySQL section). This is obviously more troublesome, but SQL handles the table field content very well and is more flexible.
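For reference, roughly what Sqoop ends up executing after substituting $CONDITIONS (illustrative, not captured from a real run):
-- with -m 1, the token typically becomes a trivial predicate:
select t1.id as id, t1.name as name, date_format(t1.create_time, '%Y%m%d') as create_time from mytest123 t1 where create_time >= 20211018 and (1 = 1)
-- with several mappers, each one instead gets a range predicate on the --split-by column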