Kettle implements cross-database index calculation

Background note
Target table structure
Create test tables and test data
Query statement
Delete data
Kettle transformation

Delete data
Daily value
Yesterday value
Last month's value
sort
Relation
Cumulative calculation
Output Target Table

summary

Background note

Business requires daily statistics of the current value of business indicators, as well as the calculation of the index's yesterday value, the same period last month value, and the cumulative monthly value/annual value.

Considering the convenience of using the table, the design index table produces an index record every day, at the same time, the field is used to record the index corresponding to yesterday's value and other values.

At the same time, the unnecessary statistical data is reduced. When calculating the cumulative value, the cumulative value of yesterday will be taken from the target and added to the cumulative value of that day to obtain the cumulative value of that day.

Because the target table is not in the same database when calculating the index value of that day, we can not use a single SQL to query, so we need to associate the result set across the database.

So the difficulty in developing this job is the association between databases.

Target table structure

Column names	Column description	Remarks
day_id	date
area_code	region
kpi_code	index
kpi_value	Indicators of the day
kpi_value_ld	Yesterday indicators
kpi_value_lm	Last month's same period value
kpi_value_sm	Cumulative value of the month
kpi_value_sy	Accumulated Value of the Year

Create test tables and test data

Source data table:

drop table if exists test_1;
create table test_1 as
(
select '001' as area_code ,to_date('20190906','yyyymmdd') as create_date ,100 as sum_amount union all
select '001' as area_code ,to_date('20190906','yyyymmdd') as create_date ,200 as sum_amount 
)

Target data sheet:

drop table if exists test_target;
create table test_target as
(
select '20190806' as day_id ,'001' as area_code ,'001' as kpi_code,100 as kpi_value ,100 as kpi_value_ld,100 as kpi_value_lm ,100 as kpi_value_sm,100 as kpi_value_sy union all
select '20190905' as day_id ,'001' as area_code ,'001' as kpi_code,100 as kpi_value ,100 as kpi_value_ld,100 as kpi_value_lm ,100 as kpi_value_sm,200 as kpi_value_sy union all
select '20190906' as day_id ,'001' as area_code ,'001' as kpi_code,200 as kpi_value ,200 as kpi_value_ld,200 as kpi_value_lm ,200 as kpi_value_sm,200 as kpi_value_sy 
)

Query statement

The SQL statement of index query is not complicated. Only job variables need to be used in query, so it is convenient to schedule batches by day.
Date Indicator Value:

select 
'${day_id}' as day_id
,area_code as area_code
,'001' as kpi_code
,sum(sum_amount) as kpi_value
from test_1
where create_date >= to_date('${day_id}','yyyymmdd') and create_date < to_date('${day_id}','yyyymmdd') + interval '1 day'
group by area_code

Yesterday's Indicator Value:

select 
area_code
,kpi_code
,kpi_value as kpi_value_ld
,case when extract(day from to_date('${day_id}','yyyymmdd')) = 1 then 0 else kpi_value_sm end as kpi_value_ldsm
,case when extract(doy from to_date('${day_id}','yyyymmdd')) = 1 then 0 else kpi_value_sy end as kpi_value_ldsy
from test_target 
where day_id = to_char(to_date('${day_id}','yyyymmdd') - interval '1 day','yyyymmdd')

The same period last month:

select 
area_code
,kpi_code
,kpi_value as kpi_value_lm
from test_target 
where day_id = to_char(to_date('${day_id}','yyyymmdd') - interval '1 month','yyyymmdd')

Delete data

In order to allow jobs to run data repeatedly without duplication, it is necessary to delete the data of the day before inserting the data.

delete from test_target
where day_id = '${day_id}'

Kettle transformation

Transform the overall view of the job:

Delete data

Control: Execute database scripts

Emphasis is laid on:

Check variable substitution

Configuration diagram:

Daily value

Control: Table input

Emphasis is laid on:

Check variable substitution
The data source is source.

Configuration diagram:

Yesterday value

Control: Table input

Emphasis is laid on:

Check variable substitution
The data source is target
Writing of Three Cumulative Fields in SQL Statements

Configuration diagram:

Last month's value

Control: Table input

Emphasis is laid on:

Check variable substitution
The data source is target

Configuration diagram:

sort

Because the next step is data association connection, because merging is required before Association connection, this step is needed.

The three sorting steps are similar operations.

Control: Sort Records

Emphasis is laid on:

Specifies a sort as an associated column
Pay attention to field order

Configuration diagram:

Relation

Control: Recordset connection

Emphasis is laid on:

Connection mode
Connection field

Configuration diagram:

Cumulative calculation

The previous record set join operation only performed a left association operation between multiple data sets, and did not calculate the cumulative value of the month and the year we needed.

Control: Calculator

Emphasis is laid on:

Computational formulas
Computational field

In actual development, it is found that:
Add the formula of the second line and drop down field A/field B to select only the result new field of the previous line. The other fields in the previous step cannot be selected.
So write the field name directly instead of selecting.
If you don't want to write directly, add another calculator step.

Configuration diagram:

Output Target Table

Control: Table output

Emphasis is laid on:

Database Connection target
Target table name
Quickly add columns using fetch fields
Delete redundant fields

summary

Through the above settings, we can achieve the single index cross-database association to generate index analysis records.

Implementation results:
You can see that a previous record of the day's error (20190906) has been corrected.
Of course, as a routine scheduling job, the following improvements can be made:

Increase the Kettle job (kjb) as a loop control to call this job to achieve the number of cycles within the date range.
Add a layer of index single-day value table, that is to say, remove the tables of various index analysis fields (cumulative value, synchronous value). So there is no need to associate data sets across databases. The associations between data sets are completely put into database SQL to operate. The development of kettle will be much simpler. This time, we do not want to increase the amount of table data and reduce the number of scheduling tasks. Therefore, the index daily value table is not used.

Posted by Domhnall on Fri, 06 Sep 2019 22:39:54 -0700

Programmer Group

Kettle implements cross-database index calculation