Dynamic Output (ToB Massive Log Conversion Service) - Aliyun HybridDB for PostgreSQL Best Practice

Keywords: PostgreSQL github SQL

Label

PostgreSQL, UDF, Dynamic Format, Format, JOIN, OSS External Table

background

Some businesses need to categorize data into dynamic outputs, such as some common log services, where all users'logs are recorded in a unified format, but each end user has different fields of interest, and even different needs for data conversion for each user.

For example, this business:

"Real-time analysis of daily increment trillion + level, data regularization - Aliyun HybridDB for PostgreSQL best practices"

I. demand

1. Different formats can be output according to the definition of ToB users.

2. Each ToB user writes to one or more files.

3. One file cannot contain two users'content.

For other needs, see:

"Real-time analysis of daily increment trillion + level, data regularization - Aliyun HybridDB for PostgreSQL best practices"

II. Architectural Design

1. Use OSS to store massive public logs loaded in real time.

2. The OSS external table interface of HybridDB for PostgreSQL or RDS PostgreSQL is used to read OSS files directly and in parallel.

3. By the window function of HDB PG, press ToB ID fully.

4. Through UDF, the format of public log is transformed into corresponding format according to the UDF corresponding to ToB ID.

5. Write the converted data into OSS. Automatically switch by ToB ID, absolutely guaranteeing each ToB user to write to one or more files. One file does not show the contents of two users.

The above functions are unique to Aliyun HybridDB for PostgreSQL or RDS PostgreSQL.

DEMO and Performance

Here we introduce how to transform data dynamically. For more information, see the case.

"Real-time analysis of daily increment trillion + level, data regularization - Aliyun HybridDB for PostgreSQL best practices"

1. Create a common log table

create table t1 (tid int, c1 text, c2 text, c3 int, c4 timestamp, c5 numeric);  

2. Write a batch of test data

insert into t1 select random()*100, md5(random()::text), 'test', random()*10000, clock_timestamp(), random() from generate_series(1,1000000);  

3. Create UDF meta-information table and store the UDF name corresponding to each ToB ID.

create table t2(tid int, udf name);  

4. To create UDF, you need to customize the ToB ID format and create the corresponding UDF.

create or replace function f1(t1) returns text as $$  
  select format('tid: %L , c2: %L , c4: %L', $1.tid, $1.c2, $1.c4);  
$$ language sql strict;  
  
  
create or replace function f2(t1) returns text as $$  
declare  
  res text := format('%L , %L , %L, %L', $1.tid, upper($1.c2), $1.c4, $1.c3);  
begin  
  return res;  
end;  
$$ language plpgsql strict;  

5. Create a dynamic UDF and call the corresponding UDF dynamically according to the input.

create or replace function ff(t1, name) returns text as $$  
declare  
  sql text := format('select %I(%L)', $2, $1);  
  res text;  
begin  
  execute sql into res;  
  return res;  
end;  
$$ language plpgsql strict;  

6. Write UDF mapping, such as ID of 1-100, use F1 for conversion, ID of 0 uses F2 for conversion.

insert into t2 select generate_series(1,100), 'f1';  
insert into t2 values (0, 'f2');  

7. Dynamic conversion queries are as follows:

postgres=# select ff(t1, t2.udf) from t1 join t2 on (t1.tid=t2.tid) where t1.tid=0 limit 10;  
                         ff                            
-----------------------------------------------------  
 '0' , 'TEST' , '2017-08-03 18:55:00.48512', '9478'  
 '0' , 'TEST' , '2017-08-03 18:55:00.486426', '9352'  
 '0' , 'TEST' , '2017-08-03 18:55:00.487297', '4026'  
 '0' , 'TEST' , '2017-08-03 18:55:00.488419', '736'  
 '0' , 'TEST' , '2017-08-03 18:55:00.491082', '4334'  
 '0' , 'TEST' , '2017-08-03 18:55:00.491097', '2394'  
 '0' , 'TEST' , '2017-08-03 18:55:00.491839', '2076'  
 '0' , 'TEST' , '2017-08-03 18:55:00.492648', '9935'  
 '0' , 'TEST' , '2017-08-03 18:55:00.493505', '383'  
 '0' , 'TEST' , '2017-08-03 18:55:00.493874', '8546'  
(10 rows)  
  
postgres=# select ff(t1, t2.udf) from t1 join t2 on (t1.tid=t2.tid) where t1.tid=1 limit 10;  
                            ff                              
----------------------------------------------------------  
 tid: '1' , c2: 'TEST' , c4: '2017-08-03 18:55:00.484799'  
 tid: '1' , c2: 'TEST' , c4: '2017-08-03 18:55:00.485209'  
 tid: '1' , c2: 'TEST' , c4: '2017-08-03 18:55:00.485276'  
 tid: '1' , c2: 'TEST' , c4: '2017-08-03 18:55:00.485744'  
 tid: '1' , c2: 'TEST' , c4: '2017-08-03 18:55:00.48582'  
 tid: '1' , c2: 'TEST' , c4: '2017-08-03 18:55:00.485967'  
 tid: '1' , c2: 'TEST' , c4: '2017-08-03 18:55:00.486067'  
 tid: '1' , c2: 'TEST' , c4: '2017-08-03 18:55:00.486281'  
 tid: '1' , c2: 'TEST' , c4: '2017-08-03 18:55:00.486756'  
 tid: '1' , c2: 'TEST' , c4: '2017-08-03 18:55:00.487714'  
(10 rows)  

8. Compulsory full development in HDB PG:

postgres=# select ff(t1, t2.udf) as ff, row_number() over (partition by t1.tid order by c4) from t1 join t2 on (t1.tid=t2.tid) where t1.tid=0 limit 10;
                         ff                          | row_number 
-----------------------------------------------------+------------
 '0' , 'TEST' , '2017-08-03 18:55:00.48512', '9478'  |          1
 '0' , 'TEST' , '2017-08-03 18:55:00.486426', '9352' |          2
 '0' , 'TEST' , '2017-08-03 18:55:00.487297', '4026' |          3
 '0' , 'TEST' , '2017-08-03 18:55:00.488419', '736'  |          4
 '0' , 'TEST' , '2017-08-03 18:55:00.491082', '4334' |          5
 '0' , 'TEST' , '2017-08-03 18:55:00.491097', '2394' |          6
 '0' , 'TEST' , '2017-08-03 18:55:00.491839', '2076' |          7
 '0' , 'TEST' , '2017-08-03 18:55:00.492648', '9935' |          8
 '0' , 'TEST' , '2017-08-03 18:55:00.493505', '383'  |          9
 '0' , 'TEST' , '2017-08-03 18:55:00.493874', '8546' |         10
(10 rows)

IV. Technical Points

Here I will only talk about the technical points involved in this article.

1,UDF

PostgreSQL supports a variety of UDF languages (such as C, plpgsql, sql, plpython, pljava, plv8,...), and users define formats to be converted through UDF.

2. Dynamic Call

Through dynamic invocation, users can dynamically invoke the corresponding UDF to generate different formats in a request.

V. Cloud end products

Aliyun RDS PostgreSQL

Aliyun HybridDB for PostgreSQL

Aliyun OSS

6. Similar scenarios and cases

"Real-time analysis of daily increment trillion + level, data regularization - Aliyun HybridDB for PostgreSQL best practices"

Seven, summary

In some common log services, all users'logs are recorded in a uniform format, but each end user has different fields of interest, and even different needs for data conversion for each user.

PostgreSQL supports a variety of UDF languages (such as C, plpgsql, sql, plpython, pljava, plv8,...), and users define formats to be converted through UDF.

Through dynamic invocation, users can dynamically invoke the corresponding UDF to generate different formats in a request.

Eight, reference

"Real-time analysis of daily increment trillion + level, data regularization - Aliyun HybridDB for PostgreSQL best practices"

Posted by LarryK on Fri, 14 Dec 2018 07:18:03 -0800