Wang Sicong's "Eat It" Bet Project - Shared Power Banks - DB Design and Practice for a Management and Sales Analysis System

Keywords: PostgreSQL, GitHub, Greenplum, SQL

Tags

PostgreSQL, Internet of Things, shared power banks, business analysis system, multi-table joins, detail backfilling, join elimination, ltree tree type, parallel computing, PostgreSQL 10 features, sales management

Background

Shared power banks, shared bicycles, shared umbrellas, shared cars, and more: the sharing economy has grown very rapidly in recent years.

Any sharing business involves managing the shared objects, the members, and so on; in essence it is also an Internet of Things system.

This article takes shared power banks as the example scenario and walks through the business analysis and the back-end database design of a management and sales analysis system. (The boss cares about overall sales performance and per-channel breakdowns; a sales manager cares about the performance of the region under his jurisdiction; operations staff care about device status.)

1. Data Structure and Data Volume

What is the business model?

Shared power banks can be found in hotels, shopping malls, railway stations, foot-massage parlors, and other venues. Each power bank has a relatively fixed location (for example, inside a particular restaurant), each location has a relatively fixed salesperson (much like protection-fee turf in a gangster movie), and each salesperson has a fixed superior.

Borrowing a power bank is very simple: the user scans a QR code, places an order, and takes the power bank. Some units cannot be taken away; in that case the user scans the code, places an order, and charges on the spot.

(Beyond the charging business itself, we could also cooperate with merchants on user profiling, advertising, and promotion. Of course, the prerequisite is having the user profiles.)

Data structure abstraction

1. Personnel table (boss, sales managers, store salespeople).

Estimated data volume: 3,000+; rarely updated.

2. Category table (foot-massage parlors, hotels, railway stations, etc.)

Estimated data volume: 100+; rarely updated.

3. Store table

Estimated data volume: under one million; rarely updated.

4. Device table

Estimated data volume: around one million; each device reports a heartbeat every N minutes.

5. Order table

Estimated data volume: about one million per day; each order is inserted once and updated at least once (order creation, payment, return, etc.), and every order reaches a final status.

2. Requirements Analysis

1. Real-time analysis requirements:

Pivot analysis along the day, month, and year time dimensions, and along business dimensions: overall, individual employees, an employee's direct subordinates, an employee's entire subordinate tree, categories, stores, devices, and so on.

2. Aggregation metrics:

Number of new devices, online devices, offline devices; new orders, paid orders, refunded orders; account turnover, and so on.

3. Time requirements:

Statistics are needed for the current day's orders up to the time of the query, compared against the previous day at the same cut-off time (same-time day-over-day). Monthly and yearly statistics are needed as well.

4. Query concurrency:

Query concurrency in an analysis system is usually not high, since it is used internally; likely no more than 3,000 queries per minute.

5. Query timeliness:

Monthly and yearly statistics are generated offline once a day. (This is recommended: there is no need to watch last month's business metrics change in real time.)

Day-level statistics are generated in real time. (The per-day data volume is not large, so generating and querying it in real time meets the concurrency and response-time requirements as well as the business needs.)

Response-time requirement: tens of milliseconds.

Concurrency requirement: under 100.

3. Database Selection

PostgreSQL 10: an HTAP database that supports mixed OLTP and OLAP workloads at the 10 TB scale. TP performance is strong and the feature set is rich; it supports multi-core parallel computing, parallel hash joins, and a series of other powerful capabilities, so AP performance is decent as well.

HybridDB for PostgreSQL (Greenplum): a PB-scale, purely analytical database with multi-machine parallel computing. AP performance is strong, but TP performance is very weak.

For more details, please refer to:

"Space | Time | Object Circle People + Perspective - and PostgreSQL 10 and Greenplum's Contrast and Selection"

Which one should we choose for this scenario? Let's sketch a demo design with each and compare.

4. PostgreSQL 10 Scheme 1

Table structure design

create table a (          -- employee hierarchy    
  id int primary key,     -- employee ID    
  nick name,              -- name    
  pid int                 -- superior's ID    
);    
    
create table c (          -- categories    
  id int primary key,     -- category ID    
  comment text            -- category name    
);    
    
create table b (          -- terminal stores    
  id int primary key,     -- store ID    
  nick text,              -- store name    
  cid int,                -- category ID    
  aid int                 -- salesperson (store manager) ID    
);    
    
create table d (          -- devices    
  id int primary key,     -- device ID    
  bid int,                -- store ID    
  alive_ts timestamp      -- last device heartbeat time    
);    
    
create table log (        -- order log    
  did int,                -- device ID    
  state int2,             -- final order status (1 = paid, 0 = refunded)    
  crt_time timestamp,     -- order creation time    
  mod_time timestamp      -- order modification time    
) partition by range (crt_time);    
    
create table log_201701 partition of log for values from ('2017-01-01') to ('2017-02-01') with (parallel_workers =32);     
create table log_201702 partition of log for values from ('2017-02-01') to ('2017-03-01') with (parallel_workers =32);      
create table log_201703 partition of log for values from ('2017-03-01') to ('2017-04-01') with (parallel_workers =32);      
create table log_201704 partition of log for values from ('2017-04-01') to ('2017-05-01') with (parallel_workers =32);      
create table log_201705 partition of log for values from ('2017-05-01') to ('2017-06-01') with (parallel_workers =32);      
create table log_201706 partition of log for values from ('2017-06-01') to ('2017-07-01') with (parallel_workers =32);      
create table log_201707 partition of log for values from ('2017-07-01') to ('2017-08-01') with (parallel_workers =32);      
create table log_201708 partition of log for values from ('2017-08-01') to ('2017-09-01') with (parallel_workers =32);      
create table log_201709 partition of log for values from ('2017-09-01') to ('2017-10-01') with (parallel_workers =32);      
create table log_201710 partition of log for values from ('2017-10-01') to ('2017-11-01') with (parallel_workers =32);      
create table log_201711 partition of log for values from ('2017-11-01') to ('2017-12-01') with (parallel_workers =32);      
create table log_201712 partition of log for values from ('2017-12-01') to ('2018-01-01') with (parallel_workers =32);      
create table log_201801 partition of log for values from ('2018-01-01') to ('2018-02-01') with (parallel_workers =32);    
    
create index idx_log_201701_1 on log_201701 using btree (crt_time) ;    
create index idx_log_201702_1 on log_201702 using btree (crt_time) ;    
create index idx_log_201703_1 on log_201703 using btree (crt_time) ;    
create index idx_log_201704_1 on log_201704 using btree (crt_time) ;    
create index idx_log_201705_1 on log_201705 using btree (crt_time) ;    
create index idx_log_201706_1 on log_201706 using btree (crt_time) ;    
create index idx_log_201707_1 on log_201707 using btree (crt_time) ;    
create index idx_log_201708_1 on log_201708 using btree (crt_time) ;    
create index idx_log_201709_1 on log_201709 using btree (crt_time) ;    
create index idx_log_201710_1 on log_201710 using btree (crt_time) ;    
create index idx_log_201711_1 on log_201711 using btree (crt_time) ;    
create index idx_log_201712_1 on log_201712 using btree (crt_time) ;    
create index idx_log_201801_1 on log_201801 using btree (crt_time) ;    

Data initialization

1. Initialize the employee hierarchy (ID 0 is the boss, IDs 1-30 are sales managers, IDs 31-3000 are store salespeople).

do language plpgsql $$    
declare     
begin    
  truncate a;    
  insert into a select generate_series(0,3000);    
  update a set pid=0 where id between 1 and 30;    
  for i in 1..30 loop    
    update a set pid=i where id between 31+100*(i-1) and 31+100*i-1;    
  end loop;    
end;    
$$;    
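A quick sanity check of the generated hierarchy (illustrative; the expected values follow directly from the loop above):

select id, pid from a where id in (0, 1, 31, 131);    
 id  | pid     
-----+-----    
   0 |        
   1 |   0    
  31 |   1    
 131 |   2    
(4 rows)    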

2. Initialize categories

insert into c select generate_series(1,100);    

3. Initialize stores

insert into b select generate_series(1,500000), '', ceil(random()*100), 30+ceil(random()*(3000-30));    

4. Initialize devices

insert into d select generate_series(1,1000000), ceil(random()*500000);    

5. Generate one year of orders: about 365 million planned, 378 million actually written (roughly 1 million orders per day; 90% paid, 10% refunded)

do language plpgsql $$    
declare    
  s date := '2017-01-01';    
  e date := '2017-12-31';    
begin    
  for x in 0..(e-s) loop    
    insert into log     
      select ceil(random()*1000000), case when random()<0.1 then 0 else 1 end, s + x + (i||' second')::interval     
      from generate_series(0,86399) t(i),     
           generate_series(1,12);      -- 12 ≈ 1,000,000 orders per day / 86,400 seconds; written this way purely for test-data convenience      
  end loop;    
end;    
$$;    
  
  
  
postgres=# select count(*) from log;  
   count     
-----------  
 378432001  
(1 row)  

6. Indexes (optional optimization)

It is recommended to index hot, real-time data with btree, and static historical data with BRIN block-range indexes, dropping the btree indexes on partitions that have gone static.

Example

When order data has become static history, drop the old btree indexes on those partitions and create the following brin indexes instead.
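Since the brin indexes below reuse the same names, drop the old btree indexes first, for example:

drop index idx_log_201701_1;    
drop index idx_log_201702_1;    
-- ... and so on for every partition that has gone static    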

create index idx_log_201701_1 on log_201701 using brin (crt_time) ;    
create index idx_log_201702_1 on log_201702 using brin (crt_time) ;    
create index idx_log_201703_1 on log_201703 using brin (crt_time) ;    
create index idx_log_201704_1 on log_201704 using brin (crt_time) ;    
create index idx_log_201705_1 on log_201705 using brin (crt_time) ;    
create index idx_log_201706_1 on log_201706 using brin (crt_time) ;    
create index idx_log_201707_1 on log_201707 using brin (crt_time) ;    
create index idx_log_201708_1 on log_201708 using brin (crt_time) ;    
create index idx_log_201709_1 on log_201709 using brin (crt_time) ;    
create index idx_log_201710_1 on log_201710 using brin (crt_time) ;    
create index idx_log_201711_1 on log_201711 using brin (crt_time) ;    
create index idx_log_201712_1 on log_201712 using brin (crt_time) ;    
create index idx_log_201801_1 on log_201801 using brin (crt_time) ;    

Create the necessary UDFs

1. Create immutable functions that return the current date and the current timestamp (optionally shifted back by an interval, such as one day). (Declaring them immutable lets the optimizer prune partitions that do not need to be scanned; to allow parallel plans, also mark them parallel safe.)

create or replace function cdate() returns date as $$    
  select current_date;    
$$ language sql strict immutable PARALLEL safe;    
    
    
create or replace function cts(interval default '0') returns timestamp as $$        
  select (now() - $1)::timestamp;    
$$ language sql strict immutable PARALLEL safe;    
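To verify that partition pruning actually happens, EXPLAIN a query that uses these functions; with the immutable functions in place, only the partition covering the current date should appear in the plan (a quick check; with this article's test date that would be log_201709):

explain select count(*) from log where crt_time between cdate() and cts();    
-- the plan should scan only the current month's partition, not all 13    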

Pivot SQL design

Pivot by person: query all levels of subordinates, their stores, their devices, and the related orders.

Output statistics:

1. Aggregate metrics:

Total orders and paid orders so far today, and the previous day's totals at the same cut-off time (total and paid).

Month-to-date total orders and paid orders, and the previous month's figures at the same cut-off (total and paid).

Year-to-date total orders and paid orders, and the previous year's figures at the same cut-off (total and paid).

2. Aggregation dimensions:

Overall, TOP

Category, TOP

Store, TOP

All subordinates, TOP

All subordinates, categories, TOP

All subordinates, stores, TOP

Store Manager, TOP

Store Manager, Category, TOP

Store Manager, Store, TOP

Pivot SQL performance examples

1. Overall pivot, parallelism 32, 77 milliseconds.

select t1.cnt, t1.succ_cnt, t2.cnt, t2.succ_cnt from    
(    
  select count(*) cnt, sum(state) succ_cnt from log where crt_time between cdate() and cts()    
) t1,    
(    
  select count(*) cnt, sum(state) succ_cnt from log where crt_time between cdate()-1 and cts(interval '1 day')    
) t2;    
  cnt   | succ_cnt |  cnt   | succ_cnt     
--------+----------+--------+----------    
 796621 |   716974 | 796620 |   716930    
(1 row)    
    
Time: 76.697 ms    

2. Category TOP, parallelism 32, 446 milliseconds.

select c.id, count(*) cnt, sum(state) succ_cnt from c     
    join b on (c.id=b.cid)     
    join d on (b.id=d.bid)     
    join log on (d.id=log.did)     
  where crt_time between cdate() and cts()    
  group by c.id    
  order by cnt desc limit 10;    
    
 id | cnt  | succ_cnt     
----+------+----------    
 39 | 8369 |     7543    
 70 | 8346 |     7517    
 64 | 8281 |     7488    
 13 | 8249 |     7412    
 29 | 8222 |     7427    
  3 | 8217 |     7370    
 90 | 8200 |     7387    
 79 | 8199 |     7346    
 71 | 8175 |     7348    
 75 | 8169 |     7373    
(10 rows)    
    
Time: 446.977 ms    

3. My total sales (including all subordinates), 464 milliseconds.

WITH RECURSIVE syntax is used here: starting from the logged-in user's ID, the query walks the tree to find all subordinates.

with recursive tmp as (    
  select * from a where id=31                -- Enter my USER ID    
  union all     
  select a.* from a join tmp on (a.pid=tmp.id)     
)    
select count(*) cnt, sum(state) succ_cnt from tmp     
  join b on (tmp.id=b.aid)    
  join d on (b.id=d.bid)    
  join log on (d.id=log.did)    
  where crt_time between cdate() and cts()    
  ;    
 cnt | succ_cnt     
-----+----------    
 296 |      268    
(1 row)    
    
Time: 463.970 ms    

4. My direct subordinates, TOP, 2.6 seconds.

WITH RECURSIVE syntax is used here: starting from the logged-in user's ID, the query walks the tree to find all subordinates.

Regular expressions are used to group the results by direct subordinate and aggregate each one's sales; see the quick check after this query.

with recursive tmp as (               
  select id::text from a where id=0   -- Enter my USER ID      
  union all     
  select tmp.id||'.'||a.id as id from a join tmp on (a.pid=substring(tmp.id, '([\d]+)$')::int)     
)    
select substring(tmp.id, '^[\d]*\.?([\d]+)'), count(*) cnt, sum(state) succ_cnt from tmp     
  join b on (substring(tmp.id, '([\d]+)$')::int=b.aid)    
  join d on (b.id=d.bid)    
  join log on (d.id=log.did)    
  where crt_time between cdate() and cts()    
  group by 1    
  order by cnt desc limit 10    
  ;    
    
 substring |  cnt  | succ_cnt    
-----------+-------+----------    
 15        | 27341 |    24615    
 19        | 27242 |    24500    
 17        | 27190 |    24481    
 26        | 27184 |    24481    
 9         | 27179 |    24466    
 3         | 27157 |    24323    
 6         | 27149 |    24481    
 1         | 27149 |    24402    
 21        | 27141 |    24473    
 12        | 27140 |    24439    
(10 rows)    
    
Time: 2661.556 ms (00:02.662)    
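As a quick check of what the two regular expressions extract (substring() returns the first parenthesized capture group), using a made-up path:

select substring('0.15.1234', '([\d]+)$');          -- 1234: the leaf (store salesperson) ID    
select substring('0.15.1234', '^[\d]*\.?([\d]+)');  -- 15:   the direct subordinate under the root    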

5. All my subordinates (recursive), TOP, 642 milliseconds.

WITH RECURSIVE syntax is used here: starting from the logged-in user's ID, the query walks the tree to find all subordinates.

with recursive tmp as (     
  select * from a where id=30   -- Enter my USER ID     
  union all     
  select a.* from a join tmp on (a.pid=tmp.id)     
)    
select tmp.id, count(*) cnt, sum(state) succ_cnt from tmp     
  join b on (tmp.id=b.aid)    
  join d on (b.id=d.bid)    
  join log on (d.id=log.did)    
  where crt_time between cdate() and cts()    
  group by tmp.id     
  order by cnt desc limit 10    
  ;    
  id  | cnt | succ_cnt     
------+-----+----------    
 2996 | 385 |      353    
 2969 | 339 |      301    
 2935 | 335 |      312    
 2936 | 332 |      304    
 2988 | 326 |      290    
 2986 | 321 |      295    
 2960 | 319 |      293    
 2964 | 313 |      276    
 2994 | 309 |      268    
 2975 | 308 |      276    
(10 rows)    
    
Time: 641.719 ms    

5. PostgreSQL 10 Scheme 2 - Extreme Optimization

What can be optimized in Scheme 1?

As you can see, even with parallelism, some pivot queries still take well over 100 milliseconds.

The main cost is in the JOINs. The hash joins already run in parallel, but there is a neat further optimization: automatically backfill the relevant upstream information (salesperson path as ltree, category, store) into each order as it is written.

Once the orders carry this information, pivots require no JOIN at all.

How do we backfill it?

The backfilled salesperson field must carry the entire reporting chain, so we use PostgreSQL's ltree tree type to store it.

When an order is written, a trigger looks up the device number and automatically fills in (salesperson path as an ltree, category ID, store ID).

1. Install the ltree tree type

create extension ltree;     

2. Create a composite type holding the tree path, category, and store.

create type ntyp as (lt ltree, cid int, bid int);    

Add the new column to the order table:

alter table log add column addinfo ntyp;  

3. Create materialized view 1 to hold the current employee structure. Once materialized, no recursive query is needed at lookup time.

CREATE MATERIALIZED VIEW mv1 as     
select id, (    
  with recursive tmp as (    
  select id::text as path from a where id=t.id    
  union all     
  select a.pid||'.'||tmp.path as path from a join tmp on (a.id=substring(tmp.path, '^([\d]+)')::int)     
  )    
  select * from tmp order by length(path) desc nulls last limit 1    
) from a as t;    

3.1. Create a unique index

create unique index mv1_uk1 on mv1 (id);    

3.2. Refresh method: whenever the employee structure changes, just refresh the view. The refresh is very fast.

refresh materialized view CONCURRENTLY mv1;    
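With the data initialized earlier, mv1 holds each employee's full path from the boss down to himself, for example (illustrative output):

select * from mv1 where id in (0, 1, 31);    
 id |  path      
----+--------    
  0 | 0    
  1 | 0.1    
 31 | 0.1.31    
(3 rows)    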

4. Create materialized view 2 for device lookups (category and store IDs). Once materialized, the category and store can be fetched directly by device number.

CREATE MATERIALIZED VIEW mv2 as     
select a.id as aid, c.id as cid, b.id as bid, d.id as did from     
  a join b on (a.id=b.aid)     
    join c on (c.id=b.cid)     
    join d on (d.bid=b.id)    
;    

4.1. Create a unique index

create unique index mv2_uk1 on mv2(did);    

4.2. Refresh method: whenever the device-store or store-category mapping changes, refresh the view concurrently. The refresh is very fast.

refresh materialized view CONCURRENTLY mv2;    

5. Create a function that, given a device number, returns (salesperson path as ltree, category ID, store ID):

create or replace function gen_res (vdid int) returns ntyp as $$    
  select (mv1.path, mv2.cid, mv2.bid)::ntyp from     
  mv1 join mv2 on (mv1.id=mv2.aid) where mv2.did=vdid;    
$$ language sql strict;    

6. Create triggers on the order partitions to backfill the relationships automatically (device -> store -> category, and salesperson -> reporting chain)

create or replace function tg() returns trigger as $$    
declare    
begin    
  NEW.addinfo := gen_res(NEW.did);    
  return NEW;    
end;    
$$ language plpgsql strict;    
    
create trigger tg before insert on log_201701 for each row execute procedure tg();    
create trigger tg before insert on log_201702 for each row execute procedure tg();    
create trigger tg before insert on log_201703 for each row execute procedure tg();    
create trigger tg before insert on log_201704 for each row execute procedure tg();    
create trigger tg before insert on log_201705 for each row execute procedure tg();    
create trigger tg before insert on log_201706 for each row execute procedure tg();    
create trigger tg before insert on log_201707 for each row execute procedure tg();    
create trigger tg before insert on log_201708 for each row execute procedure tg();    
create trigger tg before insert on log_201709 for each row execute procedure tg();    
create trigger tg before insert on log_201710 for each row execute procedure tg();    
create trigger tg before insert on log_201711 for each row execute procedure tg();    
create trigger tg before insert on log_201712 for each row execute procedure tg();    
create trigger tg before insert on log_201801 for each row execute procedure tg();    

7. Effect

postgres=# insert into log values (1,1,now());    
INSERT 0 1    
    
    
postgres=# select * from log_201709 where did=1;    
 did | state |          crt_time          | mod_time |        addinfo            
-----+-------+----------------------------+----------+-----------------------    
   1 |     1 | 2017-09-23 16:58:47.736402 |          | (0.17.1702,60,417943)    

8. Backfill old data: fill in rows where addinfo is still empty (for example, when an order arrived for a brand-new device that had not yet been refreshed into mv1 and mv2).

update log set addinfo=gen_res(did) where addinfo is null;    
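For a large history table, it may be gentler to backfill one partition at a time (the same logic, just smaller transactions; a sketch):

update log_201701 set addinfo=gen_res(did) where addinfo is null;    
update log_201702 set addinfo=gen_res(did) where addinfo is null;    
-- ... and so on, partition by partition, to keep transaction size and bloat manageable    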

Pivoting on the backfilled data (avoiding JOINs entirely), with parallelism on: blazing fast

1. Overall pivot (query unchanged from Scheme 1; performance already solid), 74 milliseconds.

select t1.cnt, t1.succ_cnt, t2.cnt, t2.succ_cnt from    
(    
  select count(*) cnt, sum(state) succ_cnt from log where crt_time between cdate() and cts()    
) t1,    
(    
  select count(*) cnt, sum(state) succ_cnt from log where crt_time between cdate()-1 and cts(interval '1 day')    
) t2;    
    
  cnt   | succ_cnt |  cnt   | succ_cnt     
--------+----------+--------+----------    
 836965 |   753286 | 836964 |   753178    
(1 row)    
    
Time: 74.205 ms    

2. Category TOP, 41 milliseconds.

postgres=# select (log.addinfo).cid, count(*) cnt, sum(state) succ_cnt from log    
  where crt_time between cdate() and cts()     
  group by (log.addinfo).cid    
  order by cnt desc limit 10;     
    
 cid | cnt  | succ_cnt     
-----+------+----------    
  70 | 8796 |     7919    
  39 | 8793 |     7930    
  64 | 8700 |     7863    
  13 | 8659 |     7777    
  29 | 8621 |     7787    
  71 | 8613 |     7739    
  79 | 8613 |     7719    
   3 | 8597 |     7714    
  75 | 8590 |     7747    
  90 | 8579 |     7725    
(10 rows)    
    
Time: 41.221 ms    

3. My total sales (including all subordinates), 41 milliseconds

select count(*) cnt, sum(state) succ_cnt from log     
  where crt_time between cdate() and cts()    
  and (log.addinfo).lt ~ '*.1.*'    -- total sales for USER ID = 1 (including all subordinates)    
  ;    
    
  cnt  | succ_cnt     
-------+----------    
 28502 |    25627    
(1 row)    
    
Time: 41.065 ms    
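The ~ operator matches an ltree value against an lquery pattern; '*.1.*' matches any path that contains the label 1 at any position. A quick check:

select '0.1.120'::ltree ~ '*.1.*'::lquery;   -- true:  user 1 is on the path    
select '0.2.220'::ltree ~ '*.1.*'::lquery;   -- false: user 1 is not on the path    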

4. My direct subordinates, TOP

Boss's view, 111 milliseconds.

select substring(((log.addinfo).lt)::text, '\.?(0\.?[\d]*)'),   -- direct subordinates of USER ID = 0; replace 0 with the logged-in user's ID    
  count(*) cnt, sum(state) succ_cnt from log     
  where crt_time between cdate() and cts()    
  and (log.addinfo).lt ~ '*.0.*'                                -- USER ID = 0; replace with the logged-in user's ID    
  group by 1                                                    -- group by the first output column    
  order by cnt desc limit 10    
;    
    
 substring |  cnt  | succ_cnt     
-----------+-------+----------    
 0.19      | 28656 |    25756    
 0.15      | 28655 |    25792    
 0.26      | 28560 |    25721    
 0.1       | 28548 |    25668    
 0.9       | 28545 |    25701    
 0.6       | 28506 |    25706    
 0.12      | 28488 |    25646    
 0.17      | 28485 |    25652    
 0.21      | 28469 |    25665    
 0.3       | 28459 |    25486    
(10 rows)    
    
Time: 111.221 ms    

Level-1 sales manager's view, 41 milliseconds

select substring(((log.addinfo).lt)::text, '\.?(1\.?[\d]*)'),   -- direct subordinates of USER ID = 1; replace 1 with the logged-in user's ID    
  count(*) cnt, sum(state) succ_cnt from log     
  where crt_time between cdate() and cts()    
  and (log.addinfo).lt ~ '*.1.*'                                -- USER ID = 1; replace with the logged-in user's ID    
  group by 1                                                    -- group by the first output column    
  order by cnt desc limit 10    
;    
    
 substring | cnt | succ_cnt     
-----------+-----+----------    
 1.120     | 368 |      320    
 1.59      | 367 |      331    
 1.54      | 357 |      316    
 1.93      | 344 |      313    
 1.80      | 342 |      306    
 1.37      | 338 |      305    
 1.64      | 334 |      298    
 1.90      | 329 |      299    
 1.66      | 327 |      296    
 1.109     | 326 |      293    
(10 rows)    
    
Time: 41.276 ms    

5. All my subordinates (recursive), TOP

Boss's view (TOP across all leaf salespeople), 231 milliseconds.

select (log.addinfo).lt,                                        -- all subordinates (recursive)    
  count(*) cnt, sum(state) succ_cnt from log     
  where crt_time between cdate() and cts()    
  and (log.addinfo).lt ~ '*.0.*'                                -- USER ID = 0; replace with the logged-in user's ID    
  group by 1                                                    -- group by the first output column    
  order by cnt desc limit 10    
;    
    
    lt     | cnt | succ_cnt     
-----------+-----+----------    
 0.30.2996 | 405 |      371    
 0.28.2796 | 402 |      350    
 0.21.2093 | 393 |      347    
 0.3.234   | 391 |      356    
 0.14.1332 | 381 |      347    
 0.13.1283 | 381 |      344    
 0.19.1860 | 380 |      347    
 0.16.1553 | 380 |      341    
 0.28.2784 | 377 |      346    
 0.7.672   | 377 |      347    
(10 rows)    
    
Time: 230.630 ms    

Level-1 sales manager's view, 41 milliseconds

select (log.addinfo).lt,                                        -- all subordinates (recursive)    
  count(*) cnt, sum(state) succ_cnt from log     
  where crt_time between cdate() and cts()    
  and (log.addinfo).lt ~ '*.1.*'                                -- USER ID = 1; replace with the logged-in user's ID    
  group by 1                                                    -- group by the first output column    
  order by cnt desc limit 10    
;    
    
   lt    | cnt | succ_cnt     
---------+-----+----------    
 0.1.59  | 367 |      331    
 0.1.120 | 367 |      320    
 0.1.54  | 355 |      315    
 0.1.93  | 344 |      313    
 0.1.80  | 341 |      305    
 0.1.37  | 338 |      305    
 0.1.64  | 334 |      298    
 0.1.90  | 328 |      298    
 0.1.66  | 327 |      296    
 0.1.109 | 325 |      293    
(10 rows)    
    
Time: 41.558 ms    

Benefits of backfilling addinfo into orders

1. When the personnel structure, categories, or stores change, the business side decides whether the backfilled data in historical orders (salesperson ltree, category, store) needs to be revised.

2. In fact, the original JOIN-at-query-time approach is problematic: suppose salesperson A placed the devices, and a month later the person in charge changes. With real-time JOINs, last month's orders would be credited to the new person in charge, which is clearly wrong.

3. So the backfill approach is more accurate: credit goes to whoever actually owned the sale at the time, with no mix-ups (misattributing sales is serious; it affects people's compensation).

6. Summary of PostgreSQL 10

What PostgreSQL database features are used?

1. Recursive Query

2. Parallel Query

3. JOIN Method

4. Inheritance (partition table)

5. Triggers

6. Composite types

7. ltree tree type

https://www.postgresql.org/docs/9.6/static/ltree.html

7. Greenplum

Greenplum Scheme 1

Note that Greenplum's TP capability is weak, as mentioned earlier. If device heartbeats were updated in real time and orders written and updated in real time, it might not hold up under the pressure. (At present Greenplum's UPDATE and DELETE take full-table locks, which are very coarse.)

Therefore the design must batch the device heartbeat updates (for example, export the full device table from the TP database to Greenplum every few minutes), and convert order updates into inserts (via a RULE).

Table Structure Design

create table a (          -- employee hierarchy    
  id int primary key,     -- employee ID    
  nick name,              -- name    
  pid int                 -- superior's ID    
) DISTRIBUTED BY(id);    
    
create table c (          -- categories    
  id int primary key,     -- category ID    
  comment text            -- category name    
) DISTRIBUTED BY(id);    
    
create table b (          -- terminal stores    
  id int primary key,     -- store ID    
  nick text,              -- store name    
  cid int,                -- category ID    
  aid int                 -- salesperson (store manager) ID    
) DISTRIBUTED BY(id);    
    
create table d (          -- devices    
  id int primary key,     -- device ID    
  bid int,                -- store ID    
  alive_ts timestamp      -- last device heartbeat time    
) DISTRIBUTED BY(id);    
    
create table log1 (        -- order log: orders as created    
  did int,                -- device ID    
  state int2,             -- final order status    
  crt_time timestamp,     -- order creation time    
  mod_time timestamp      -- order modification time    
) DISTRIBUTED BY(did)     
PARTITION BY range (crt_time)    
(start (date '2017-01-01') inclusive end (date '2018-01-01') exclusive every (interval '1 month'));     
    
create table log2 (        -- order log: final order status    
  did int,                -- device ID    
  state int2,             -- final order status    
  crt_time timestamp,     -- order creation time    
  mod_time timestamp      -- order modification time    
) DISTRIBUTED BY(did)     
PARTITION BY range (crt_time)    
(start (date '2017-01-01') inclusive end (date '2018-01-01') exclusive every (interval '1 month'));     
    
-- create a rule converting updates into inserts    
create rule r1 as on update to log1 do instead insert into log2 values (NEW.*);    
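With this rule, an application-side "update" never modifies log1; instead, for every row the update would have touched, a row carrying the new values is inserted into log2. A sketch with a hypothetical order:

-- suppose an order for device 1 was created at '2017-01-01 00:00:00'    
update log1 set state=1, mod_time=now()    
  where did=1 and crt_time='2017-01-01 00:00:00';    
-- log1 is unchanged; the final-status row now lives in log2    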

Test heartbeat table import speed

Importing 1 million device rows takes about 1 second.

date +%F%T;psql -c "copy d to stdout"|psql -h 127.0.0.1 -p 15432 -U digoal postgres -c "copy d from stdin"; date +%F%T;    
    
2017-09-2319:42:22    
COPY 1000000    
2017-09-2319:42:23    

Test order writing speed

Note that all writes should be batched.

Batch writing runs at about 870,000 rows/s.

date +%F%T; psql -c "copy (select did,state,crt_time,mod_time from log) to stdout"|psql -h 127.0.0.1 -p 15432 -U digoal postgres -c "copy log1 from stdin"; date +%F%T;    
    
2017-09-2320:04:44    
COPY 378432001    
2017-09-2320:12:03    

Data import

psql -c "copy a to stdout"|psql -h 127.0.0.1 -p 15432 -U digoal postgres -c "copy a from stdin"    
psql -c "copy b to stdout"|psql -h 127.0.0.1 -p 15432 -U digoal postgres -c "copy b from stdin"    
psql -c "copy c to stdout"|psql -h 127.0.0.1 -p 15432 -U digoal postgres -c "copy c from stdin"    
# psql -c "copy d to stdout"|psql -h 127.0.0.1 -p 15432 -U digoal postgres -c "copy d from stdin"    
# psql -c "copy (select * from log) to stdout"|psql -h 127.0.0.1 -p 15432 -U digoal postgres -c "copy log1 from stdin"    

Pivot SQL tests

1. Overall pivot, 610 milliseconds.

select t1.cnt, t1.succ_cnt, t2.cnt, t2.succ_cnt from    
(    
  select count(*) cnt, sum(state) succ_cnt from log1 where crt_time between cdate() and cts(interval '0')    
) t1,    
(    
  select count(*) cnt, sum(state) succ_cnt from log1 where crt_time between cdate()-1 and cts(interval '1 day')    
) t2;    
    
  cnt   | succ_cnt |  cnt   | succ_cnt     
--------+----------+--------+----------    
 876301 |   788787 | 876300 |   788564    
(1 row)    
    
Time: 609.801 ms    

2. Category TOP, 219 milliseconds.

select c.id, count(*) cnt, sum(state) succ_cnt from c     
    join b on (c.id=b.cid)     
    join d on (b.id=d.bid)     
    join log1 on (d.id=log1.did)     
  where crt_time between cdate() and cts(interval '0')    
  group by c.id    
  order by cnt desc limit 10;    
    
 id | cnt  | succ_cnt     
----+------+----------    
 70 | 9220 |     8311    
 39 | 9197 |     8303    
 64 | 9096 |     8220    
 79 | 9034 |     8095    
 13 | 9033 |     8114    
 29 | 9033 |     8151    
 75 | 9033 |     8148    
  3 | 9005 |     8084    
 71 | 9002 |     8098    
 90 | 8974 |     8079    
(10 rows)    
    
Time: 218.695 ms    

3. My total sales (including all subordinates), 208 milliseconds.

First, a function that returns the given user's ID plus all subordinates:

create or replace function find_low(int) returns int[] as $$    
declare    
  res int[] := array[$1];    
  tmp int[] := res;    
begin    
  loop    
    select array_agg(id) into tmp from a where pid = any (tmp);    
    res := array_cat(res,tmp);    
    if tmp is null then    
      exit;    
    end if;    
  end loop;    
  return res;    
end;    
$$ language plpgsql strict;    
select count(*) cnt, sum(state) succ_cnt from     
(select unnest(find_low(31)) as id) as tmp     
  join b on (tmp.id=b.aid)    
  join d on (b.id=d.bid)    
  join log1 on (d.id=log1.did)    
  where crt_time between cdate() and cts(interval '0')    
  ;    
 cnt | succ_cnt     
-----+----------    
 342 |      312    
(1 row)    
    
Time: 208.585 ms    
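For reference, find_low returns the given user's ID plus all subordinate IDs; with the data initialized above, user 1 manages store salespeople 31 through 130 (an illustrative check):

select find_low(1);    
-- returns {1,31,32,...,130}: the user himself plus his 100 subordinates    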

4. My direct subordinate, TOP.

Greenplum does not yet support recursive syntax; a custom UDF implementation is needed.

5. All my subordinates (recursive), TOP.

Greenplum does not yet support recursive syntax; a custom UDF implementation is needed.

Greenplum Scheme 2

Like PostgreSQL Scheme 2, this materializes the mappings "device -> store, category, salesperson" and "salesperson -> salesperson plus all his superiors".

Preparation:

1. Add new columns

alter table log1 add column aid int;  
alter table log1 add column path text;  
alter table log1 add column cid int;  
alter table log1 add column bid int;  
  
alter table log2 add column aid int;  
alter table log2 add column path text;  
alter table log2 add column cid int;  
alter table log2 add column bid int;  

2. Modify the rule defined earlier: business updates (rows whose aid has not yet been backfilled, i.e. NEW.aid is null) are still converted to inserts into log2, while the batch backfill updates (which set aid) run as normal updates and are not converted.

drop rule r1 on log1;  
  
create rule r1 as on update to log1 where (NEW.aid is null) do instead insert into log2 values (NEW.*);    

Materialization

1. Materialized view 1: device -> store -> category -> salesperson

Create materialized view mv1:

create table mv1 (did int, bid int, cid int, aid int) distributed by (did);  
create index idx_mv1_did on mv1(did);  

Initialize materialized view mv1:

insert into mv1   
  select d.id as did, b.id as bid, c.id as cid, a.id as aid from d join b on (d.bid=b.id) join c on (b.cid=c.id) join a on (a.id=b.aid);  

Refresh materialized view mv1:

begin;  
update mv1 set bid=t1.bid , cid=t1.cid , aid=t1.aid  
  from   
  (  
    select d.id as did, b.id as bid, c.id as cid, a.id as aid from d join b on (d.bid=b.id) join c on (b.cid=c.id) join a on (a.id=b.aid)  
  ) t1  
where mv1.did=t1.did and (t1.bid<>mv1.bid or t1.cid<>mv1.cid or t1.aid<>mv1.aid);  
  
insert into mv1   
  select t1.* from  
  (  
    select d.id as did, b.id as bid, c.id as cid, a.id as aid from d join b on (d.bid=b.id) join c on (b.cid=c.id) join a on (a.id=b.aid)  
  ) t1  
  left join mv1 on (t1.did=mv1.did) where mv1.* is null;  
end;  
vacuum mv1;  

2. Materialized view 2: salesperson -> salesperson plus all his superiors

Create a function that returns a salesperson and all of his superiors:

create or replace function find_high(int) returns text as $$    
declare    
  res text := $1;    
  tmp text := res;    
begin    
  loop    
    select pid into tmp from a where id = tmp::int;    
    if tmp is null then    
      exit;    
    end if;    
    res := tmp||'.'||res;   
  end loop;    
  return res;    
end;    
$$ language plpgsql strict;    

Lacking recursive syntax we must call a function per row, and Greenplum's function-call efficiency is not high:

postgres=# select find_high(id) from generate_series(100,110) t(id);  
 find_high   
-----------  
 0.1.100  
 0.1.101  
 0.1.102  
 0.1.103  
 0.1.104  
 0.1.105  
 0.1.106  
 0.1.107  
 0.1.108  
 0.1.109  
 0.1.110  
(11 rows)  
Time: 1472.435 ms  
  
-- The same operation in PostgreSQL takes only 0.5 milliseconds:  
  
postgres=# select find_high(id) from generate_series(100,110) t(id);  
 find_high   
-----------  
 0.1.100  
 0.1.101  
 0.1.102  
 0.1.103  
 0.1.104  
 0.1.105  
 0.1.106  
 0.1.107  
 0.1.108  
 0.1.109  
 0.1.110  
(11 rows)  
Time: 0.524 ms  

Verification

postgres=# select find_high(1);  
 find_high   
-----------  
 0.1  
(1 row)  
  
postgres=# select find_high(0);  
 find_high   
-----------  
 0  
(1 row)  
  
postgres=# select find_high(100);  
 find_high   
-----------  
 0.1.100  
(1 row)  

Create materialized view mv2

create table mv2 (aid int, path text) distributed by (aid);  
create index idx_mv2_did on mv2(aid);  

Initialize and refresh materialized view mv2

-- GP simply does not support an operation like this: insert into mv2 select id, find_high(id) from a;  
  
postgres=# select id, find_high(id) from a;  
ERROR:  function cannot execute on segment because it accesses relation "postgres.a" (functions.c:155)  (seg1 slice1 tb2a07543.sqa.tbc:25433 pid=106586) (cdbdisp.c:1328)  
DETAIL:    
SQL statement "select pid from a where id =  $1 "  
PL/pgSQL function "find_high" line 7 at SQL statement  

Create function

create or replace function refresh_mv2() returns void as $$  
declare  
  aid int[];  
begin  
  select array_agg(id) into aid from a;  
  delete from mv2;  
  insert into mv2 select id, find_high(id) from unnest(aid) t(id);  
end;  
$$ language plpgsql strict;  

Call the function to refresh mv2; the time taken is basically unacceptable.

select refresh_mv2();  

PS: the recommended approach is to generate this employee-tree data in the application and then load it into GPDB (there are only 3,001 rows). Alternatively, generate it in PostgreSQL, where it is trivial; see the sketch below.
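For example, since mv1 in the PostgreSQL scheme already materializes exactly these employee paths, the data can be generated in PostgreSQL and piped into Greenplum with the same copy pipeline used above (a sketch reusing this article's connection parameters):

psql -c "copy (select id, path from mv1) to stdout"|psql -h 127.0.0.1 -p 15432 -U digoal postgres -c "copy mv2 from stdin"    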

Backfill the orders

As a scheduled task, update in batches:

update log1 set aid=t1.aid, path=t1.path, cid=t1.cid, bid=t1.bid  
from  
(  
select did, bid, cid, mv1.aid, mv2.path from mv1 join mv2 on (mv1.aid=mv2.aid)  
) t1  
where log1.did=t1.did and log1.aid is null;  
  
UPDATE 378432001  
  
  
update log2 set aid=t1.aid, path=t1.path, cid=t1.cid, bid=t1.bid  
from  
(  
select did, bid, cid, mv1.aid, mv2.path from mv1 join mv2 on (mv1.aid=mv2.aid)  
) t1  
where log2.did=t1.did and log2.aid is null;  
  
UPDATE 378432001  

Pivot queries

1. Overall pivot, 205 milliseconds.

select t1.cnt, t1.succ_cnt, t2.cnt, t2.succ_cnt from    
(    
  select count(*) cnt, sum(state) succ_cnt from log1 where crt_time between cdate() and cts(interval '0')    
) t1,    
(    
  select count(*) cnt, sum(state) succ_cnt from log1 where crt_time between cdate()-1 and cts(interval '1 day')    
) t2;   
  
  cnt   | succ_cnt |  cnt   | succ_cnt   
--------+----------+--------+----------  
 480228 |   432151 | 480228 |   432205  
(1 row)  
  
Time: 205.436 ms  

2. Category TOP, 254 milliseconds.

select c.id, count(*) cnt, sum(state) succ_cnt from c     
    join b on (c.id=b.cid)     
    join d on (b.id=d.bid)     
    join log1 on (d.id=log1.did)     
  where crt_time between cdate() and cts(interval '0')    
  group by c.id    
  order by cnt desc limit 10;    
 id | cnt  | succ_cnt   
----+------+----------  
 64 | 5052 |     4555  
 29 | 4986 |     4483  
 34 | 4982 |     4509  
 70 | 4968 |     4466  
 71 | 4964 |     4491  
  5 | 4953 |     4474  
 79 | 4937 |     4454  
 63 | 4936 |     4420  
 66 | 4934 |     4436  
 18 | 4922 |     4417  
(10 rows)  
  
Time: 254.007 ms  

3. My total sales (including all subordinates), 110 milliseconds.

select count(*) cnt, sum(state) succ_cnt from log1     
  where crt_time between cdate() and cts(interval '0')    
  and (path like '1.%' or path like '%.1' or path like '%.1.%')    -- total sales for USER ID = 1 (including all subordinates)        
  ;   
  cnt  | succ_cnt   
-------+----------  
 16605 |    14964  
(1 row)  
  
Time: 110.396 ms  

4. My direct subordinates, TOP.

Boss's view, 180 milliseconds.

set escape_string_warning TO off;  
  
select substring(path, '\.?(0\.?[0-9]*)'),                       -- direct subordinates of USER ID = 0; replace 0 with the logged-in user's ID    
  count(*) cnt, sum(state) succ_cnt from log1     
  where crt_time between cdate() and cts(interval '0')    
  and (path like '0.%' or path like '%.0' or path like '%.0.%')  -- USER ID = 0; replace with the logged-in user's ID    
  group by 1                                                     -- group by the first output column    
  order by cnt desc limit 10    
;    
  
 substring |  cnt  | succ_cnt   
-----------+-------+----------  
 0.3       | 17014 |    15214  
 0.15      | 17006 |    15285  
 0.11      | 16958 |    15285  
 0.22      | 16901 |    15231  
 0.19      | 16887 |    15217  
 0.21      | 16861 |    15160  
 0.6       | 16841 |    15075  
 0.9       | 16831 |    15123  
 0.26      | 16787 |    15060  
 0.14      | 16777 |    15048  
(10 rows)  
  
Time: 179.950 ms  

Level-1 sales manager's view, 176 milliseconds

select substring(path, '\.?(1\.?[0-9]*)'),                       -- direct subordinates of USER ID = 1; replace 1 with the logged-in user's ID    
  count(*) cnt, sum(state) succ_cnt from log1     
  where crt_time between cdate() and cts(interval '0')    
  and (path like '1.%' or path like '%.1' or path like '%.1.%')  -- USER ID = 1; replace with the logged-in user's ID    
  group by 1                                                     -- group by the first output column    
  order by cnt desc limit 10    
;    
  
 substring | cnt | succ_cnt   
-----------+-----+----------  
 1.120     | 222 |      202  
 1.54      | 218 |      193  
 1.92      | 217 |      192  
 1.51      | 209 |      187  
 1.93      | 206 |      181  
 1.53      | 203 |      182  
 1.59      | 203 |      187  
 1.37      | 202 |      188  
 1.82      | 197 |      177  
 1.66      | 196 |      180  
(10 rows)  
  
Time: 176.298 ms  

5. All my subordinates (recursive), TOP.

Boss's view (TOP across all leaf salespeople), 155 milliseconds.

select path,                                                      -- all subordinates (recursive)    
  count(*) cnt, sum(state) succ_cnt from log1     
  where crt_time between cdate() and cts(interval '0')    
  and (path like '0.%' or path like '%.0' or path like '%.0.%')   -- USER ID = 0; replace with the logged-in user's ID    
  group by 1                                                      -- group by the first output column    
  order by cnt desc limit 10    
;    
    
   path    | cnt | succ_cnt   
-----------+-----+----------  
 0.5.482   | 261 |      229  
 0.28.2796 | 248 |      229  
 0.24.2348 | 242 |      225  
 0.13.1318 | 240 |      213  
 0.21.2093 | 237 |      211  
 0.26.2557 | 235 |      210  
 0.4.346   | 233 |      205  
 0.30.2935 | 231 |      214  
 0.14.1332 | 229 |      205  
 0.26.2620 | 229 |      204  
(10 rows)  
  
Time: 155.268 ms    

Level-1 sales manager's view, 151 milliseconds

select path,                                                      -- all subordinates (recursive)    
  count(*) cnt, sum(state) succ_cnt from log1     
  where crt_time between cdate() and cts(interval '0')    
  and (path like '1.%' or path like '%.1' or path like '%.1.%')   -- USER ID = 1; replace with the logged-in user's ID    
  group by 1                                                      -- group by the first output column    
  order by cnt desc limit 10    
;    
    
  path   | cnt | succ_cnt   
---------+-----+----------  
 0.1.120 | 222 |      202  
 0.1.92  | 218 |      193  
 0.1.54  | 218 |      193  
 0.1.51  | 209 |      187  
 0.1.93  | 207 |      182  
 0.1.59  | 204 |      187  
 0.1.53  | 203 |      182  
 0.1.37  | 202 |      188  
 0.1.82  | 198 |      178  
 0.1.66  | 196 |      180  
(10 rows)  
  
Time: 150.883 ms  

8. Summary of Greenplum

1. Greenplum usage requires attention to data skew; for guidance on choosing distribution keys, see:

"Causes of and Solutions to Data Skew in a Distributed DB (Greenplum) - Aliyun HybridDB for PostgreSQL Best Practices"

2. Greenplum does not yet support recursive syntax, so UDFs are needed for operations such as finding all subordinates or all superiors.

3. The key to Greenplum Scheme 2 is the materialized views and the backfill (in fact, even leaving the orders un-backfilled is fine): as long as a table mapping device number -> store -> category plus the employee hierarchy is generated, queries become much more convenient.

4. Greenplum's DELETE and UPDATE operations lock the whole table and block other sessions' INSERT, DELETE, and UPDATE operations on that table (queries are not blocked). Special attention is needed.

5. Order backfilling is done with batch updates.

9. Summary

For this example, PostgreSQL 10 is recommended (especially since, when the volume grows toward 100 TB in the future, migrating to PolarDB for PostgreSQL will be particularly convenient and fully compatible). Performance-wise, both TP and AP meet the requirements; functionally it fully meets the needs, with many features that enhance the experience:

If you do want to use Greenplum (HybridDB for PostgreSQL), a design similar to PostgreSQL 10 Scheme 2 is still recommended (order backfilling implemented with rules or batch updates).

1. Recursive queries, used to retrieve tree-structured data such as the employee hierarchy.

2. Parallel queries, which harness multiple CPU cores to accelerate a query, like unleashing an ultimate move in a game.

3. JOIN methods, including hash, merge, and nested-loop joins, handling arbitrarily complex joins.

4. Inheritance (partitioned tables): orders are partitioned by time.

5. Triggers, used to backfill orders automatically.

6. Composite types, holding the backfilled "device -> store -> category" and employee-hierarchy information.

7. The ltree tree type, storing the backfilled employee reporting chain.

https://www.postgresql.org/docs/9.6/static/ltree.html

8. Materialized views, used for the employee-hierarchy and device lookups: a one-line refresh spares the business from complex personnel-change logic and makes the pivot queries easy to write.

9. Regular expressions, used for matching within ltree paths, e.g. grouping by direct subordinate or by the logged-in user.

10. And many features not used in this scheme (such as SQL stream computing, oss_ext object-storage external tables, etc.).

Coming up next, Aliyun will launch PolarDB for PostgreSQL: 100 TB scale, shared storage, one-writer-many-readers architecture, in the vein of AWS Aurora and Oracle RAC.

11. Real-time pivot query performance comparison of the schemes above (same hardware, 32 cores):

Scheme                  Use case                                      Response time
----------------------  --------------------------------------------  ---------------------------------
PostgreSQL 10 Scheme 1  overall pivot                                 77 ms
PostgreSQL 10 Scheme 1  category TOP                                  446 ms
PostgreSQL 10 Scheme 1  my total sales (including all subordinates)   464 ms
PostgreSQL 10 Scheme 1  my direct subordinates, TOP                   2.6 s
PostgreSQL 10 Scheme 1  all my subordinates (recursive), TOP          642 ms
PostgreSQL 10 Scheme 2  overall pivot                                 74 ms
PostgreSQL 10 Scheme 2  category TOP                                  41 ms
PostgreSQL 10 Scheme 2  my total sales (including all subordinates)   41 ms
PostgreSQL 10 Scheme 2  my direct subordinates, TOP                   41 ms
PostgreSQL 10 Scheme 2  all my subordinates (recursive), TOP          41 ms
Greenplum Scheme 1      overall pivot                                 610 ms
Greenplum Scheme 1      category TOP                                  219 ms
Greenplum Scheme 1      my total sales (including all subordinates)   208 ms
Greenplum Scheme 1      my direct subordinates, TOP                   not tested (no recursion support)
Greenplum Scheme 1      all my subordinates (recursive), TOP          not tested (no recursion support)
Greenplum Scheme 2      overall pivot                                 205 ms
Greenplum Scheme 2      category TOP                                  254 ms
Greenplum Scheme 2      my total sales (including all subordinates)   110 ms
Greenplum Scheme 2      my direct subordinates, TOP                   176 ms
Greenplum Scheme 2      all my subordinates (recursive), TOP          151 ms

12. For the differences between the Greenplum and PostgreSQL products and how to choose between them, see:

"Space | Time | Object Circle People + Perspective - and PostgreSQL 10 and Greenplum's Contrast and Selection"

See in particular the chapter on the features of the Greenplum and PostgreSQL products and selection guidance.
