PostgreSQL Scenario and Performance Testing for HTAP Database 18 - (OLAP) User Portrait Circle Scenario - Array Contains Query and Aggregation

Keywords: github PostgreSQL Database SQL

Label

PostgreSQL, HTAP, OLTP, OLAP, Scenario and Performance Testing

background

PostgreSQL is a database with a long history, which can be traced back to 1973. It was first awarded by the 2014 Computer Turing Prize winner and the originator of relational database. Michael_Stonebraker PostgreSQL has similar functions, performance, architecture and stability as Oracle.

The PostgreSQL community has many contributors, from all walks of life around the world. Over the years, PostgreSQL releases a large version every year, known for its enduring vitality and stability.

In October 2017, PostgreSQL released version 10, which carries many amazing features. The goal is to meet the requirements of mixed HTAP scenarios of OLAP and OLTP:

PostgreSQL 10 features, the most popular HTAP database for developers

1. Multi-core Parallel Enhancement

2. fdw aggregation push-down

3. Logical Subscription

4, zoning

5. Multi-copies at the financial level

6. json, jsonb full-text retrieval

7. There are also some features of plug-in form, such as vector computing, JIT, SQL graph computing, SQL stream computing, distributed parallel computing, time series processing, gene sequencing, chemical analysis, image analysis, etc.

PostgreSQL can be seen in various application scenarios:

PostgreSQL has developed very rapidly in recent years. From the database scoring trend of dbranking, a well-known database evaluation website, we can see the upward trend of PostgreSQL:

From the annual community meetings held in PostgreSQL China, we can see the same trend. More and more companies are participating, more and more companies are sharing, and more and more subjects are shared. They span traditional enterprises, Internet, medical, financial, state-owned enterprises, logistics, e-commerce, social networking, vehicle networking, sharing XX, cloud, game, public transport, aviation, railway, military, training. Training, consulting services, etc.

The next series of articles will introduce various application scenarios and corresponding performance indicators of PostgreSQL.

Environmental Science

Reference to Environmental Deployment Method:

PostgreSQL 10 + PostGIS + Sharing (pg_pathman) + MySQL (fdw external table) on ECS Deployment Guide (for new users)

Aliyun ECS: 56 core, 224G, 1.5TB*2 SSD cloud disk.

Operating System: CentOS 7.4 x64

Database version: PostgreSQL 10

PS: The CPU and IO performance of ECS is discounted by physical opportunity, and can be estimated by one-fold performance decline. Running physical mainframe can be estimated by multiplying the performance tested here by 2.

Scene - user portrait circle scene - array contains query and aggregation (OLAP)

1, background

Array is a multi-valued type of PostgreSQL, which can store many similar elements. When designing a business system, you can use arrays to store tags, aggregate attributes, and so on.

For example, in a user portrait system, an array is used to store the user's label. When a group of users need to be selected according to the label combination circle, the selected records are filtered by means of inclusion and intersection of arrays.

Contains all tags that represent the inclusion of the target condition.

Intersection represents any tag that contains the target condition.

2, design

100 million records, each record contains 16 labels, the value range of labels is 10,000. It also contains three attribute fields for perspective.

3. Preparing test sheets

create table t_arr_label(
  id int,
  c1 int,
  c2 int,
  c3 int,
  label int[]
);

4. Preparing test functions (optional)

In some ranges, an array containing several random values is generated

create or replace function gen_rand_arr(int,int) returns int[] as $$    
  select array_agg((random()*$1)::int) from generate_series(1,$2);    
$$ language sql strict;

The test search contains records of several elements, carries out perspective, and outputs perspective results.

create or replace function f_test () returns setof record as $$  
declare  
  varr int[];  
begin  
  -- Generate a random array  (Contains any three Tags)   
  select gen_rand_arr(10000, 3) into varr;  
  -- According to the label, the data is filtered and the perspective output is made.
  return query select c1,c2,c3,count(*) from t_arr_label where label @> varr group by grouping sets ((c1),(c2),(c3)); 
end;  
$$ language plpgsql strict;

5. Preparing test data

insert into t_arr_label select id, random()*100, random()*10, random()*2, gen_rand_arr(10000, 16) from generate_series(1,100000000) t(id);

create index idx_t_arr_label on t_arr_label using gin (label);

6. Preparing test scripts

vi test.sql  
  
select * from f_test() as t(c1 int, c2 int, c3 int ,cnt int8);

7, test

Single similar query efficiency, response time is less than 20 milliseconds. (With bound variables and CACHE hits, the response time is lower. )

postgres=# select c1,c2,c3,count(*) from t_arr_label where label @> '{1,2}' group by grouping sets ((c1),(c2),(c3));
 c1  | c2 | c3 | count 
-----+----+----+-------
   1 |    |    |     6
   4 |    |    |     1
   6 |    |    |     2
   8 |    |    |     1
.............
  98 |    |    |     3
  99 |    |    |     1
 100 |    |    |     2
     |    |  0 |    62
     |    |  1 |   149
     |    |  2 |    53
     |  0 |    |     9
     |  1 |    |    22
     |  2 |    |    30
     |  3 |    |    26
     |  4 |    |    22
     |  5 |    |    23
     |  6 |    |    34
     |  7 |    |    22
     |  8 |    |    31
     |  9 |    |    33
     | 10 |    |    12
(102 rows)

Time: 16.050 ms



postgres=# explain (analyze,verbose,timing,costs,buffers) select c1,c2,c3,count(*) from t_arr_label where label @> '{1,2}' group by grouping sets ((c1),(c2),(c3));
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=111.75..121.83 rows=222 width=20) (actual time=15.402..15.590 rows=102 loops=1)
   Output: c1, c2, c3, count(*)
   Group Key: t_arr_label.c1
   Sort Key: t_arr_label.c3
     Group Key: t_arr_label.c3
   Sort Key: t_arr_label.c2
     Group Key: t_arr_label.c2
   Buffers: shared hit=419
   ->  Sort  (cost=111.75..111.97 rows=90 width=12) (actual time=15.394..15.422 rows=264 loops=1)
         Output: c1, c2, c3
         Sort Key: t_arr_label.c1
         Sort Method: quicksort  Memory: 37kB
         Buffers: shared hit=419
         ->  Bitmap Heap Scan on public.t_arr_label  (cost=17.70..108.82 rows=90 width=12) (actual time=14.711..15.327 rows=264 loops=1)
               Output: c1, c2, c3
               Recheck Cond: (t_arr_label.label @> '{1,2}'::integer[])
               Heap Blocks: exact=264
               Buffers: shared hit=419
               ->  Bitmap Index Scan on idx_t_arr_label  (cost=0.00..17.68 rows=90 width=0) (actual time=14.676..14.676 rows=264 loops=1)
                     Index Cond: (t_arr_label.label @> '{1,2}'::integer[])
                     Buffers: shared hit=155
 Planning time: 0.133 ms
 Execution time: 15.653 ms
(23 rows)

Time: 16.217 ms

Pressure measurement

CONNECTS=56  
TIMES=300  
export PGHOST=$PGDATA  
export PGPORT=1999  
export PGUSER=postgres  
export PGPASSWORD=postgres  
export PGDATABASE=postgres  
  
pgbench -M prepared -n -r -f ./test.sql -P 5 -c $CONNECTS -j $CONNECTS -T $TIMES

8. Test results

transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 56
number of threads: 56
duration: 300 s
number of transactions actually processed: 532217
latency average = 31.565 ms
latency stddev = 5.183 ms
tps = 1773.127087 (including connections establishing)
tps = 1773.172254 (excluding connections establishing)
script statistics:
 - statement latencies in milliseconds:
        31.565  select * from f_test() as t(c1 int, c2 int, c3 int ,cnt int8);