Tags
PostgreSQL, HTAP, OLTP, OLAP, Scenario and Performance Testing
Background
PostgreSQL is a database with a long history that can be traced back to 1973. It was originally created by Michael Stonebraker, winner of the 2014 ACM Turing Award and a pioneer of relational databases. PostgreSQL is comparable to Oracle in functionality, performance, architecture, and stability.
The PostgreSQL community has many contributors from all walks of life around the world. For many years PostgreSQL has released one major version per year, and it is known for its enduring vitality and stability.
In October 2017, PostgreSQL released version 10, which brings many impressive features aimed at mixed HTAP workloads combining OLTP and OLAP:
PostgreSQL 10 features - the HTAP database most popular with developers
1. Multi-core parallelism enhancements
2. FDW aggregate push-down
3. Logical replication (publish/subscribe)
4. Partitioning
5. Financial-grade multi-replica support
6. Full-text search over json and jsonb
7. Plus many features delivered as extensions, such as vector computing, JIT, SQL graph computing, SQL stream computing, distributed parallel computing, time-series processing, gene sequencing, chemical analysis, image analysis, etc.
PostgreSQL can be seen in various application scenarios:
PostgreSQL has grown very rapidly in recent years. The score trend on dbranking, a well-known database ranking website, shows PostgreSQL's upward trajectory:
The annual PostgreSQL China community conferences show the same trend: more companies are participating, more companies are presenting, and the topics shared keep broadening. They span traditional enterprises, the Internet, healthcare, finance, state-owned enterprises, logistics, e-commerce, social networking, the Internet of Vehicles, the sharing economy, cloud, gaming, public transport, aviation, railways, the military, training, consulting services, and more.
This series of articles introduces the various application scenarios of PostgreSQL and their corresponding performance metrics.
Environment
Environment deployment reference:
Alibaba Cloud ECS: 56 cores, 224 GB RAM, 2 x 1.5 TB SSD cloud disks.
Operating System: CentOS 7.4 x64
Database version: PostgreSQL 10
Note: ECS CPU and I/O performance is discounted relative to a physical machine; assume roughly half the performance of physical hardware. On a physical server, the numbers measured here can be estimated by multiplying by 2.
Scenario - user-profile audience selection - filtering and pivoting on arbitrary combinations of fields (OLAP)
1. Background
A user-profile table has many fields representing different kinds of label attributes. Audience selection must filter on arbitrary combinations of these fields and then run pivot (group-by) analysis over the selected population.
PostgreSQL offers three ways to implement arbitrary-combination filtering across many fields:
1. Bloom index, which supports equality queries on any combination of fields.
2. Bitmap scan over multiple indexes
A bitmap scan can be driven either by a GIN composite index or by multiple single-column b-tree indexes.
When several conditions are supplied, the per-index bitmaps are combined to converge on a small set of data blocks, which are then read sequentially and rechecked with a FILTER. A minimal sketch of this approach with single-column b-tree indexes follows this list.
See "PostgreSQL bitmapAnd, bitmapOr, bitmap index scan, bitmap heap scan".
3. Bitmap scan over a GIN composite index
When several conditions are supplied, the index converges on a small set of data blocks, which are then read sequentially and rechecked with a FILTER.
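Approach 2 is not exercised in the benchmark below (which tests the Bloom and GIN indexes), so here is a minimal, hedged sketch of how multiple single-column b-tree indexes can be combined by the planner through BitmapAnd; the index names are illustrative, not from the original.
-- Illustrative only: one b-tree index per label column (shown for c4..c6).
create index idx_t_multi_col_c4 on t_multi_col (c4);
create index idx_t_multi_col_c5 on t_multi_col (c5);
create index idx_t_multi_col_c6 on t_multi_col (c6);

-- With several equality conditions, the planner can BitmapAnd the
-- per-index bitmaps before touching the heap:
explain (analyze, buffers)
select c1, c2, c3, count(*)
  from t_multi_col
 where c4 = 3 and c5 = 2 and c6 = 1
 group by grouping sets ((c1), (c2), (c3));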
2. Design
100 million records; each record contains 32 label fields, each with a value range of 10,000, plus three attribute fields used for pivoting.
3. Preparing the test table
do language plpgsql $$
declare
  sql text;
begin
  sql := 'create table t_multi_col (id int8, c1 int default random()*100, c2 int default random()*10, c3 int default random()*10, ';
  for i in 4..35 loop
    sql := sql||'c'||i||' int default random()*10000,';
  end loop;
  sql := rtrim(sql, ',');
  sql := sql||')';
  execute sql;
end;
$$;
4. Preparing test functions (optional)
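The original leaves this optional step empty. As a hedged sketch, one possible helper (the name gen_test_data and the batch logic are illustrative, not from the original) wraps the insert used in the next step so that data can be loaded in id ranges, for example from several parallel sessions:
-- Hypothetical helper, not from the original: inserts ids in [s, e] and
-- lets the column defaults fill in the random label values.
create or replace function gen_test_data(s int8, e int8) returns void as $$
begin
  insert into t_multi_col (id) select generate_series(s, e);
end;
$$ language plpgsql strict;

-- Example: load ids 1..1000000 in one batch.
-- select gen_test_data(1, 1000000);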
5. Preparing test data
insert into t_multi_col (id) select generate_series(1,100000000);
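Optionally (this step is not in the original), refreshing planner statistics after the bulk load and before building the indexes keeps the cost estimates in the EXPLAIN output below realistic:
-- Optional housekeeping after the bulk load
vacuum (analyze) t_multi_col;

-- Quick sanity check on the generated label distribution
select min(c4), max(c4), count(*) from t_multi_col;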
1. Bloom Index
create extension bloom;

do language plpgsql $$
declare
  sql text;
begin
  sql := 'create index idx_t_multi_col on t_multi_col using bloom (';
  for i in 4..35 loop
    sql := sql||'c'||i||',';
  end loop;
  sql := rtrim(sql, ',');
  sql := sql||') with (length=80, ';
  for i in 1..32 loop
    sql := sql||'col'||i||'=2,';
  end loop;
  sql := rtrim(sql, ',');
  sql := sql||')';
  execute sql;
end;
$$;
2. GIN Index
create extension btree_gin;

do language plpgsql $$
declare
  sql text;
begin
  sql := 'create index idx_t_multi_col_gin on t_multi_col using gin (';
  for i in 4..35 loop
    sql := sql||'c'||i||',';
  end loop;
  sql := rtrim(sql, ',');
  sql := sql||')';
  execute sql;
end;
$$;
6. Preparing test scripts
vi test.sql

\set a4 random(1,10000)
\set a5 random(1,10000)
\set a6 random(1,10000)
\set a7 random(1,10000)
\set a8 random(1,10000)
\set a9 random(1,10000)
select c1,c2,c3,count(*) from t_multi_col where c4=:a4 and c5=:a5 and c6=:a6 and c7=:a7 and c8=:a8 and c9=:a9 group by grouping sets ((c1),(c2),(c3));
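For reference, GROUPING SETS ((c1),(c2),(c3)) computes the three single-column aggregates over the same filtered population in a single pass. A rough equivalent without grouping sets (for illustration only, reusing the same :a4..:a9 parameters) would be:
-- Equivalent but slower formulation without grouping sets:
-- three aggregates over the same filtered population.
select c1, null::int as c2, null::int as c3, count(*) from t_multi_col
 where c4=:a4 and c5=:a5 and c6=:a6 and c7=:a7 and c8=:a8 and c9=:a9
 group by c1
union all
select null, c2, null, count(*) from t_multi_col
 where c4=:a4 and c5=:a5 and c6=:a6 and c7=:a7 and c8=:a8 and c9=:a9
 group by c2
union all
select null, null, c3, count(*) from t_multi_col
 where c4=:a4 and c5=:a5 and c6=:a6 and c7=:a7 and c8=:a8 and c9=:a9
 group by c3;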
7. Test
1. The Bloom index takes longer because the entire index must be scanned: about 500 milliseconds.
postgres=# explain (analyze,verbose,timing,costs,buffers) select c1,c2,c3,count(*) from t_multi_col where c4=3 and c5=2 and c6=1 and c7=4 and c8=5 and c9=6 and c10=1 and c11=1 and c12=1 group by grouping sets ((c1),(c2),(c3));
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2985297.24..2985297.28 rows=3 width=20) (actual time=499.961..499.961 rows=0 loops=1)
   Output: c1, c2, c3, count(*)
   Hash Key: t_multi_col.c1
   Hash Key: t_multi_col.c2
   Hash Key: t_multi_col.c3
   Buffers: shared hit=197418
   ->  Bitmap Heap Scan on public.t_multi_col  (cost=2985296.00..2985297.23 rows=1 width=12) (actual time=499.958..499.958 rows=0 loops=1)
         Output: id, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26, c27, c28, c29, c30, c31, c32, c33, c34, c35
         Recheck Cond: ((t_multi_col.c4 = 3) AND (t_multi_col.c5 = 2) AND (t_multi_col.c6 = 1) AND (t_multi_col.c7 = 4) AND (t_multi_col.c8 = 5) AND (t_multi_col.c9 = 6) AND (t_multi_col.c10 = 1) AND (t_multi_col.c11 = 1) AND (t_multi_col.c12 = 1))
         Rows Removed by Index Recheck: 1339
         Heap Blocks: exact=1339
         Buffers: shared hit=197418
         ->  Bitmap Index Scan on idx_t_multi_col  (cost=0.00..2985296.00 rows=1 width=0) (actual time=497.718..497.718 rows=1339 loops=1)
               Index Cond: ((t_multi_col.c4 = 3) AND (t_multi_col.c5 = 2) AND (t_multi_col.c6 = 1) AND (t_multi_col.c7 = 4) AND (t_multi_col.c8 = 5) AND (t_multi_col.c9 = 6) AND (t_multi_col.c10 = 1) AND (t_multi_col.c11 = 1) AND (t_multi_col.c12 = 1))
               Buffers: shared hit=196079
 Planning time: 0.165 ms
 Execution time: 500.025 ms
(17 rows)
2. The GIN index locates the matching entries precisely and takes less than 2 milliseconds.
postgres=# explain (analyze,verbose,timing,costs,buffers) select c1,c2,c3,count(*) from t_multi_col where c4=3 and c5=2 and c6=1 and c7=4 and c8=5 and c9=6 and c10=1 and c11=1 and c12=1 group by grouping sets ((c1),(c2),(c3));
                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=69.64..69.68 rows=3 width=20) (actual time=1.151..1.151 rows=0 loops=1)
   Output: c1, c2, c3, count(*)
   Hash Key: t_multi_col.c1
   Hash Key: t_multi_col.c2
   Hash Key: t_multi_col.c3
   Buffers: shared hit=69
   ->  Bitmap Heap Scan on public.t_multi_col  (cost=68.40..69.63 rows=1 width=12) (actual time=1.149..1.149 rows=0 loops=1)
         Output: id, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26, c27, c28, c29, c30, c31, c32, c33, c34, c35
         Recheck Cond: ((t_multi_col.c4 = 3) AND (t_multi_col.c5 = 2) AND (t_multi_col.c6 = 1) AND (t_multi_col.c7 = 4) AND (t_multi_col.c8 = 5) AND (t_multi_col.c9 = 6) AND (t_multi_col.c10 = 1) AND (t_multi_col.c11 = 1) AND (t_multi_col.c12 = 1))
         Buffers: shared hit=69
         ->  Bitmap Index Scan on idx_t_multi_col_gin  (cost=0.00..68.40 rows=1 width=0) (actual time=1.146..1.146 rows=0 loops=1)
               Index Cond: ((t_multi_col.c4 = 3) AND (t_multi_col.c5 = 2) AND (t_multi_col.c6 = 1) AND (t_multi_col.c7 = 4) AND (t_multi_col.c8 = 5) AND (t_multi_col.c9 = 6) AND (t_multi_col.c10 = 1) AND (t_multi_col.c11 = 1) AND (t_multi_col.c12 = 1))
               Buffers: shared hit=69
 Planning time: 0.263 ms
 Execution time: 1.245 ms
(15 rows)
Stress test
CONNECTS=56
TIMES=300
export PGHOST=$PGDATA
export PGPORT=1999
export PGUSER=postgres
export PGPASSWORD=postgres
export PGDATABASE=postgres

pgbench -M prepared -n -r -f ./test.sql -P 5 -c $CONNECTS -j $CONNECTS -T $TIMES
8. Test results
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 56
number of threads: 56
duration: 300 s
number of transactions actually processed: 10740407
latency average = 1.564 ms
latency stddev = 0.561 ms
tps = 35796.375710 (including connections establishing)
tps = 35800.169989 (excluding connections establishing)
script statistics:
 - statement latencies in milliseconds:
         0.002  \set a4 random(1,10000)
         0.000  \set a5 random(1,10000)
         0.000  \set a6 random(1,10000)
         0.000  \set a7 random(1,10000)
         0.000  \set a8 random(1,10000)
         0.000  \set a9 random(1,10000)
         1.562  select c1,c2,c3,count(*) from t_multi_col where c4=:a4 and c5=:a5 and c6=:a6 and c7=:a7 and c8=:a8 and c9=:a9 group by grouping sets ((c1),(c2),(c3));
TPS: 35800
Average response time: 1.564 milliseconds
In fact, besides bitmap scans there is also a storage-layer optimization. The PostgreSQL core engine currently uses row storage, while column storage is available through extensions. Column storage can reduce the number of data blocks that must be scanned and improve performance further.
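As a hedged illustration (not part of the original test), one way to experiment with columnar storage is the third-party cstore_fdw extension; the table definition below is only a sketch and assumes cstore_fdw is installed.
-- Sketch: a columnar copy of the label table using the cstore_fdw
-- extension (third party; install separately).
create extension cstore_fdw;
create server cstore_server foreign data wrapper cstore_fdw;

create foreign table t_multi_col_cstore (
  id int8,
  c1 int, c2 int, c3 int,
  c4 int, c5 int  -- remaining label columns c6..c35 omitted for brevity
) server cstore_server
  options (compression 'pglz');

-- Populate with COPY (or INSERT ... SELECT in newer cstore_fdw versions),
-- then query it like any other table.
-- insert into t_multi_col_cstore select id, c1, c2, c3, c4, c5 from t_multi_col;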
References
PostgreSQL, Greenplum Application Case Book "Tathagata Palm" - Catalogue
"Database Selection - Elephant Eighteen Touches - To Architects and Developers"
PostgreSQL: using pgbench to test sysbench-related cases