Tags
PostgreSQL, HTAP, OLTP, OLAP, Scenario and Performance Testing
Background
PostgreSQL is a database with a long history that can be traced back to 1973. It was originally created by Michael Stonebraker, winner of the 2014 ACM Turing Award and a pioneer of relational databases. PostgreSQL is comparable to Oracle in functionality, performance, architecture, and stability.
The PostgreSQL community has many contributors from all walks of life around the world. For many years PostgreSQL has released one major version per year, and it is known for its enduring vitality and stability.
In October 2017, PostgreSQL released version 10, which brings many impressive features aimed at mixed HTAP workloads combining OLTP and OLAP:
PostgreSQL 10 features - the HTAP database most popular with developers
1. Multi-core parallelism enhancements
2. FDW aggregate push-down
3. Logical replication (publish/subscribe)
4. Partitioning
5. Financial-grade multi-replica support
6. Full-text search over json and jsonb
7. Plus many features delivered as extensions, such as vector computing, JIT, SQL graph computing, SQL stream computing, distributed parallel computing, time-series processing, gene sequencing, chemical analysis, image analysis, etc.
PostgreSQL can be seen in various application scenarios:
PostgreSQL has grown very rapidly in recent years. The score trend on dbranking, a well-known database ranking website, shows PostgreSQL's upward trajectory:
The annual PostgreSQL China community conferences show the same trend: more companies are participating, more companies are presenting, and the topics shared keep broadening. They span traditional enterprises, the Internet, healthcare, finance, state-owned enterprises, logistics, e-commerce, social networking, the Internet of Vehicles, the sharing economy, cloud, gaming, public transport, aviation, railways, the military, training, consulting services, and more.
This series of articles introduces the various application scenarios of PostgreSQL and their corresponding performance metrics.
Environment
Environment deployment reference:
Alibaba Cloud ECS: 56 cores, 224 GB RAM, 2 x 1.5 TB SSD cloud disks.
Operating System: CentOS 7.4 x64
Database version: PostgreSQL 10
Note: ECS CPU and I/O performance is discounted relative to a physical machine; assume roughly half the performance of physical hardware. On a physical server, the numbers measured here can be estimated by multiplying by 2.
Scenario - user-profile audience selection - filtering and pivoting on arbitrary combinations of fields (OLAP)
1. Background
A user-profile table has many fields representing different kinds of label attributes. Audience selection must filter on arbitrary combinations of these fields and then run pivot (group-by) analysis over the selected population.
PostgreSQL offers three ways to implement arbitrary-combination filtering across many fields:
1. Bloom index, which supports equality queries on any combination of fields.
2. Bitmap scan over multiple indexes
A bitmap scan can be driven either by a GIN composite index or by multiple single-column b-tree indexes.
When several conditions are supplied, the per-index bitmaps are combined to converge on a small set of data blocks, which are then read sequentially and rechecked with a FILTER. A minimal sketch of this approach with single-column b-tree indexes follows this list.
See "PostgreSQL bitmapAnd, bitmapOr, bitmap index scan, bitmap heap scan".
3. Bitmap scan over a GIN composite index
When several conditions are supplied, the index converges on a small set of data blocks, which are then read sequentially and rechecked with a FILTER.
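Approach 2 is not exercised in the benchmark below (which tests the Bloom and GIN indexes), so here is a minimal, hedged sketch of how multiple single-column b-tree indexes can be combined by the planner through BitmapAnd; the index names are illustrative, not from the original.
-- Illustrative only: one b-tree index per label column (shown for c4..c6).
create index idx_t_multi_col_c4 on t_multi_col (c4);
create index idx_t_multi_col_c5 on t_multi_col (c5);
create index idx_t_multi_col_c6 on t_multi_col (c6);

-- With several equality conditions, the planner can BitmapAnd the
-- per-index bitmaps before touching the heap:
explain (analyze, buffers)
select c1, c2, c3, count(*)
  from t_multi_col
 where c4 = 3 and c5 = 2 and c6 = 1
 group by grouping sets ((c1), (c2), (c3));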
2. Design
100 million records; each record contains 32 label fields, each with a value range of 10,000, plus three attribute fields used for pivoting.
3. Preparing the test table
do language plpgsql $$
declare
  sql text;
begin
  sql := 'create table t_multi_col (id int8, c1 int default random()*100, c2 int default random()*10, c3 int default random()*10, ';
  for i in 4..35 loop
    sql := sql||'c'||i||' int default random()*10000,';
  end loop;
  sql := rtrim(sql, ',');
  sql := sql||')';
  execute sql;
end;
$$;
4. Preparing test functions (optional)
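The original leaves this optional step empty. As a hedged sketch, one possible helper (the name gen_test_data and the batch logic are illustrative, not from the original) wraps the insert used in the next step so that data can be loaded in id ranges, for example from several parallel sessions:
-- Hypothetical helper, not from the original: inserts ids in [s, e] and
-- lets the column defaults fill in the random label values.
create or replace function gen_test_data(s int8, e int8) returns void as $$
begin
  insert into t_multi_col (id) select generate_series(s, e);
end;
$$ language plpgsql strict;

-- Example: load ids 1..1000000 in one batch.
-- select gen_test_data(1, 1000000);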
5. Preparing test data
insert into t_multi_col (id) select generate_series(1,100000000);
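Optionally (this step is not in the original), refreshing planner statistics after the bulk load and before building the indexes keeps the cost estimates in the EXPLAIN output below realistic:
-- Optional housekeeping after the bulk load
vacuum (analyze) t_multi_col;

-- Quick sanity check on the generated label distribution
select min(c4), max(c4), count(*) from t_multi_col;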
1. Bloom Index
create extension bloom;

do language plpgsql $$
declare
  sql text;
begin
  sql := 'create index idx_t_multi_col on t_multi_col using bloom (';
  for i in 4..35 loop
    sql := sql||'c'||i||',';
  end loop;
  sql := rtrim(sql, ',');
  sql := sql||') with (length=80, ';
  for i in 1..32 loop
    sql := sql||'col'||i||'=2,';
  end loop;
  sql := rtrim(sql, ',');
  sql := sql||')';
  execute sql;
end;
$$;
2. GIN Index
create extension btree_gin;

do language plpgsql $$
declare
  sql text;
begin
  sql := 'create index idx_t_multi_col_gin on t_multi_col using gin (';
  for i in 4..35 loop
    sql := sql||'c'||i||',';
  end loop;
  sql := rtrim(sql, ',');
  sql := sql||')';
  execute sql;
end;
$$;
6. Preparing test scripts
vi test.sql

\set a4 random(1,10000)
\set a5 random(1,10000)
\set a6 random(1,10000)
\set a7 random(1,10000)
\set a8 random(1,10000)
\set a9 random(1,10000)
select c1,c2,c3,count(*) from t_multi_col where c4=:a4 and c5=:a5 and c6=:a6 and c7=:a7 and c8=:a8 and c9=:a9 group by grouping sets ((c1),(c2),(c3));
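For reference, GROUPING SETS ((c1),(c2),(c3)) computes the three single-column aggregates over the same filtered population in a single pass. A rough equivalent without grouping sets (for illustration only, reusing the same :a4..:a9 parameters) would be:
-- Equivalent but slower formulation without grouping sets:
-- three aggregates over the same filtered population.
select c1, null::int as c2, null::int as c3, count(*) from t_multi_col
 where c4=:a4 and c5=:a5 and c6=:a6 and c7=:a7 and c8=:a8 and c9=:a9
 group by c1
union all
select null, c2, null, count(*) from t_multi_col
 where c4=:a4 and c5=:a5 and c6=:a6 and c7=:a7 and c8=:a8 and c9=:a9
 group by c2
union all
select null, null, c3, count(*) from t_multi_col
 where c4=:a4 and c5=:a5 and c6=:a6 and c7=:a7 and c8=:a8 and c9=:a9
 group by c3;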
7. Test
1. The Bloom index takes longer because the entire index must be scanned: about 500 milliseconds.
postgres=# explain (analyze,verbose,timing,costs,buffers) select c1,c2,c3,count(*) from t_multi_col where c4=3 and c5=2 and c6=1 and c7=4 and c8=5 and c9=6 and c10=1 and c11=1 and c12=1 group by grouping sets ((c1),(c2),(c3));
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2985297.24..2985297.28 rows=3 width=20) (actual time=499.961..499.961 rows=0 loops=1)
   Output: c1, c2, c3, count(*)
   Hash Key: t_multi_col.c1
   Hash Key: t_multi_col.c2
   Hash Key: t_multi_col.c3
   Buffers: shared hit=197418
   ->  Bitmap Heap Scan on public.t_multi_col  (cost=2985296.00..2985297.23 rows=1 width=12) (actual time=499.958..499.958 rows=0 loops=1)
         Output: id, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26, c27, c28, c29, c30, c31, c32, c33, c34, c35
         Recheck Cond: ((t_multi_col.c4 = 3) AND (t_multi_col.c5 = 2) AND (t_multi_col.c6 = 1) AND (t_multi_col.c7 = 4) AND (t_multi_col.c8 = 5) AND (t_multi_col.c9 = 6) AND (t_multi_col.c10 = 1) AND (t_multi_col.c11 = 1) AND (t_multi_col.c12 = 1))
         Rows Removed by Index Recheck: 1339
         Heap Blocks: exact=1339
         Buffers: shared hit=197418
         ->  Bitmap Index Scan on idx_t_multi_col  (cost=0.00..2985296.00 rows=1 width=0) (actual time=497.718..497.718 rows=1339 loops=1)
               Index Cond: ((t_multi_col.c4 = 3) AND (t_multi_col.c5 = 2) AND (t_multi_col.c6 = 1) AND (t_multi_col.c7 = 4) AND (t_multi_col.c8 = 5) AND (t_multi_col.c9 = 6) AND (t_multi_col.c10 = 1) AND (t_multi_col.c11 = 1) AND (t_multi_col.c12 = 1))
               Buffers: shared hit=196079
 Planning time: 0.165 ms
 Execution time: 500.025 ms
(17 rows)
2. The GIN index locates the matching entries precisely and takes less than 2 milliseconds.
postgres=# explain (analyze,verbose,timing,costs,buffers) select c1,c2,c3,count(*) from t_multi_col where c4=3 and c5=2 and c6=1 and c7=4 and c8=5 and c9=6 and c10=1 and c11=1 and c12=1 group by grouping sets ((c1),(c2),(c3));
                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=69.64..69.68 rows=3 width=20) (actual time=1.151..1.151 rows=0 loops=1)
   Output: c1, c2, c3, count(*)
   Hash Key: t_multi_col.c1
   Hash Key: t_multi_col.c2
   Hash Key: t_multi_col.c3
   Buffers: shared hit=69
   ->  Bitmap Heap Scan on public.t_multi_col  (cost=68.40..69.63 rows=1 width=12) (actual time=1.149..1.149 rows=0 loops=1)
         Output: id, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26, c27, c28, c29, c30, c31, c32, c33, c34, c35
         Recheck Cond: ((t_multi_col.c4 = 3) AND (t_multi_col.c5 = 2) AND (t_multi_col.c6 = 1) AND (t_multi_col.c7 = 4) AND (t_multi_col.c8 = 5) AND (t_multi_col.c9 = 6) AND (t_multi_col.c10 = 1) AND (t_multi_col.c11 = 1) AND (t_multi_col.c12 = 1))
         Buffers: shared hit=69
         ->  Bitmap Index Scan on idx_t_multi_col_gin  (cost=0.00..68.40 rows=1 width=0) (actual time=1.146..1.146 rows=0 loops=1)
               Index Cond: ((t_multi_col.c4 = 3) AND (t_multi_col.c5 = 2) AND (t_multi_col.c6 = 1) AND (t_multi_col.c7 = 4) AND (t_multi_col.c8 = 5) AND (t_multi_col.c9 = 6) AND (t_multi_col.c10 = 1) AND (t_multi_col.c11 = 1) AND (t_multi_col.c12 = 1))
               Buffers: shared hit=69
 Planning time: 0.263 ms
 Execution time: 1.245 ms
(15 rows)
Stress test
CONNECTS=56
TIMES=300
export PGHOST=$PGDATA
export PGPORT=1999
export PGUSER=postgres
export PGPASSWORD=postgres
export PGDATABASE=postgres

pgbench -M prepared -n -r -f ./test.sql -P 5 -c $CONNECTS -j $CONNECTS -T $TIMES
8. Test results
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 56
number of threads: 56
duration: 300 s
number of transactions actually processed: 10740407
latency average = 1.564 ms
latency stddev = 0.561 ms
tps = 35796.375710 (including connections establishing)
tps = 35800.169989 (excluding connections establishing)
script statistics:
 - statement latencies in milliseconds:
         0.002  \set a4 random(1,10000)
         0.000  \set a5 random(1,10000)
         0.000  \set a6 random(1,10000)
         0.000  \set a7 random(1,10000)
         0.000  \set a8 random(1,10000)
         0.000  \set a9 random(1,10000)
         1.562  select c1,c2,c3,count(*) from t_multi_col where c4=:a4 and c5=:a5 and c6=:a6 and c7=:a7 and c8=:a8 and c9=:a9 group by grouping sets ((c1),(c2),(c3));
TPS: 35800
Average response time: 1.564 milliseconds
In fact, besides bitmap scans there is also a storage-layer optimization. The PostgreSQL core engine currently uses row storage, while column storage is available through extensions. Column storage can reduce the number of data blocks that must be scanned and improve performance further.
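As a hedged illustration (not part of the original test), one way to experiment with columnar storage is the third-party cstore_fdw extension; the table definition below is only a sketch and assumes cstore_fdw is installed.
-- Sketch: a columnar copy of the label table using the cstore_fdw
-- extension (third party; install separately).
create extension cstore_fdw;
create server cstore_server foreign data wrapper cstore_fdw;

create foreign table t_multi_col_cstore (
  id int8,
  c1 int, c2 int, c3 int,
  c4 int, c5 int  -- remaining label columns c6..c35 omitted for brevity
) server cstore_server
  options (compression 'pglz');

-- Populate with COPY (or INSERT ... SELECT in newer cstore_fdw versions),
-- then query it like any other table.
-- insert into t_multi_col_cstore select id, c1, c2, c3, c4, c5 from t_multi_col;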
References
PostgreSQL, Greenplum Application Case Book "Tathagata Palm" - Catalogue
"Database Selection - Elephant Eighteen Touches - To Architects and Developers"
PostgreSQL: using pgbench to test sysbench-related cases