HTAP Database PostgreSQL Scenario and Performance Testing 21 - (OLTP+OLAP) Sorting and Indexing

Keywords: github PostgreSQL Big Data Database

Tags

PostgreSQL, HTAP, OLTP, OLAP, Scenario and Performance Testing

Background

PostgreSQL is a database with a long history that can be traced back to 1973. It was originally designed by Michael Stonebraker, winner of the 2014 ACM Turing Award and a pioneer of relational databases. PostgreSQL offers functionality, performance, architecture, and stability comparable to Oracle's.

The PostgreSQL community has many contributors from all walks of life around the world. PostgreSQL has released a major version every year for many years, and is known for its enduring vitality and stability.

In October 2017, PostgreSQL released version 10, which carries many compelling features aimed at the requirements of mixed OLTP and OLAP (HTAP) scenarios:

"PostgreSQL 10 Features - The Most Popular HTAP Database Among Developers"

1. Multi-core parallelism enhancements

2. FDW aggregation push-down

3. Logical replication (subscription)

4. Partitioning

5. Financial-grade multiple replicas

6. json and jsonb full-text search

7. Plug-in features such as vector computing, JIT, SQL graph computing, SQL stream computing, distributed parallel computing, time-series processing, gene sequencing, chemical analysis, image analysis, etc.

PostgreSQL can be found in a wide variety of application scenarios.

PostgreSQL has developed very rapidly in recent years. The database score trend published by dbranking, a well-known database evaluation website, shows PostgreSQL's upward trajectory.

The annual community conferences held by PostgreSQL China show the same trend: more and more companies are participating and presenting, and more and more topics are being shared. They span traditional enterprises, the Internet, healthcare, finance, state-owned enterprises, logistics, e-commerce, social networking, the Internet of Vehicles, shared-economy services, cloud, gaming, public transport, aviation, railways, the military, training, consulting services, etc.

The next series of articles will introduce various application scenarios and corresponding performance indicators of PostgreSQL.

Environment

Environment deployment reference:

PostgreSQL 10 + PostGIS + Sharding (pg_pathman) + MySQL (fdw external table) on ECS Deployment Guide (for new users)

Aliyun ECS: 56 cores, 224 GB memory, 2 * 1.5 TB SSD cloud disks.

Operating System: CentOS 7.4 x64

Database version: PostgreSQL 10

PS: The CPU and I/O performance of ECS is somewhat lower than that of a comparable physical machine; assume roughly a 2x performance loss. To estimate physical-host numbers, multiply the results measured here by about 2.

Scenario - Sorting and Indexing (OLTP+OLAP)

1. Background

1. In OLTP, sorting usually involves small data sets.

2. In OLAP, it is usually necessary to sort large result sets and pick the TOP N.

3. Sorting speed also directly affects index build speed. (Note that PostgreSQL supports building indexes CONCURRENTLY, without blocking DML; see the sketch after the reference list below.)

4. In OLAP, there are simple and efficient ways to extract the TOP N: estimation, HLL, sampling, statistics, etc.

"Second-Level Arbitrary Dimension Analysis 1TB-Level Large Table-Satisfying the Statistical Analysis Requirements of Efficient TOP N by Sampling Valuation"

Greenplum Best Practices - Use of Valuation Plug-ins hll (and optimization of hll fractional aggregation functions)

<PostgreSQL hll (HyperLogLog) extension for "State of The Art Cardinality Estimation Algorithm" - 1>

<PostgreSQL hll (HyperLogLog) extension for "State of The Art Cardinality Estimation Algorithm" - 2>

<PostgreSQL hll (HyperLogLog) extension for "State of The Art Cardinality Estimation Algorithm" - 3>

Quick Line Estimation Using explain Plan Rows
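
To illustrate point 3 above: a minimal sketch of a non-blocking index build (the index name is hypothetical). CONCURRENTLY scans the table twice and waits for in-flight transactions, so it is slower than a plain CREATE INDEX, but it does not block concurrent DML:

-- build an index without blocking INSERT/UPDATE/DELETE on the table
create index concurrently idx_t_large_id_c on t_large (id);

-- a concurrent build that fails leaves an INVALID index behind; drop it the same way
drop index concurrently idx_t_large_id_c;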

2. Design

1. Sorting a small data set: 10,000 rows.

2. Sorting a large data set: 100 million rows.

3. Index creation efficiency on a large data set: 100 million rows.

4. Estimated TOP N on big data: 100 million rows.

5. Big data, composite index, fetching the ID with the top c2 within any group (c1): 100 million rows.

3. Preparing test tables

create table t_small (id int);  
  
create table t_large (id int);  
  
create table t_estimate (id int);  
  
create table t_idx_test (id int, c1 int, c2 int);  

4. Preparing test functions (optional)

5. Preparing test data

1. Sorting a small data set: 10,000 rows.

insert into t_small select generate_series(1,10000);  

2. Sorting a large data set: 100 million rows.

insert into t_large select generate_series(1,100000000);  

4. Estimated TOP N on big data: 100 million rows.

vi test.sql

\set id random_gaussian(1, 1000000, 2.5)
insert into t_estimate values (:id);

pgbench -M prepared -n -r -P 5 -f ./test.sql -c 50 -j 50 -t 2000000

analyze t_estimate;
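
random_gaussian concentrates the generated IDs around the middle of the range, so a small set of hot values dominates; that is what makes TOP N estimation meaningful here. A quick sanity check of the skew (a hypothetical verification query, not part of the original test):

-- exact TOP 5 by frequency, for comparison with the estimate in test 4
select id, count(*) from t_estimate group by id order by count(*) desc limit 5;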

5. Big data, composite index, fetching the ID with the top c2 within any group (c1): 100 million rows.

insert into t_idx_test select id, random()*1000, random()*900000 from generate_series(1,100000000) t(id);  
create index idx_t_idx_test on t_idx_test (c1,c2);  

6. Preparing test scripts

vi test.sql  
  
\set c1 random(1,1000)  
select id from t_idx_test where c1=:c1 order by c2 desc limit 1;  
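
Because the composite index orders c2 within each c1, each query should be answered by a backward index scan that stops after one matching row, with no sort node; this can be checked with EXPLAIN (the plan shape below is an expectation, verify on your own instance):

explain select id from t_idx_test where c1 = 1 order by c2 desc limit 1;
-- expected shape: Limit -> Index Scan Backward using idx_t_idx_test on t_idx_test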

Stress testing

CONNECTS=112  
TIMES=300  
export PGHOST=$PGDATA  
export PGPORT=1999  
export PGUSER=postgres  
export PGPASSWORD=postgres  
export PGDATABASE=postgres  
  
pgbench -M prepared -n -r -f ./test.sql -P 5 -c $CONNECTS -j $CONNECTS -T $TIMES  

7. Test

1. Sorting a Small Data Set

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_small order by id desc limit 1;  
                                                          QUERY PLAN  
------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=195.00..195.00 rows=1 width=4) (actual time=2.580..2.581 rows=1 loops=1)  
   Output: id  
   Buffers: shared hit=45  
   ->  Sort  (cost=195.00..220.00 rows=10000 width=4) (actual time=2.580..2.580 rows=1 loops=1)  
         Output: id  
         Sort Key: t_small.id DESC  
         Sort Method: top-N heapsort  Memory: 25kB  
         Buffers: shared hit=45  
         ->  Seq Scan on public.t_small  (cost=0.00..145.00 rows=10000 width=4) (actual time=0.006..1.184 rows=10000 loops=1)  
               Output: id  
               Buffers: shared hit=45  
 Planning time: 0.027 ms  
 Execution time: 2.591 ms  
(13 rows)  

2. Sorting a Large Data Set

alter table t_large set (parallel_workers = 32);  -- allow up to 32 parallel workers for this table
set parallel_setup_cost = 0;                      -- ignore parallel startup cost in planning
set parallel_tuple_cost = 0;                      -- ignore per-tuple transfer cost in planning
set max_parallel_workers_per_gather = 32;         -- cap workers per Gather/Gather Merge node
set work_mem = '1GB';                             -- memory budget per sort
explain select * from t_large order by id desc limit 1;  
  
postgres=# explain select * from t_large order by id desc limit 1;  
                                         QUERY PLAN  
---------------------------------------------------------------------------------------------  
 Limit  (cost=810844.97..810845.00 rows=1 width=4)  
   ->  Gather Merge  (cost=810844.97..3590855.42 rows=100000032 width=4)  
         Workers Planned: 32  
         ->  Sort  (cost=810844.14..818656.64 rows=3125001 width=4)  
               Sort Key: id DESC  
               ->  Parallel Seq Scan on t_large  (cost=0.00..473728.01 rows=3125001 width=4)  
(6 rows)  
  
postgres=# select * from t_large order by id desc limit 1;  
    id  
-----------  
 100000000  
(1 row)  
Time: 1482.964 ms (00:01.483)  
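
For comparison, parallelism can be switched off in the same session to measure the serial sort time (a sketch; the timing difference will depend on the machine):

-- force a serial plan for comparison
set max_parallel_workers_per_gather = 0;
select * from t_large order by id desc limit 1;

-- restore the parallel setting used above
set max_parallel_workers_per_gather = 32;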

3. Index Creation Efficiency on a Large Data Set

postgres=# create index idx_t_large_id on t_large(id);  
CREATE INDEX  
Time: 37937.482 ms (00:37.937)  
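
The index build is itself a big sort, so it is sensitive to maintenance_work_mem (PostgreSQL 10 has no parallel index build yet; that arrives in PostgreSQL 11). A sketch with an illustrative value and a hypothetical index name:

-- give the CREATE INDEX sort more memory to reduce on-disk merge passes
set maintenance_work_mem = '4GB';
create index idx_t_large_id_2 on t_large (id);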

4. Estimated TOP N on Big Data

postgres=# select most_common_vals, most_common_freqs from pg_stats where tablename='t_estimate';  
-[ RECORD 1 ]-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
most_common_vals  | {354316,354445,404899,431899,451975,525707,545934,552219,631936,654703,686785,52824,97919,100231,134912,137688,158267,161541,171349,182376,182892,186086,192265,196224,197934,206937,207098,208325,213459,218788,218939,225221,226377,238291,239857,245513,245868,250632,250836,251535,251972,254658,254998,255236,256667,259600,260215,263041,266027,268086,271091,271490,271520,272019,272459,282086,286285,287848,288015,288233,288310,288344,288605,289181,289901,291581,296327,301385,301631,304765,304923,306094,306309,307188,312000,313190,313449,315581,317808,320374,320769,322517,322889,323389,326463,326738,330239,331553,334323,335451,335588,337521,338605,342766,344188,344662,344730,345081,345096,346053}  
most_common_freqs | {0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,0.0001,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05,6.66667e-05}  
  
Time: 1.274 ms  
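
most_common_vals and most_common_freqs can be combined directly into an estimated TOP N with approximate row counts. A minimal sketch, assuming the anyarray-to-int[] cast via text (valid for this single-column int table) and using the known total of 100 million rows:

-- estimated TOP 10 values and their approximate row counts from planner statistics
select u.val as id, (u.freq * 100000000)::bigint as estimated_rows
from pg_stats,
     unnest(most_common_vals::text::int[], most_common_freqs) as u(val, freq)
where tablename = 't_estimate' and attname = 'id'
order by u.freq desc
limit 10;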

5. Big Data, Composite Index: Fetch the ID with the Top c2 Within Any Group (c1), 100 Million Rows

transaction type: ./test.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 112  
number of threads: 112  
duration: 300 s  
number of transactions actually processed: 203376551  
latency average = 0.165 ms  
latency stddev = 0.324 ms  
tps = 677702.381595 (including connections establishing)  
tps = 677897.901150 (excluding connections establishing)  
script statistics:  
 - statement latencies in milliseconds:  
         0.002  \set c1 random(1,1000)  
         0.168  select id from t_idx_test where c1=:c1 order by c2 desc limit 1;  

TPS: 677897

Average response time: 0.165 milliseconds

8. Test Results

1. Sorting a small data set: sorting 10,000 rows takes 2.6 milliseconds.

2. Sorting a large data set: sorting 100 million rows takes 1.48 seconds.

3. Index creation on a large data set: indexing 100 million rows takes 37.9 seconds.

4. Estimated TOP N on big data: estimating the TOP N over 100 million rows takes 1.27 milliseconds.

5. Big data, composite index, fetching the ID with the top c2 within any group (c1), 100 million rows: TPS 677897, average response time 0.165 milliseconds.

References

PostgreSQL and Greenplum Application Case Book "Tathagata Palm" - Catalogue

"Database Selection - Eighteen Touches of the Elephant - For Architects and Developers"

"Testing sysbench-Related Cases on PostgreSQL with pgbench"

"Hua Shan Lun Jian of the Database World - tpc.org"

https://www.postgresql.org/docs/10/static/pgbench.html
