HTAP Database PostgreSQL Scenario and Performance Testing 16 - (OLTP) Text Feature Vector - Similar Feature (Heming...) Query

Keywords: github PostgreSQL Database SQL

Label

PostgreSQL, HTAP, OLTP, OLAP, Scenario and Performance Testing

background

PostgreSQL is a database with a long history, which can be traced back to 1973. It was first awarded by the 2014 Computer Turing Prize winner and the originator of relational database. Michael_Stonebraker PostgreSQL has similar functions, performance, architecture and stability as Oracle.

The PostgreSQL community has many contributors, from all walks of life around the world. Over the years, PostgreSQL releases a large version every year, known for its enduring vitality and stability.

In October 2017, PostgreSQL released version 10, which carries many amazing features. The goal is to meet the requirements of mixed HTAP scenarios of OLAP and OLTP:

PostgreSQL 10 features, the most popular HTAP database for developers

1. Multi-core Parallel Enhancement

2. fdw aggregation push-down

3. Logical Subscription

4, zoning

5. Multi-copies at the financial level

6. json, jsonb full-text retrieval

7. There are also some features of plug-in form, such as vector computing, JIT, SQL graph computing, SQL stream computing, distributed parallel computing, time series processing, gene sequencing, chemical analysis, image analysis, etc.

PostgreSQL can be seen in various application scenarios:

PostgreSQL has developed very rapidly in recent years. From the database scoring trend of dbranking, a well-known database evaluation website, we can see the upward trend of PostgreSQL:

From the annual community meetings held in PostgreSQL China, we can see the same trend. More and more companies are participating, more and more companies are sharing, and more and more subjects are shared. They span traditional enterprises, Internet, medical, financial, state-owned enterprises, logistics, e-commerce, social networking, vehicle networking, sharing XX, cloud, game, public transport, aviation, railway, military, training. Training, consulting services, etc.

The next series of articles will introduce various application scenarios and corresponding performance indicators of PostgreSQL.

Environmental Science

Reference to Environmental Deployment Method:

PostgreSQL 10 + PostGIS + Sharing (pg_pathman) + MySQL (fdw external table) on ECS Deployment Guide (for new users)

Aliyun ECS: 56 core, 224G, 1.5TB*2 SSD cloud disk.

Operating System: CentOS 7.4 x64

Database version: PostgreSQL 10

PS: The CPU and IO performance of ECS is discounted by physical opportunity, and can be estimated by one-fold performance decline. Running physical mainframe can be estimated by multiplying the performance tested here by 2.

Scene-Text Feature Vector-Similarity Feature (Heming...) Query (OLTP)

1, background

For long text, or some longer text, if we want to search for text with similar semantics, it is not appropriate to use full-text retrieval and fuzzy query, which can not meet the needs.

The usual method is to extract the feature words of the text and search similar text according to the feature.

For example, the closer the Hamming distance between different texts is, the more similar it is.

"Massive Data, Simhash Distance Efficient Retrieval (smlar) - Aliyun RDS PosgreSQL Best Practices"

2, design

One hundred million Hamming codes of text eigenvectors are input into any Hamming codes to obtain similar records.

100 million Hamming codes, searching for records within 3 distances from the specified Hamming codes.

3. Preparing test sheets

create extension smlar;  
  
create table hm3 (id int, hmval bit(64), hmarr text[]);

4. Preparing test functions (optional)

Functions for generating random Hamming codes

create or replace function gen_rand_bit() returns bit(64) as $$  
  select (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64);  
$$ language sql strict;  
  
create or replace function gen_arr(text) returns text[] as $$  
  select regexp_split_to_array('1_'||substring($1,1,16)||',2_'||substring($1,17,16)||',3_'||substring($1,33,16)||',4_'||substring($1,41,16), ',') ;  
$$ language sql strict;

Functions for testing records within 3 distances between search and specified Hamming codes

create or replace function f_test () returns setof record as $$  
declare  
  ts text;  
  arr text[];  
begin  
  set smlar.type = overlap;  
  set smlar.threshold = 3;  
  set LOCAL enable_seqscan=off;  
  
  select gen_rand_bit()::text into ts;  
  select gen_arr(ts) into arr;  
  return query select  
    *,  
    smlar( hmarr, arr)  
  from  
    hm3  
  where  
    hmarr % arr  
    and length(replace(bitxor(ts::bit(64), hmval)::text,'0','')) < 2  
  limit 1;  
end;  
$$ language plpgsql strict;

5. Preparing test data

insert into hm3  
select  
  id,  
  val::bit(64),  
  regexp_split_to_array('1_'||substring(val,1,16)||',2_'||substring(val,17,16)||',3_'||substring(val,33,16)||',4_'||substring(val,41,16), ',')  
from  
(select id, (sqrt(random())::numeric*9223372036854775807*2-9223372036854775807::numeric)::int8::bit(64)::text as val from generate_series(1,100000000) t(id)) t;  
  
create index idx_hm3 on hm3 using gin(hmarr _text_sml_ops );

6. Preparing test scripts

vi test.sql  
  
select * from f_test() as t(id int, hmval bit(64), hmarr text[], dist real);

7, test

Single similar query efficiency, response time is less than 2 milliseconds. (With bound variables and CACHE hits, the response time is lower. )

select  
    *,  
    smlar( hmarr, '{1_0000000010010010,2_1100110110010100,3_1101110100011011,4_0001101111111111}'::text[])  
  from  
    hm3  
  where  
    hmarr % '{1_0000000010010010,2_1100110110010100,3_1101110100011011,4_0001101111111111}'::text[]  
    and length(replace(bitxor('0000000010010010110011011001010011011101000110111111111001111111'::bit(64), hmval)::text,'0','')) < 2  
  limit 1;  
  
 id |                              hmval                               |                                     hmarr                                     | smlar  
----+------------------------------------------------------------------+-------------------------------------------------------------------------------+-------  
  1 | 0000000010010010110011011001010011011101000110111111111001111110 | {1_0000000010010010,2_1100110110010100,3_1101110100011011,4_0001101111111110} |     3  
(1 row)  
  
Time: 1.335 ms  
  
postgres=# explain (analyze,verbose,timing,costs,buffers) select  
    *,  
    smlar( hmarr, '{1_0000000010010010,2_1100110110010100,3_1101110100011011,4_0001101111111111}'::text[])  
  from  
    hm3  
  where  
    hmarr % '{1_0000000010010010,2_1100110110010100,3_1101110100011011,4_0001101111111111}'::text[]  
    and length(replace(bitxor('0000000010010010110011011001010011011101000110111111111001111111'::bit(64), hmval)::text,'0','')) < 2  
  limit 1;  
                                                                             QUERY PLAN  
---------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=811.33..814.35 rows=1 width=138) (actual time=0.563..0.563 rows=1 loops=1)  
   Output: id, hmval, hmarr, (smlar(hmarr, '{1_0000000010010010,2_1100110110010100,3_1101110100011011,4_0001101111111111}'::text[]))  
   Buffers: shared hit=19  
   ->  Bitmap Heap Scan on public.hm3  (cost=811.33..101253.67 rows=33333 width=138) (actual time=0.561..0.561 rows=1 loops=1)  
         Output: id, hmval, hmarr, smlar(hmarr, '{1_0000000010010010,2_1100110110010100,3_1101110100011011,4_0001101111111111}'::text[])  
         Recheck Cond: (hm3.hmarr % '{1_0000000010010010,2_1100110110010100,3_1101110100011011,4_0001101111111111}'::text[])  
         Filter: (length(replace((bitxor(B'0000000010010010110011011001010011011101000110111111111001111111'::bit(64), hm3.hmval))::text, '0'::text, ''::text)) < 2)  
         Heap Blocks: exact=1  
         Buffers: shared hit=19  
         ->  Bitmap Index Scan on idx_hm3  (cost=0.00..803.00 rows=100000 width=0) (actual time=0.538..0.538 rows=1 loops=1)  
               Index Cond: (hm3.hmarr % '{1_0000000010010010,2_1100110110010100,3_1101110100011011,4_0001101111111111}'::text[])  
               Buffers: shared hit=18  
 Planning time: 0.134 ms  
 Execution time: 0.602 ms  
(14 rows)  
  
Time: 1.269 ms

Pressure measurement

CONNECTS=56  
TIMES=300  
export PGHOST=$PGDATA  
export PGPORT=1999  
export PGUSER=postgres  
export PGPASSWORD=postgres  
export PGDATABASE=postgres  
  
pgbench -M prepared -n -r -f ./test.sql -P 5 -c $CONNECTS -j $CONNECTS -T $TIMES

8. Test results

transaction type: ./test.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 56  
number of threads: 56  
duration: 300 s  
number of transactions actually processed: 14721374  
latency average = 1.140 ms  
latency stddev = 0.590 ms  
tps = 49053.614018 (including connections establishing)  
tps = 49054.615079 (excluding connections establishing)  
script statistics:  
 - statement latencies in milliseconds:  
         1.139  select * from f_test() as t(id int, hmval bit(64), hmarr text[], dist real);