HTAP Database PostgreSQL Scenario and Performance Testing 13 - (OLTP) String Search - Similarity Query

Tags

PostgreSQL, HTAP, OLTP, OLAP, Scenario and Performance Testing

Background

PostgreSQL is a database with a long history, which can be traced back to 1973. It was originally designed by Michael Stonebraker, the 2014 Turing Award winner and a pioneer of relational databases. PostgreSQL offers functionality, performance, architecture, and stability comparable to Oracle.

The PostgreSQL community has many contributors from all walks of life around the world. For many years PostgreSQL has released a major version every year, and it is known for its enduring vitality and stability.

In October 2017, PostgreSQL released version 10, which carries many impressive features aimed at mixed OLTP and OLAP (HTAP) workloads:

PostgreSQL 10 features, the most popular HTAP database for developers

1. Multi-core parallelism enhancements

2. FDW aggregation push-down

3. Logical replication (subscription)

4. Partitioning

5. Financial-grade multiple replicas

6. Full-text search on json and jsonb

7. Features delivered as extensions, such as vectorized computing, JIT, SQL graph computing, SQL stream computing, distributed parallel computing, time-series processing, gene sequencing, chemical analysis, image analysis, and more.

PostgreSQL can be found in a wide variety of application scenarios.

PostgreSQL has developed very rapidly in recent years. The scoring trend on dbranking, a well-known database ranking website, shows PostgreSQL's clear upward trajectory.

The annual PostgreSQL China community conferences show the same trend: more companies are participating, more companies are presenting, and the range of topics is broader every year. They span traditional enterprises, the Internet, healthcare, finance, state-owned enterprises, logistics, e-commerce, social networking, the Internet of Vehicles, the sharing economy, cloud, gaming, public transport, aviation, railways, the military, training and consulting services, and more.

The next series of articles will introduce various application scenarios and corresponding performance indicators of PostgreSQL.

Environment

Environment deployment reference:

PostgreSQL 10 + PostGIS + Sharding (pg_pathman) + MySQL (fdw external table) on ECS Deployment Guide (for new users)

Aliyun ECS: 56 cores, 224 GB RAM, 2 x 1.5 TB SSD cloud disks.

Operating System: CentOS 7.4 x64

Database version: PostgreSQL 10

PS: The CPU and IO performance of an ECS instance is lower than that of a comparable physical machine; assume roughly a 50% performance reduction. To estimate results on a physical machine, multiply the numbers measured here by 2.

Scenario - String Search - Similarity Query (OLTP)

1. Background

String search is a very common business requirement, which includes:

1. Prefix fuzzy query. (A B-tree index can be used.)

2. Suffix fuzzy query. (A B-tree index can be used.)

3. Infix fuzzy query (wildcards on both sides). (pg_trgm with a GIN index can be used.)

https://www.postgresql.org/docs/10/static/pgtrgm.html

4. Full-text search. (The full-text search types with a GIN or RUM index can be used.)

5. Regular-expression query. (pg_trgm with a GIN index can be used.)

6. Similarity query. (pg_trgm with a GIN index can be used.)

Generally speaking, databases cannot accelerate query types 3 through 6 above, but PostgreSQL is very powerful and supports index acceleration for all of them. (Queries and writes do not conflict, and the indexes are maintained in real time.) Typical query patterns and index choices are sketched below.

Users do not need to synchronize data to a search engine and query it there; besides, search engines typically only support full-text retrieval and do not handle regular-expression, similarity, or fuzzy requirements.

PostgreSQL can greatly simplify the user's architecture, reduce development cost, and guarantee fully real-time query results.
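To make the six query types concrete, here is a minimal sketch of the typical query patterns and index choices, using a hypothetical table t(info text) rather than the test table below; treating a B-tree index on reverse(info) as the usual way to serve suffix queries is my assumption, not something stated above.

-- Hypothetical table, for illustration only
create table t (id int, info text);

-- 1. Prefix fuzzy query: a plain B-tree index (text_pattern_ops) can serve it
create index idx_t_prefix on t (info text_pattern_ops);
select * from t where info like 'abc%';

-- 2. Suffix fuzzy query: commonly served by a B-tree index on reverse(info) (assumption)
create index idx_t_suffix on t (reverse(info) text_pattern_ops);
select * from t where reverse(info) like reverse('%abc');

-- 3, 5, 6. Infix fuzzy, regular-expression and similarity queries: pg_trgm + GIN
create extension if not exists pg_trgm;
create index idx_t_trgm on t using gin (info gin_trgm_ops);
select * from t where info like '%abc%';   -- infix fuzzy
select * from t where info ~ 'a.c';        -- regular expression
select * from t where info % 'abc';        -- similarity (threshold set via set_limit())

-- 4. Full-text search: tsvector + GIN
create index idx_t_fts on t using gin (to_tsvector('simple', info));
select * from t where to_tsvector('simple', info) @@ to_tsquery('simple', 'abc');

Note that the single pg_trgm GIN index covers types 3, 5, and 6 at once, which is the approach used in the test below.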

2. Design

The data set contains 100 million text records, each a random string of 128 Chinese characters. Similarity queries are executed against randomly supplied strings.

Similarity queries solve problems that fuzzy queries and full-text search cannot: for example, when the business needs to find "PostgreSQL", a similarity query can also match "p0stgresl", because the two strings are similar enough.
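A quick illustration (my own example, not from the original article) of why "p0stgresl" can match "PostgreSQL": pg_trgm's similarity() scores the trigram overlap between two strings, and the % operator matches once that score exceeds the current threshold.

create extension if not exists pg_trgm;

-- Trigram similarity between the target term and a misspelled variant (0 = unrelated, 1 = identical)
select similarity('postgresql', 'p0stgresl');

-- With a suitable threshold the % operator treats the two as a match
select set_limit(0.3);
select 'postgresql' % 'p0stgresl';   -- true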

https://www.postgresql.org/docs/10/static/pgtrgm.html

PostgreSQL Text Data Analysis Practice - Similarity Analysis

17 Text Similarity Algorithms and GIN Index-pg_similarity

Talking about Similarity Algorithms - Effective similarity search in PostgreSQL

Talk about the Technology behind Double Eleven - What's a Millisecond Minute, Try Regularity and Similarity

Similar applications include image similarity search:

Reflections on Red Packet Technology in AR Virtual Reality - Perfect Combination of GIS and Image Recognition

Application of PostgreSQL in Video and Picture Reduplication and Image Search Business

"Weak water 3,000, only take a ladle, when image search meets PostgreSQL (Haar Wavet)"

3. Preparing the test table

create extension pg_trgm;  
  
create table t_likeall (  
  id int,  
  info text  
);  
  
create index idx_t_likeall_1 on t_likeall using gin (info gin_trgm_ops);  
  
-- Set the similarity threshold (0-1, floating point)
-- select set_limit(0.7);

-- Query records whose similarity exceeds the threshold
-- SELECT info, similarity(info, 'search string') AS sml
-- FROM t_likeall
-- WHERE info % 'search string'  -- match records above the similarity threshold
-- ORDER BY sml DESC;
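As a side note (not part of the original setup), show_trgm() exposes the trigrams that gin_trgm_ops indexes, which is why the % operator and similarity() can be served by an index lookup instead of a full scan.

-- Inspect the trigrams extracted from a string; the GIN index stores exactly these
select show_trgm('postgresql');

-- To confirm the index is used, prepend EXPLAIN (ANALYZE) to the similarity query above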

4. Preparing test functions (optional)

-- Generate a random string of Chinese characters
create or replace function gen_hanzi(int) returns text as $$  
declare  
  res text;  
begin  
  if $1 >=1 then  
    select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1);  
    return res;  
  end if;  
  return null;  
end;  
$$ language plpgsql strict;  

-- Run a similarity query against a random test string (for the stress test)
create or replace function get_t_likeall_test(int, real) returns setof record as  
$$  
declare  
  str text;  
begin  
  -- Select an input string  
  select info into str from t_likeall_test where id=$1;  
  
  -- Set the similarity threshold
  perform set_limit($2);  
  
  -- Find records above the similarity threshold
  return query execute 'select *, '''||str||''' as str, similarity(info, '''||str||''') as sml from t_likeall where info % '''||str||''' order by sml DESC limit 1';  
end;  
$$ language plpgsql strict;  
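
An optional sanity check (my addition, not in the original article) to confirm the generator works as expected before loading 100 million rows:

-- Generate one sample string and verify it contains 128 characters
select gen_hanzi(128);
select length(gen_hanzi(128));   -- expected: 128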

5. Preparing test data

insert into t_likeall select id, gen_hanzi(128) from generate_series(1,100000000) t(id);  

Generate 2 million test strings by taking substrings of the base data and appending noise characters, so that they are similar to, but not identical to, existing records.

create table t_likeall_test (id serial primary key, info text);  
  
-- Take a 120-character substring starting at a random position and append 6 noise characters; 2 million rows
insert into t_likeall_test (info) select substring(info, (random()*10)::int, 120)||gen_hanzi(6) from t_likeall limit 2000000;  
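
Optionally (not part of the original article), a quick check that both tables were populated as intended:

select count(*) from t_likeall;        -- expected: 100000000
select count(*) from t_likeall_test;   -- expected: 2000000

-- Peek at a few of the generated test strings
select id, left(info, 32) from t_likeall_test limit 3;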

6. Preparing the test script

vi test.sql  
  
\set id random(1,2000000)  
select * from get_t_likeall_test(:id, 0.855) as t(id int , info text, str text, sml real);  
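
For readability, here is roughly what a single transaction of test.sql executes once the wrapper function expands its dynamic SQL; 'test string' stands in for the info value of the randomly selected t_likeall_test row (an illustrative expansion, not an exact transcript):

-- Equivalent, expanded form of: select * from get_t_likeall_test(:id, 0.855) ...
select set_limit(0.855);

select *, 'test string' as str, similarity(info, 'test string') as sml
from t_likeall
where info % 'test string'
order by sml desc
limit 1;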

7. Test

A single similarity query responds in under 20 milliseconds. (With bind variables and cache hits, the response time is even lower.)

postgres=# select * from get_t_likeall_test(2, 0.855) as t(id int , info text, str text, sml real);  
-[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
id   | 2  
info |  (random 128-character Chinese string, omitted)  
str  |  (random test string, omitted)  
sml  | 0.855072  
  
Time: 19.455 ms  

Stress test

CONNECTS=56  
TIMES=300  
export PGHOST=$PGDATA  
export PGPORT=1999  
export PGUSER=postgres  
export PGPASSWORD=postgres  
export PGDATABASE=postgres  
  
pgbench -M prepared -n -r -f ./test.sql -P 5 -c $CONNECTS -j $CONNECTS -T $TIMES  

8. Test results

transaction type: ./test.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 56  
number of threads: 56  
duration: 300 s  
number of transactions actually processed: 459463  
latency average = 36.562 ms  
latency stddev = 8.063 ms  
tps = 1531.259508 (including connections establishing)  
tps = 1531.422742 (excluding connections establishing)  
script statistics:  
 - statement latencies in milliseconds:  
         0.002  \set id random(1,2000000)    
        36.565  select * from get_t_likeall_test(:id, 0.855) as t(id int , info text, str text, sml real);  

TPS: 1531

Average response time: 36.562 milliseconds

Reference resources

PostgreSQL, Greenplum Application Case Book "Tathagata Palm" - Catalogue

"Database Selection - Elephant Eighteen Touches - To Architects and Developers"

PostgreSQL uses pgbench to test sysbench-related cases

"Hua Shan Lun Jian tpc.org in the field of database"

https://www.postgresql.org/docs/10/static/pgbench.html
