Development Practice of PostgreSQL Content Random Recommendation System-Random Recommendation of Articles

Keywords: Database SQL PostgreSQL

Label

PostgreSQL, Arrays, Articles, Random Recommendations, Forums, E-Commerce

background

Content recommendation is a common requirement, such as forums, e-commerce, news clients, etc.

Simpler requirements: Edit some content ID s and generate a list of recommendations. (For example, generate a recommendation list like this every day.) Then randomly recommend to the user (while filtering the read content).

Higher recommendation requirements: should be based on different tastes, such as portraits of members themselves, classification. The server generates different recommendation lists for different tastes. Directional recommendation.

This paper introduces the practice of the first requirement. Using PostgreSQL, a medium-sized PG instance (28 cores) can easily reach the recommendation of 500,000 content throughput per second.

DEMO

Take the forum as an example, there are articles, a list of selected articles edited, members, members'reading records. When users open the page, they randomly recommend 20 articles according to the selected list (while filtering the read content).

Suppose the selection list has 2,000 documents and 10 million members.

1. List of articles

create table tbl_art (  
  artid int8,  -- Article ID  
  content text,  -- Article content  
  crt_time timestamp  -- Article creation time  
  -- ...  -- Others, titles, authors,....  
);  

Membership form (omitted).

2. List of ID of Recommended Articles

create table tbl_art_list (  
  list_time timestamp primary key,  -- List generation time  
  artid int8[] not null,   -- What articles are included?( ID),Using Array Storage  
  min_crt_time timestamp not null,  -- Of these articles, the oldest is the time. Derive from tbl_art.crt_time .  Used to clean up the user's reading log.  
  arrlen int not null   -- artid Length (including several elements, i.e. selected articles)  
  -- classid int  If members have a category (label), you can create a selection list by category.
);  

3. Write the selected list and get the last list for recommendation every time you recommend.

If there is a directional requirement (recommended by the Member s tag, change the table tbl_art_list structure, add a CLASS field, and add a CLASS field to the membership table. When members open the page, they match the last list in tbl_art_list with CLASS.

As follows, 2000 selected article ID s are generated. A record.

insert into tbl_art_list values (  
  now(),   
  array(select (random()*1000000)::int8 from generate_series(1,2000)),   
  now(),   
  2000  
) ;  

4. Readed Records

create table tbl_read_rec (  
  uid int8,  -- Member ID  
  crt_time timestamp,  -- Reading time  
  artid int8,  -- Article ID  
  primary key(uid,artid)  -- Primary key (a document, once a member has been read, recorded only once)  
);  
  
create index idx_crt_time_1 on tbl_read_rec (crt_time);  

5. Random access to recommendation article ID

When the user opens the recommendation page, he enters the user ID and how many selected documents are GET (random GET from the selected list).

Returns an array of selected articles from GET and filters out random article ID s that have been read.

create or replace function get_artid(  
  i_uid int8,   -- user ID  
  rows int  -- How many randomly obtained articles ID  
) returns int8[] as $$  
declare   
  v_artid int8[];  -- Selected ID list  
  len int;  -- Selected ID Number of Listed Articles  
  res int8[] := (array[])::int8[];   -- Result  
  tmp int8;  -- Intermediate variables, selected from ID Random articles from the list ID  
  loopi int := 0;  -- Loop variables, how many qualified papers have been obtained ID  
  loopx int := 0;  -- Loop variables, how many times have they been looped (upper limit, depending on selection) ID Number of listing articles, such as 1.5 times len)  
begin   
  select artid,arrlen into v_artid,len   
    from tbl_art_list order by list_time desc limit 1;  -- Get the last item from the Editor's Selected List.  
  loop   
    if loopi >= rows or loopx >= 1.5*len then    -- Have you traversed all the selected articles? ID (Random traversal)  
      return res;    
    end if;  
    tmp := v_artid[floor(random()*len)+1];   -- Selected articles IDs Get random ID  
    perform 1 from tbl_read_rec where uid=i_uid and artid=tmp;   -- Judge whether you have read  
    if not found then  
      res := array_append(res, tmp);  -- Unread, APPEND To return the result  
      loopi := loopi +1 ;  -- Increase progressively  
    end if;  
    loopx := loopx +1 ;  -- Increase progressively  
  end loop;   
  return res;  
end;  
$$ language plpgsql strict;   

6. Cleaning up Reading Records

Use limit to clean up several items at a time.

Skp lock is used to support parallel DELETE.

delete from tbl_read_rec where ctid = any (array(  
  select ctid from tbl_read_rec where crt_time < (select min_crt_time from tbl_art_list order by list_time desc limit 1) limit 10000 for update skip locked  
));  

7, test

GET Selected Article ID (Satisfies Random, Filters Readed).

The performance is OK.

postgres=# select get_artid(1,20);  
                                                                  get_artid                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------  
 {919755,3386,100126,761631,447551,511825,168645,211819,862572,330666,942247,600470,843042,511825,295568,829303,382312,452915,499113,164219}  
(1 row)  
  
Time: 0.377 ms  
postgres=# select get_artid(1,20);  
                                                                 get_artid                                                                    
--------------------------------------------------------------------------------------------------------------------------------------------  
 {257929,796892,343984,363615,418506,326628,91731,958663,127918,794101,49124,410347,852461,922276,366815,926232,134506,153306,123694,67087}  
(1 row)  
  
Time: 0.347 ms  

Assuming the random ID obtained, read it immediately and write all the records obtained. (for pressure measurement)

postgres=# insert into tbl_read_rec select 1, now(), unnest(get_artid(1,20)) on conflict do nothing;  
INSERT 0 20  
Time: 0.603 ms  
postgres=# insert into tbl_read_rec select 1, now(), unnest(get_artid(1,20)) on conflict do nothing;  
INSERT 0 20  
Time: 0.494 ms  
postgres=# insert into tbl_read_rec select 1, now(), unnest(get_artid(1,20)) on conflict do nothing;  
INSERT 0 20  
Time: 0.479 ms  
postgres=# insert into tbl_read_rec select 1, now(), unnest(get_artid(1,20)) on conflict do nothing;  
INSERT 0 20  
Time: 0.494 ms  

8. Pressure measurement

vi test.sql  
  
\set uid random(1,10000000)  
insert into tbl_read_rec select :uid, now(), unnest(get_artid(:uid,20)) on conflict do nothing;  
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 28 -j 28 -T 120  
transaction type: ./test.sql  
scaling factor: 1  
query mode: prepared  
number of clients: 28  
number of threads: 28  
duration: 120 s  
number of transactions actually processed: 3074866  
latency average = 1.093 ms  
latency stddev = 0.577 ms  
tps = 25623.620112 (including connections establishing)  
tps = 25625.634577 (excluding connections establishing)  
statement latencies in milliseconds:  
         0.002  \set uid random(1,10000000)  
         1.091  insert into tbl_read_rec select :uid, now(), unnest(get_artid(:uid,20)) on conflict do nothing;  

After 120 seconds manometry, the read records were expressed to 3GB.

postgres=# \dt+  
                          List of relations  
 Schema |     Name     | Type  |  Owner   |    Size    | Description   
--------+--------------+-------+----------+------------+-------------  
 public | tbl_art      | table | postgres | 8192 bytes |   
 public | tbl_art_list | table | postgres | 64 kB      |   
 public | tbl_read_rec | table | postgres | 3047 MB    |   
(3 rows)  

61.2 million records have been read.

postgres=# select count(*) from tbl_read_rec ;  
  count     
----------  
 61206909  
(1 row)  

as follows

postgres=# select uid,count(*) from tbl_read_rec group by 1 limit 10;  
 uid | count   
-----+-------  
   1 |  2000  
   6 |    20  
  11 |    20  
  12 |    20  
  14 |    19  
  21 |    20  
  22 |    40  
  26 |    19  
  31 |    20  
  34 |    19  
(10 rows)  

For those who have read all, it is no longer recommended, because the selection list has been read all.

postgres=# select get_artid(1,20);  
 get_artid   
-----------  
 {}  
(1 row)  

performance

tps : 25623

Recommended return per second: 512460 ID

The recommendation section also has optimization space, such as when the user has read the entire selection list (the more read, the slower GET), to get the list will be slower (because the need to traverse the entire list ID, can not get 20 valid records, consume more, and finally return to NULL). (In this case, GET is about 20 milliseconds)

In this case, the suggestion can mark the user to indicate that all the content has been recommended this time (avoid GET, then the performance will rise dramatically, hundreds of thousands of TPS is no problem), and return to other content.

Posted by gva482 on Wed, 15 May 2019 10:36:58 -0700