Label
PostgreSQL, Arrays, Articles, Random Recommendations, Forums, E-Commerce
background
Content recommendation is a common requirement, such as forums, e-commerce, news clients, etc.
Simpler requirements: Edit some content ID s and generate a list of recommendations. (For example, generate a recommendation list like this every day.) Then randomly recommend to the user (while filtering the read content).
Higher recommendation requirements: should be based on different tastes, such as portraits of members themselves, classification. The server generates different recommendation lists for different tastes. Directional recommendation.
This paper introduces the practice of the first requirement. Using PostgreSQL, a medium-sized PG instance (28 cores) can easily reach the recommendation of 500,000 content throughput per second.
DEMO
Take the forum as an example, there are articles, a list of selected articles edited, members, members'reading records. When users open the page, they randomly recommend 20 articles according to the selected list (while filtering the read content).
Suppose the selection list has 2,000 documents and 10 million members.
1. List of articles
create table tbl_art ( artid int8, -- Article ID content text, -- Article content crt_time timestamp -- Article creation time -- ... -- Others, titles, authors,.... );
Membership form (omitted).
2. List of ID of Recommended Articles
create table tbl_art_list ( list_time timestamp primary key, -- List generation time artid int8[] not null, -- What articles are included?( ID),Using Array Storage min_crt_time timestamp not null, -- Of these articles, the oldest is the time. Derive from tbl_art.crt_time . Used to clean up the user's reading log. arrlen int not null -- artid Length (including several elements, i.e. selected articles) -- classid int If members have a category (label), you can create a selection list by category. );
3. Write the selected list and get the last list for recommendation every time you recommend.
If there is a directional requirement (recommended by the Member s tag, change the table tbl_art_list structure, add a CLASS field, and add a CLASS field to the membership table. When members open the page, they match the last list in tbl_art_list with CLASS.
As follows, 2000 selected article ID s are generated. A record.
insert into tbl_art_list values ( now(), array(select (random()*1000000)::int8 from generate_series(1,2000)), now(), 2000 ) ;
4. Readed Records
create table tbl_read_rec ( uid int8, -- Member ID crt_time timestamp, -- Reading time artid int8, -- Article ID primary key(uid,artid) -- Primary key (a document, once a member has been read, recorded only once) ); create index idx_crt_time_1 on tbl_read_rec (crt_time);
5. Random access to recommendation article ID
When the user opens the recommendation page, he enters the user ID and how many selected documents are GET (random GET from the selected list).
Returns an array of selected articles from GET and filters out random article ID s that have been read.
create or replace function get_artid( i_uid int8, -- user ID rows int -- How many randomly obtained articles ID ) returns int8[] as $$ declare v_artid int8[]; -- Selected ID list len int; -- Selected ID Number of Listed Articles res int8[] := (array[])::int8[]; -- Result tmp int8; -- Intermediate variables, selected from ID Random articles from the list ID loopi int := 0; -- Loop variables, how many qualified papers have been obtained ID loopx int := 0; -- Loop variables, how many times have they been looped (upper limit, depending on selection) ID Number of listing articles, such as 1.5 times len) begin select artid,arrlen into v_artid,len from tbl_art_list order by list_time desc limit 1; -- Get the last item from the Editor's Selected List. loop if loopi >= rows or loopx >= 1.5*len then -- Have you traversed all the selected articles? ID (Random traversal) return res; end if; tmp := v_artid[floor(random()*len)+1]; -- Selected articles IDs Get random ID perform 1 from tbl_read_rec where uid=i_uid and artid=tmp; -- Judge whether you have read if not found then res := array_append(res, tmp); -- Unread, APPEND To return the result loopi := loopi +1 ; -- Increase progressively end if; loopx := loopx +1 ; -- Increase progressively end loop; return res; end; $$ language plpgsql strict;
6. Cleaning up Reading Records
Use limit to clean up several items at a time.
Skp lock is used to support parallel DELETE.
delete from tbl_read_rec where ctid = any (array( select ctid from tbl_read_rec where crt_time < (select min_crt_time from tbl_art_list order by list_time desc limit 1) limit 10000 for update skip locked ));
7, test
GET Selected Article ID (Satisfies Random, Filters Readed).
The performance is OK.
postgres=# select get_artid(1,20); get_artid --------------------------------------------------------------------------------------------------------------------------------------------- {919755,3386,100126,761631,447551,511825,168645,211819,862572,330666,942247,600470,843042,511825,295568,829303,382312,452915,499113,164219} (1 row) Time: 0.377 ms postgres=# select get_artid(1,20); get_artid -------------------------------------------------------------------------------------------------------------------------------------------- {257929,796892,343984,363615,418506,326628,91731,958663,127918,794101,49124,410347,852461,922276,366815,926232,134506,153306,123694,67087} (1 row) Time: 0.347 ms
Assuming the random ID obtained, read it immediately and write all the records obtained. (for pressure measurement)
postgres=# insert into tbl_read_rec select 1, now(), unnest(get_artid(1,20)) on conflict do nothing; INSERT 0 20 Time: 0.603 ms postgres=# insert into tbl_read_rec select 1, now(), unnest(get_artid(1,20)) on conflict do nothing; INSERT 0 20 Time: 0.494 ms postgres=# insert into tbl_read_rec select 1, now(), unnest(get_artid(1,20)) on conflict do nothing; INSERT 0 20 Time: 0.479 ms postgres=# insert into tbl_read_rec select 1, now(), unnest(get_artid(1,20)) on conflict do nothing; INSERT 0 20 Time: 0.494 ms
8. Pressure measurement
vi test.sql \set uid random(1,10000000) insert into tbl_read_rec select :uid, now(), unnest(get_artid(:uid,20)) on conflict do nothing;
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 28 -j 28 -T 120
transaction type: ./test.sql scaling factor: 1 query mode: prepared number of clients: 28 number of threads: 28 duration: 120 s number of transactions actually processed: 3074866 latency average = 1.093 ms latency stddev = 0.577 ms tps = 25623.620112 (including connections establishing) tps = 25625.634577 (excluding connections establishing) statement latencies in milliseconds: 0.002 \set uid random(1,10000000) 1.091 insert into tbl_read_rec select :uid, now(), unnest(get_artid(:uid,20)) on conflict do nothing;
After 120 seconds manometry, the read records were expressed to 3GB.
postgres=# \dt+ List of relations Schema | Name | Type | Owner | Size | Description --------+--------------+-------+----------+------------+------------- public | tbl_art | table | postgres | 8192 bytes | public | tbl_art_list | table | postgres | 64 kB | public | tbl_read_rec | table | postgres | 3047 MB | (3 rows)
61.2 million records have been read.
postgres=# select count(*) from tbl_read_rec ; count ---------- 61206909 (1 row)
as follows
postgres=# select uid,count(*) from tbl_read_rec group by 1 limit 10; uid | count -----+------- 1 | 2000 6 | 20 11 | 20 12 | 20 14 | 19 21 | 20 22 | 40 26 | 19 31 | 20 34 | 19 (10 rows)
For those who have read all, it is no longer recommended, because the selection list has been read all.
postgres=# select get_artid(1,20); get_artid ----------- {} (1 row)
performance
tps : 25623
Recommended return per second: 512460 ID
The recommendation section also has optimization space, such as when the user has read the entire selection list (the more read, the slower GET), to get the list will be slower (because the need to traverse the entire list ID, can not get 20 valid records, consume more, and finally return to NULL). (In this case, GET is about 20 milliseconds)
In this case, the suggestion can mark the user to indicate that all the content has been recommended this time (avoid GET, then the performance will rise dramatically, hundreds of thousands of TPS is no problem), and return to other content.