Label
PostgreSQL , 10.0 , hash index
background
hash index is a very old index access method in PostgreSQL, and it is also a very classic index.
The hash value of the index field is stored in the hash index, not the original value. The original value is stored in the btree index.
Therefore, when the field is very large, the btree index may not be available.
for example
postgres=# create table test_hash_btree(c1 text); CREATE TABLE postgres=# insert into test_hash_btree values (repeat(random()::text,10000)); INSERT 0 1 postgres=# create index idx on test_hash_btree (c1); CREATE INDEX postgres=# insert into test_hash_btree values (repeat(random()::text,100000)); ERROR: index row requires 19504 bytes, maximum size is 8191 postgres=# drop index idx; DROP INDEX postgres=# create index idx on test_hash_btree using hash(c1); WARNING: hash indexes are not WAL-logged and their use is discouraged CREATE INDEX postgres=# insert into test_hash_btree values (repeat(random()::text,100000)); INSERT 0 1 postgres=# insert into test_hash_btree values (repeat(random()::text,10000000)); INSERT 0 1
In this case, you can use hash index, and because hash index stores hash, it can only be used as an equivalent query.
Unless there is a hash function that guarantees consistency between the hashed value and the pre-hashed value sequence. Otherwise, hash index cannot support queries for sorting, >, <, >=, <=.
Structure of hash index
Hash index contains four kinds of pages, meta page, primary bucket page, overflow page, bitmap page.
metapage (page 0) contains control information for the HASH index to guide how to find other pages (primary page s for each bucket) and the current storage overview. Page 0 of other indexes is basically this routine.
primary bucket page, hash index divides storage into buckets (logical concepts). Each bucket contains several pages (the number of pages per bucket does not need to be consistent). When inserting data, according to the calculated hash, the data is mapped to a bucket by mapping algorithm. That is to say, the data first knows which bucket to insert, and then inserts primar into the bucket. Y page, if primary page space is insufficient, overflow page is extended and data is written to overflow page.
In pages, data is stored in order (TREE), binary search is supported in pages, and there is no guarantee of order between pages, so hash index does not support order by.
Overflow page is the page in bucket. When primary page does not have enough space, the extended block is called overflow page.
bimap page, record primary, overflow page is empty and can be reused.
Note that bucket and page do not provide shrinkage, i.e. they cannot shrink space from OS, but they provide reuse (tracked by bitmap page).
10.0 Hash Index Enhancement
1. Cache Hash Index meta page
Caching hash index meta page pages into private memory of backend process reduces access to meta page.
2. Concurrent Hash Indexes
Enhance query performance substantially, nearly double that of BTREE.
Patch_Ver/Client count 1 8 16 32 64 72 80 88 96 128 HEAD-Btree 19397 122488 194433 344524 519536 527365 597368 559381 614321 609102 HEAD-Hindex 18539 141905 218635 363068 512067 522018 492103 484372 440265 393231 Patch 22504 146937 235948 419268 637871 637595 674042 669278 683704 639967 % improvement between HEAD-Hash index vs Patch and HEAD-Btree index vs Patch-Hash index is: Head-Hash vs Patch 21.38 3.5 7.9 15.47 24.56 22.14 36.97 38.17 55.29 62.74 Head-Btree vs. Patch 16.01 19.96 21.35 21.69 22.77 20.9 12.83 19.64 11.29 5.06 This data shows that patch improves the performance of hash index upto 62.74 and it also makes hash-index faster than btree-index by ~20% (most client counts show the performance improvement in the range of 15~20%. For the matter of comparison with btree, I think the impact of performance improvement of hash index will be more when the data doesn't fit shared buffers and the performance data for same is as below: Data doesn't fits in shared buffers scale_factor - 3000 shared_buffers - 8GB Client_Count/Patch 16 64 96 Head-Btree 170042 463721 520656 Patch-Hash 227528 603594 659287 % diff 33.8 30.16 26.62 The performance with hash-index is ~30% better than Btree. Note, that for now, I have not taken the data for HEAD- Hash index. I think there will many more cases like when hash index is on char (20) column where the performance of hash-index can be much better than btree-index for equal to searches. Note that this patch is a very-much WIP patch and I am posting it mainly to facilitate the discussion. Currently, it doesn't have any code to perform incomplete splits, the logic for locking/pins during Insert is yet to be done and many more things
For this discussion of patch, see Mail Group, URL at the end of this article.
PostgreSQL community style is very rigorous, a patch may be discussed in the mail group for several months or even years, according to your opinion repeated amendments, patch merged into master is very mature, so the stability of PostgreSQL is well known.
Reference resources
https://commitfest.postgresql.org/13/715/
https://commitfest.postgresql.org/10/647/
src/backend/access/hash/README