Logical Structure of Database Cluster
- Database Cluster -- database cluster, which is a collection of a group of databases rather than multiple database servers
- Database -- database, a collection of database objects
- Database object -- database object, a data structure used to store query data Database cluster > Database > database object contains the relationship with the included.
- object identifiers (OIDs) - object identifiers. OIDs are stored in system catalog. OIDs of different object types are stored in different catalogs. The OID of database is stored in PG database; the OID of table is stored in PG class
- Database cluster,Database,Users,Schemas,Tablespace,Table In general, these logical objects are the relationship between use and being used, not subordinate relationship
postgres=# select oid,datname,dattablespace from pg_database; oid | datname | dattablespace -------+-----------+--------------- 13212 | postgres | 1663 1 | template1 | 1663 13211 | template0 | 1663 (3 rows) postgres=# select oid,relname,relowner from pg_class where relname='test_rep'; oid | relname | relowner -------+----------+---------- 57362 | test_rep | 49156 (1 row)
- Note 1: system catalog related description reference Official documents
- Note 2: OID is pseudo column
Physical Structure of Database Cluster
database cluster is a physical data base directory (PGDATA), including some subdirectories and files
tree -L 2 . ├── backup_label.old ├── base │ ├── 1 │ ├── 13211 │ └── 13212 ├── current_logfiles ├── global │ ├── 1136 │ ├── 1136_fsm │ ├── 1136_vm │ ├── pg_control │ ├── pg_filenode.map │ └── pg_internal.init ├── pg_commit_ts ├── pg_dynshmem ├── pg_hba.conf ├── pg_ident.conf ├── pg_log ├── pg_logical │ ├── mappings │ ├── replorigin_checkpoint │ └── snapshots ├── pg_multixact │ ├── members │ └── offsets ├── pg_notify │ └── 0000 ├── pg_replslot ├── pg_serial ├── pg_snapshots ├── pg_stat ├── pg_stat_tmp │ ├── db_0.stat │ ├── db_13212.stat │ └── global.stat ├── pg_subtrans │ └── 0000 ├── pg_tblspc │ ├── 16388 -> /opt/postgres/data/tb1 │ └── 49155 -> /tbs ├── pg_twophase ├── PG_VERSION ├── pg_wal │ ├── 000000090000000000000077 │ └── archive_status ├── pg_xact │ └── 0000 ├── postgresql.auto.conf ├── postgresql.auto.conf.bak20181230 ├── postgresql.conf ├── postmaster.opts ├── postmaster.pid ├── recovery.done ├── serverlog ├── tablespace_map.old └── tb1
Layout of a Database Cluster
- global -- a table that includes the range of clusters, such as PG ﹣ database
PG control -- control file for storing global control information PG ﹣ filenode.map -- hard mapping of OID of system table and specific file name pg_internal.init -- cache system files to speed up system table reading
- base -- subdirectory where the database is stored
1136 -- object data file. Each table and index is stored as a separate file. Use the filenode number command (pg_class.relfilenode) of the table or index 1136 ﹣ FSM -- FSM(free space map) pieces corresponding to data files. Map is used to identify which block s are free 1136 ﹣ VM -- in the VM(visibility map)PostgreSQL corresponding to the data file, the deletion or update is realized by marking "invalid" on the tuple header when performing multi version concurrency control. Finally, the free space of the effective data recovery is cleared through the vacuum function. When doing VACUUM, the VM is used to quickly find the block containing invalid tuples. VM is only a simple bitmap, a bit corresponds to a block.
- pg_hba.conf -- client network access control configuration file
- PG log -- default error log output location
- PG ﹣ logical -- stores the data state of logical decoding when configuring logical replication
- PG ﹣ tblspc -- store non default tablespace path in the form of soft connection
- PG? Wal -- wal log path
- postgresql.auto.conf -- parameter file modified with alter system command, with high priority, will overwrite postgresql.conf parameter value
- postgresql.conf -- parameter configuration file
- postmaster.opts -- a file that records the command-line options used last time the server was started
- postmaster.pid -- the file generated after starting pg, recording the information of starting pg
- tb1 -- non default tablespace Detailed description Official documents
View control file records
[pg@pg global]$ pg_controldata pg_control version number: 1002 Catalog version number: 201707211 Database system identifier: 6577238410286647636 Database cluster state: in production pg_control last modified: Wed 16 Jan 2019 01:52:26 PM CST Latest checkpoint location: 0/77271BC0 Prior checkpoint location: 0/77271B18 Latest checkpoint's REDO location: 0/77271BC0 Latest checkpoint's REDO WAL file: 000000090000000000000077 Latest checkpoint's TimeLineID: 9 Latest checkpoint's PrevTimeLineID: 9 Latest checkpoint's full_page_writes: on Latest checkpoint's NextXID: 0:681 Latest checkpoint's NextOID: 81960 Latest checkpoint's NextMultiXactId: 1 Latest checkpoint's NextMultiOffset: 0 Latest checkpoint's oldestXID: 548 Latest checkpoint's oldestXID's DB: 1 Latest checkpoint's oldestActiveXID: 0 Latest checkpoint's oldestMultiXid: 1 Latest checkpoint's oldestMulti's DB: 1 Latest checkpoint's oldestCommitTsXid:0 Latest checkpoint's newestCommitTsXid:0 Time of latest checkpoint: Wed 16 Jan 2019 01:52:26 PM CST Fake LSN counter for unlogged rels: 0/1 Minimum recovery ending location: 0/0 Min recovery ending locs timeline: 0 Backup start location: 0/0 Backup end location: 0/0 End-of-backup record required: no wal_level setting: replica wal_log_hints setting: on max_connections setting: 100 max_worker_processes setting: 8 max_prepared_xacts setting: 0 max_locks_per_xact setting: 64 track_commit_timestamp setting: off Maximum data alignment: 8 Database block size: 8192 Blocks per segment of large relation: 131072 WAL block size: 8192 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Maximum size of a TOAST chunk: 1996 Size of a large-object chunk: 2048 Date/time type storage: 64-bit integers Float4 argument passing: by value Float8 argument passing: by value Data page checksum version: 0 Mock authentication nonce: d689960231e71d87b4ea9a8324398a84ce89709786d3c2cb1a18bbede989b1db
View PG ﹣ filenode.map record
[pg@pg global]$ hexdump pg_filenode.map 0000000 2717 0059 0023 0000 04ee 0000 04ee 0000 0000010 0b94 0000 0b94 0000 04bd 0000 04bd 0000 0000020 0470 0000 0470 0000 04ec 0000 04ec 0000 0000030 04ed 0000 04ed 0000 04be 0000 04be 0000 0000040 095c 0000 095c 0000 1770 0000 1770 0000 0000050 0e08 0000 0e08 0000 17d4 0000 17d4 0000 0000060 0b1e 0000 0b1e 0000 0b1f 0000 0b1f 0000 0000070 0b96 0000 0b96 0000 0b97 0000 0b97 0000 0000080 0fdc 0000 0fdc 0000 0fdd 0000 0fdd 0000 0000090 0a74 0000 0a74 0000 0a75 0000 0a75 0000 00000a0 0a86 0000 0a86 0000 0a87 0000 0a87 0000 00000b0 0a6f 0000 0a6f 0000 0a70 0000 0a70 0000 00000c0 095d 0000 095d 0000 0471 0000 0471 0000 00000d0 04d0 0000 04d0 0000 04d1 0000 04d1 0000 00000e0 0a89 0000 0a89 0000 0a8a 0000 0a8a 0000 00000f0 0b95 0000 0b95 0000 0e09 0000 0e09 0000 0000100 1771 0000 1771 0000 1772 0000 1772 0000 0000110 17e2 0000 17e2 0000 17e3 0000 17e3 0000 0000120 0000 0000 0000 0000 0000 0000 0000 0000 * 00001f0 0000 0000 0000 0000 9ee5 3d9a 0000 0000 0000200
Layout of Databases
The database is stored in the base directory, and the database directory name is named after OIDs
tree -L 1 base base ├── 1 ├── 13211 └── 13212
Data file layout structure
- OID and relfilenode Table or index, stored in the directory of the database, inside the database, each table or index has its own OID. Outside the database, a table or index represents the physical file location through a variable relfilenode.
View PG class to get relfilenode and oid information
postgres=# select relname,relfilenode,oid,relnamespace from pg_class where relname='test_table'; relname | relfilenode | oid | relnamespace ------------+-------------+-------+-------------- test_table | 81961 | 81961 | 81960 (1 row) //Note 1: relfilenode Variable, not always with OID Match, such as truncate The command will change relfilenode postgres=> truncate table test_table; postgres=> select relname,relfilenode,oid,relnamespace from pg_class where relname='test_table'; relname | relfilenode | oid | relnamespace ------------+-------------+-------+-------------- test_table | 81967 | 81961 | 81960 (1 row)
You can find the object name corresponding to oid through pg_class
postgres=# select oid,relname from pg_class where oid in ( 2610,1255); oid | relname ------+---------- 1255 | pg_proc 2610 | pg_index (2 rows)
View physical files
Through built-in functions pg_relation_filepath(relation regclass)The corresponding physical file can be found postgres=> SELECT pg_relation_filepath('test_table'); pg_relation_filepath ---------------------- base/13212/81967 (1 row) [pg@pg ~]$ ls -atl $PGDATA/base/13212/81967 -rw------- 1 pg pg 0 Jan 24 14:01 /opt/postgres/data/base/13212/81967
Where is the physical location of the system table? What is the corresponding oid?
PG For these catalog Table, put it into a file to manage oid And relfileno Do mapping. This document is famous pg_filenode.map". SELECT pg_relation_filepath('pg_class'); pg_relation_filepath ---------------------- base/14000/1259 (1 row) 1259 [highgo@dbrs 14000]$ hexdump pg_filenode.map 0000000 2717 0059 000f 0000 04eb 0000 04eb 0000 0000010 04e1 0000 04e1 0000 04e7 0000 04e7 0000 0000020 04df 0000 04df 0000 0b14 0000 0b14 0000 0000030 0b15 0000 0b15 0000 0a62 0000 0a62 0000 0000040 0a63 0000 0a63 0000 0a66 0000 0a66 0000 0000050 0a67 0000 0a67 0000 0d7f 0000 0d7f 0000 0000060 0a82 0000 0a82 0000 0a83 0000 0a83 0000 0000070 0a8f 0000 0a8f 0000 0a90 0000 0a90 0000 0000080 0000 0000 0000 0000 0000 0000 0000 0000 * 00001f0 0000 0000 0000 0000 9965 3aa5 0000 0000 0000200 [highgo@dbrs 14000]$ 04eb =1259
Data file size
- When a table or index exceeds 1 GB, it is divided into G-size segments. The file names are relfilenode, relfilenode.1, relfilenode.2 Note: the maximum file size of tables and indexes can be changed using the configuration, option -- with segsize when building PostgreSQL
- Each data file has two related suffix files,''fsm 'and' VM ', respectively corresponding to free space map and visibility map, and the index data file only has''fsm'.
- Unlogged tables and indexes have a third fork, known as the initialization fork, which is stored in a fork with the suffix _init
See
postgres=# create unlogged table test_unlog(a int); CREATE TABLE postgres=# insert into test_unlog select generate_series(1,1000); INSERT 0 1000 postgres=# SELECT pg_relation_filepath('test_unlog'); pg_relation_filepath ---------------------- base/13212/102776 (1 row) [pg@pg 13212]$ ls -atl 102776* -rw------- 1 pg pg 40960 Mar 20 17:35 102776 -rw------- 1 pg pg 24576 Mar 20 17:35 102776_fsm -rw------- 1 pg pg 0 Mar 20 17:35 102776_init
Physical file space size
pg_total_relation_size(regclass) bigint Total disk space used by the specified table, including all indexes and TOAST data pg_table_size(regclass) bigint Disk space used by the specified table, excluding indexes (but including TOAST, free space map, and visibility map) pg_indexes_size(regclass) bigint Total disk space used by indexes attached to the specified table pg_relation_size(relation regclass, fork text) bigint Disk space used by the specified fork ('main', 'fsm', 'vm', or 'init') of the specified table or index postgres=> select pg_size_pretty(pg_table_size('test_toast')); pg_size_pretty ---------------- 4016 kB (1 row)
Tablespace
tablespace -- independent of base directory, it is an additional data store data area The following two storage areas represent
- cluster>'base'>database>object
- cluster>'tablespace'>[database]>object
When the tablespace path is created in tablespace, its subdirectories will be created automatically under the directory. The format is as follows PG_'Major version'_'Catalogue version number'
[pg@pg /]$ tree -L 1 /tbs /tbs └── PG_10_201707211
The types of tablespaces are: default, system shared, and custom
postgres=# select oid,spcname,pg_tablespace_location(oid) from pg_tablespace; oid | spcname | pg_tablespace_location -------+------------+------------------------ 1663 | pg_default | 1664 | pg_global | 16388 | tb1 | /opt/postgres/data/tb1 49155 | pitrtbs | /tbs (4 rows)
All custom tablespace directories will create symbolic links in $PGDATA/pg_tblspc with the same link name as the OID of the tablespace
select oid,spcname,spcowner,spcoptions from pg_tablespace;
[pg@pg /]$ tree -L 1 $PGDATA/pg_tblspc /opt/postgres/data/pg_tblspc ├── 16388 -> /opt/postgres/data/tb1 └── 49155 -> /tbs
If the created table belongs to the database under the base, a directory with the same name as the existing database oid will be created under the tablespace, and then the data files will be placed in this directory
[pg@pg /]$ tree -L 3 /tbs /tbs └── PG_10_201707211 └── 13212 ├── 49163 ├── 49166 └── 49168
oid2name
oid2name — resolve OIDs and file nodes in a PostgreSQL data directory Use
[pg@pg base]$ oid2name All databases: Oid Database Name Tablespace ---------------------------------- 13212 postgres pg_default 13211 template0 pg_default 1 template1 pg_default [pg@pg base]$ oid2name -s All tablespaces: Oid Tablespace Name ------------------------ 1663 pg_default 1664 pg_global 16388 tb1 49155 pitrtbs [pg@pg 13212]$ pwd /opt/postgres/data/base/13212 [pg@pg 13212]$ ls -atl |head -5 total 14772 drwx------ 2 pg pg 8192 Feb 27 09:51 . -rw------- 1 pg pg 136164 Feb 27 09:51 pg_internal.init -rw------- 1 pg pg 32768 Feb 26 09:00 2610 -rw------- 1 pg pg 606208 Feb 23 11:23 1255 [pg@pg 13212]$ [pg@pg 13212]$ oid2name -d postgres -f 2610 -f 1255 From database "postgres": Filenode Table Name ---------------------- 2610 pg_index 1255 pg_proc
3 internal structure of data file
- page -- the smallest storage unit in the database. Within the data file (heap table, index, free space map and visibility map), the space is allocated to fixed length pages. The default size is 8192 bytes (8 KB)
- Block numbers -- each page is numbered sequentially from 0 (the number is called block numbers). If the page is full, PostgreSQL will add a new empty page at the end of the file to increase the file size.
block0 | |||
---|---|---|---|
header_data(24byte) | pg_lsn | xxx | xxx |
xxx | pg_lower | pg_upper | xxx |
line_pointer_1(4byte) | line_pointer_2 | xxx | |
freespace | |||
freespace | |||
xxx | heap_tuple_2 | heap_tuple_1 |
page structure includes three parts
- header data -- in the page header, the length of 24 bytes is used to record the metadata information of the page
PG ﹣ lsn -- stores the lsn information of the wal log when the latest page changes pg_checksum -- store the page verification value PG ﹣ lower, PG ﹣ upper -- PD ﹣ lower points to the tail of line pointer; PD ﹣ upper points to the head of the latest heap tuple pd_special -- mainly used for index, in the table, point to the last page
- line pointer(s) - after header data. The length of the row pointer is 4 bytes, which is used to save the pointer to each heap tuple. It is the index of the heap tuple. Every tuple will have a line pointer
- heap tuple(s) -- used to store data, starting from the bottom of the page (row data)
tuple identifier (TID) - to identify tuples in a table, a tuple identifier (TID) is used internally. tid contains a pair of values, the block numbers of the tuple's page, and the offset numbers of the line pointer pointing to the tuple. This is a typical index usage for finding tuple data. In addition, the heap tuple has a total size of more than 2KB (about 1/4 8KB), which is managed by a toast (the overarching attribute storage technology)
- From the end of line pointer to the head of the latest heap tuple, the middle part is called free space
Reading and writing methods of 4 tuple
How does the database read and write? A simple description is given below, which will be described in more detail later combined with relevant contents Suppose a table contains only one page, and a page contains only one heap tuple
write
At present, the structure is as follows | block24 | | | | | --- | --- | --- | --- | | header_data(24byte) | pg_lsn | xxx | xxx | | xxx | pg_lower | pg_upper | | | | line_pointer_1(4byte) | | | | | freespace | | | | | | freespace | | | | xxx | xxx|heap_tuple_1| |
As shown in the figure, assign a page (8192 bytes) with the serial number of block 24,
- The first 24 bytes of the page store header data, the next 4 bytes store line pointer 1, the middle space is free space, and the first heap tuple 1 is stored at the end of the page
- line pointer 1 points to heap tuple 1
- PG lower points to line pointer 1
- PG > upper points to heap tuple 1
Now insert a row of data, the structure is as follows | block24 | | | | | --- | --- | --- | --- | | header_data(24byte) | pg_lsn | xxx | xxx | | xxx | pg_lower | pg_upper | xxx | | | line_pointer_1(4byte) | line_pointer_2 | xxx | | | freespace | | | | | | freespace | | | | xxx | heap_tuple_2|heap_tuple_1| |
- Allocate new 4 bytes after line pointer 1 to store line pointer 2, and store the latest heap tuple 2 before heap tuple 1
- line pointer 1 points to heap tuple 1
- line pointer 2 points to heap tuple 2
- PG [lower] points to line pointer 2
- PG? Upper points to heap tuple 2
4.2 read: it can be divided into two types: sequential read and index read
- 1 Sequential scan
- 2 index scan -- TID value of the obtained index tuple is' (block = 24, offset = 2) '
4.2.1 Sequential scan
The structure is as follows | block24 | | | | | --- | --- | --- | --- | | header_data(24byte) | pg_lsn | xxx | xxx | | xxx | pg_lower | pg_upper | xxx | | | line_pointer_1(4byte) | line_pointer_2 | xxx | | | freespace | | | | | | freespace | | | | xxx | heap_tuple_2|heap_tuple_1| |
select * from test_table;
here
- Scan page of block 1, scan line pointer 1...line pointer n
- Scan the page of block 2, scan line pointer 1...line pointer n
- ......
- Scan the page of block 24, scan line pointer 1 to find the required data heap tuple1
4.2.2 index reading: B-tree index scan
The structure is as follows | block24 | | | | | --- | --- | --- | --- | | header_data(24byte) | pg_lsn | xxx | xxx | | xxx | pg_lower | pg_upper | xxx | | | line_pointer_1(4byte) | line_pointer_2 | xxx | | | freespace | | | | | | freespace | | | | xxx | heap_tuple_2|heap_tuple_1| |
select * from test_table where id=244;
here Directly through the index tuple record '(block = 24, Offset = 2)), find the page of block 24, read line pointer 2, and find the desired heap tuple 2
Reference resources
- The pgstattuple module provides a variety of functions to obtain tuple level statistics.
- The pageinspect module provides functions that let you view the contents of database pages from a low level
- Official documents