PostgreSQL database architecture storage structure

Keywords: Database PostgreSQL network Attribute

Logical Structure of Database Cluster

Database Cluster -- database cluster, which is a collection of a group of databases rather than multiple database servers

Database -- database, a collection of database objects

Database object -- database object, a data structure used to store query data Database cluster > Database > database object contains the relationship with the included.

object identifiers (OIDs) - object identifiers. OIDs are stored in system catalog. OIDs of different object types are stored in different catalogs. The OID of database is stored in PG database; the OID of table is stored in PG class

Database cluster,Database,Users,Schemas,Tablespace,Table In general, these logical objects are the relationship between use and being used, not subordinate relationship

postgres=# select oid,datname,dattablespace from pg_database;
oid | datname | dattablespace
-------+-----------+---------------
13212 | postgres | 1663
1 | template1 | 1663
13211 | template0 | 1663
(3 rows)
postgres=# select oid,relname,relowner from pg_class where relname='test_rep';
oid | relname | relowner
-------+----------+----------
57362 | test_rep | 49156
(1 row)

Note 1: system catalog related description reference Official documents
Note 2: OID is pseudo column

Physical Structure of Database Cluster

database cluster is a physical data base directory (PGDATA), including some subdirectories and files

tree -L 2
.
├── backup_label.old
├── base
│   ├── 1
│   ├── 13211
│   └── 13212
├── current_logfiles
├── global
│   ├── 1136
│   ├── 1136_fsm
│   ├── 1136_vm
│   ├── pg_control
│   ├── pg_filenode.map
│   └── pg_internal.init
├── pg_commit_ts
├── pg_dynshmem
├── pg_hba.conf
├── pg_ident.conf
├── pg_log
├── pg_logical
│   ├── mappings
│   ├── replorigin_checkpoint
│   └── snapshots
├── pg_multixact
│   ├── members
│   └── offsets
├── pg_notify
│   └── 0000
├── pg_replslot
├── pg_serial
├── pg_snapshots
├── pg_stat
├── pg_stat_tmp
│   ├── db_0.stat
│   ├── db_13212.stat
│   └── global.stat
├── pg_subtrans
│   └── 0000
├── pg_tblspc
│   ├── 16388 -> /opt/postgres/data/tb1
│   └── 49155 -> /tbs
├── pg_twophase
├── PG_VERSION
├── pg_wal
│   ├── 000000090000000000000077
│   └── archive_status
├── pg_xact
│   └── 0000
├── postgresql.auto.conf
├── postgresql.auto.conf.bak20181230
├── postgresql.conf
├── postmaster.opts
├── postmaster.pid
├── recovery.done
├── serverlog
├── tablespace_map.old
└── tb1

Layout of a Database Cluster

global -- a table that includes the range of clusters, such as PG ﹣ database

PG control -- control file for storing global control information PG ﹣ filenode.map -- hard mapping of OID of system table and specific file name pg_internal.init -- cache system files to speed up system table reading

base -- subdirectory where the database is stored

1136 -- object data file. Each table and index is stored as a separate file. Use the filenode number command (pg_class.relfilenode) of the table or index 1136 ﹣ FSM -- FSM(free space map) pieces corresponding to data files. Map is used to identify which block s are free 1136 ﹣ VM -- in the VM(visibility map)PostgreSQL corresponding to the data file, the deletion or update is realized by marking "invalid" on the tuple header when performing multi version concurrency control. Finally, the free space of the effective data recovery is cleared through the vacuum function. When doing VACUUM, the VM is used to quickly find the block containing invalid tuples. VM is only a simple bitmap, a bit corresponds to a block.

pg_hba.conf -- client network access control configuration file
PG log -- default error log output location
PG ﹣ logical -- stores the data state of logical decoding when configuring logical replication
PG ﹣ tblspc -- store non default tablespace path in the form of soft connection
PG? Wal -- wal log path
postgresql.auto.conf -- parameter file modified with alter system command, with high priority, will overwrite postgresql.conf parameter value

postgresql.conf -- parameter configuration file
postmaster.opts -- a file that records the command-line options used last time the server was started
postmaster.pid -- the file generated after starting pg, recording the information of starting pg
tb1 -- non default tablespace Detailed description Official documents

View control file records

[pg@pg global]$ pg_controldata
pg_control version number: 1002
Catalog version number: 201707211
Database system identifier: 6577238410286647636
Database cluster state: in production
pg_control last modified: Wed 16 Jan 2019 01:52:26 PM CST
Latest checkpoint location: 0/77271BC0
Prior checkpoint location: 0/77271B18
Latest checkpoint's REDO location: 0/77271BC0
Latest checkpoint's REDO WAL file: 000000090000000000000077
Latest checkpoint's TimeLineID: 9
Latest checkpoint's PrevTimeLineID: 9
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:681
Latest checkpoint's NextOID: 81960
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 548
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Wed 16 Jan 2019 01:52:26 PM CST
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending locs timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce: d689960231e71d87b4ea9a8324398a84ce89709786d3c2cb1a18bbede989b1db

View PG ﹣ filenode.map record

[pg@pg global]$ hexdump pg_filenode.map
0000000 2717 0059 0023 0000 04ee 0000 04ee 0000
0000010 0b94 0000 0b94 0000 04bd 0000 04bd 0000
0000020 0470 0000 0470 0000 04ec 0000 04ec 0000
0000030 04ed 0000 04ed 0000 04be 0000 04be 0000
0000040 095c 0000 095c 0000 1770 0000 1770 0000
0000050 0e08 0000 0e08 0000 17d4 0000 17d4 0000
0000060 0b1e 0000 0b1e 0000 0b1f 0000 0b1f 0000
0000070 0b96 0000 0b96 0000 0b97 0000 0b97 0000
0000080 0fdc 0000 0fdc 0000 0fdd 0000 0fdd 0000
0000090 0a74 0000 0a74 0000 0a75 0000 0a75 0000
00000a0 0a86 0000 0a86 0000 0a87 0000 0a87 0000
00000b0 0a6f 0000 0a6f 0000 0a70 0000 0a70 0000
00000c0 095d 0000 095d 0000 0471 0000 0471 0000
00000d0 04d0 0000 04d0 0000 04d1 0000 04d1 0000
00000e0 0a89 0000 0a89 0000 0a8a 0000 0a8a 0000
00000f0 0b95 0000 0b95 0000 0e09 0000 0e09 0000
0000100 1771 0000 1771 0000 1772 0000 1772 0000
0000110 17e2 0000 17e2 0000 17e3 0000 17e3 0000
0000120 0000 0000 0000 0000 0000 0000 0000 0000
*
00001f0 0000 0000 0000 0000 9ee5 3d9a 0000 0000
0000200

Layout of Databases

The database is stored in the base directory, and the database directory name is named after OIDs

tree -L 1 base
base
├── 1
├── 13211
└── 13212

Data file layout structure

OID and relfilenode Table or index, stored in the directory of the database, inside the database, each table or index has its own OID. Outside the database, a table or index represents the physical file location through a variable relfilenode.

View PG class to get relfilenode and oid information

postgres=# select relname,relfilenode,oid,relnamespace from pg_class where relname='test_table';
relname | relfilenode | oid | relnamespace
------------+-------------+-------+--------------
test_table | 81961 | 81961 | 81960
(1 row)
//Note 1:
relfilenode Variable, not always with OID Match, such as truncate The command will change relfilenode
postgres=> truncate table test_table;
postgres=> select relname,relfilenode,oid,relnamespace from pg_class where relname='test_table';
relname | relfilenode | oid | relnamespace
------------+-------------+-------+--------------
test_table | 81967 | 81961 | 81960
(1 row)

You can find the object name corresponding to oid through pg_class

postgres=# select oid,relname from pg_class where oid in ( 2610,1255);
oid | relname
------+----------
1255 | pg_proc
2610 | pg_index
(2 rows)

View physical files

Through built-in functions pg_relation_filepath(relation regclass)The corresponding physical file can be found
postgres=> SELECT pg_relation_filepath('test_table');
pg_relation_filepath
----------------------
base/13212/81967
(1 row)
[pg@pg ~]$ ls -atl $PGDATA/base/13212/81967
-rw------- 1 pg pg 0 Jan 24 14:01 /opt/postgres/data/base/13212/81967

Where is the physical location of the system table? What is the corresponding oid?

PG For these catalog Table, put it into a file to manage oid And relfileno Do mapping. This document is famous pg_filenode.map". 

SELECT pg_relation_filepath('pg_class');
 pg_relation_filepath
----------------------
 base/14000/1259
(1 row)

1259

[highgo@dbrs 14000]$ hexdump pg_filenode.map
0000000 2717 0059 000f 0000 04eb 0000 04eb 0000
0000010 04e1 0000 04e1 0000 04e7 0000 04e7 0000
0000020 04df 0000 04df 0000 0b14 0000 0b14 0000
0000030 0b15 0000 0b15 0000 0a62 0000 0a62 0000
0000040 0a63 0000 0a63 0000 0a66 0000 0a66 0000
0000050 0a67 0000 0a67 0000 0d7f 0000 0d7f 0000
0000060 0a82 0000 0a82 0000 0a83 0000 0a83 0000
0000070 0a8f 0000 0a8f 0000 0a90 0000 0a90 0000
0000080 0000 0000 0000 0000 0000 0000 0000 0000
*
00001f0 0000 0000 0000 0000 9965 3aa5 0000 0000
0000200
[highgo@dbrs 14000]$
04eb =1259

Data file size

When a table or index exceeds 1 GB, it is divided into G-size segments. The file names are relfilenode, relfilenode.1, relfilenode.2 Note: the maximum file size of tables and indexes can be changed using the configuration, option -- with segsize when building PostgreSQL
Each data file has two related suffix files,''fsm 'and' VM ', respectively corresponding to free space map and visibility map, and the index data file only has''fsm'.
Unlogged tables and indexes have a third fork, known as the initialization fork, which is stored in a fork with the suffix _init

See

postgres=# create unlogged table test_unlog(a int);
CREATE TABLE
postgres=# insert into test_unlog select generate_series(1,1000);
INSERT 0 1000
postgres=# SELECT pg_relation_filepath('test_unlog');
pg_relation_filepath
----------------------
base/13212/102776
(1 row)
[pg@pg 13212]$ ls -atl 102776*
-rw------- 1 pg pg 40960 Mar 20 17:35 102776
-rw------- 1 pg pg 24576 Mar 20 17:35 102776_fsm
-rw------- 1 pg pg 0 Mar 20 17:35 102776_init

Physical file space size

pg_total_relation_size(regclass)
bigint
Total disk space used by the specified table, including all indexes and TOAST data
pg_table_size(regclass)
bigint
Disk space used by the specified table, excluding indexes (but including TOAST, free space map, and visibility map)
pg_indexes_size(regclass)
bigint
Total disk space used by indexes attached to the specified table
pg_relation_size(relation regclass, fork text)
bigint
Disk space used by the specified fork ('main', 'fsm', 'vm', or 'init') of the specified table or index


postgres=> select pg_size_pretty(pg_table_size('test_toast'));
pg_size_pretty
----------------
4016 kB
(1 row)

Tablespace

tablespace -- independent of base directory, it is an additional data store data area The following two storage areas represent

cluster>'base'>database>object
cluster>'tablespace'>[database]>object

When the tablespace path is created in tablespace, its subdirectories will be created automatically under the directory. The format is as follows PG_'Major version'_'Catalogue version number'

[pg@pg /]$ tree -L 1 /tbs
/tbs
└── PG_10_201707211

The types of tablespaces are: default, system shared, and custom

postgres=# select oid,spcname,pg_tablespace_location(oid) from pg_tablespace;
oid | spcname | pg_tablespace_location
-------+------------+------------------------
1663 | pg_default |
1664 | pg_global |
16388 | tb1 | /opt/postgres/data/tb1
49155 | pitrtbs | /tbs
(4 rows)

All custom tablespace directories will create symbolic links in $PGDATA/pg_tblspc with the same link name as the OID of the tablespace

select oid,spcname,spcowner,spcoptions from pg_tablespace;

[pg@pg /]$ tree -L 1 $PGDATA/pg_tblspc
/opt/postgres/data/pg_tblspc
├── 16388 -> /opt/postgres/data/tb1
└── 49155 -> /tbs

If the created table belongs to the database under the base, a directory with the same name as the existing database oid will be created under the tablespace, and then the data files will be placed in this directory

[pg@pg /]$ tree -L 3 /tbs
/tbs
└── PG_10_201707211
└── 13212
├── 49163
├── 49166
└── 49168

oid2name

oid2name — resolve OIDs and file nodes in a PostgreSQL data directory Use

[pg@pg base]$ oid2name
All databases:
Oid Database Name Tablespace
----------------------------------
13212 postgres pg_default
13211 template0 pg_default
1 template1 pg_default
[pg@pg base]$ oid2name -s
All tablespaces:
Oid Tablespace Name
------------------------
1663 pg_default
1664 pg_global
16388 tb1
49155 pitrtbs
[pg@pg 13212]$ pwd
/opt/postgres/data/base/13212
[pg@pg 13212]$ ls -atl |head -5
total 14772
drwx------ 2 pg pg 8192 Feb 27 09:51 .
-rw------- 1 pg pg 136164 Feb 27 09:51 pg_internal.init
-rw------- 1 pg pg 32768 Feb 26 09:00 2610
-rw------- 1 pg pg 606208 Feb 23 11:23 1255
[pg@pg 13212]$
[pg@pg 13212]$ oid2name -d postgres -f 2610 -f 1255
From database "postgres":
Filenode Table Name
----------------------
2610 pg_index
1255 pg_proc

3 internal structure of data file

page -- the smallest storage unit in the database. Within the data file (heap table, index, free space map and visibility map), the space is allocated to fixed length pages. The default size is 8192 bytes (8 KB)
Block numbers -- each page is numbered sequentially from 0 (the number is called block numbers). If the page is full, PostgreSQL will add a new empty page at the end of the file to increase the file size.

block0
header_data(24byte)	pg_lsn	xxx	xxx
xxx	pg_lower	pg_upper	xxx
	line_pointer_1(4byte)	line_pointer_2	xxx
	freespace
		freespace
	xxx	heap_tuple_2	heap_tuple_1

page structure includes three parts

header data -- in the page header, the length of 24 bytes is used to record the metadata information of the page

PG ﹣ lsn -- stores the lsn information of the wal log when the latest page changes pg_checksum -- store the page verification value PG ﹣ lower, PG ﹣ upper -- PD ﹣ lower points to the tail of line pointer; PD ﹣ upper points to the head of the latest heap tuple pd_special -- mainly used for index, in the table, point to the last page

line pointer(s) - after header data. The length of the row pointer is 4 bytes, which is used to save the pointer to each heap tuple. It is the index of the heap tuple. Every tuple will have a line pointer

heap tuple(s) -- used to store data, starting from the bottom of the page (row data)

tuple identifier (TID) - to identify tuples in a table, a tuple identifier (TID) is used internally. tid contains a pair of values, the block numbers of the tuple's page, and the offset numbers of the line pointer pointing to the tuple. This is a typical index usage for finding tuple data. In addition, the heap tuple has a total size of more than 2KB (about 1/4 8KB), which is managed by a toast (the overarching attribute storage technology)

From the end of line pointer to the head of the latest heap tuple, the middle part is called free space

Reading and writing methods of 4 tuple

How does the database read and write? A simple description is given below, which will be described in more detail later combined with relevant contents Suppose a table contains only one page, and a page contains only one heap tuple

write

At present, the structure is as follows | block24 | | | | | --- | --- | --- | --- | | header_data(24byte) | pg_lsn | xxx | xxx | | xxx | pg_lower | pg_upper | | | | line_pointer_1(4byte) | | | | | freespace | | | | | | freespace | | | | xxx | xxx|heap_tuple_1| |

As shown in the figure, assign a page (8192 bytes) with the serial number of block 24,

The first 24 bytes of the page store header data, the next 4 bytes store line pointer 1, the middle space is free space, and the first heap tuple 1 is stored at the end of the page
line pointer 1 points to heap tuple 1
PG lower points to line pointer 1
PG > upper points to heap tuple 1

Now insert a row of data, the structure is as follows | block24 | | | | | --- | --- | --- | --- | | header_data(24byte) | pg_lsn | xxx | xxx | | xxx | pg_lower | pg_upper | xxx | | | line_pointer_1(4byte) | line_pointer_2 | xxx | | | freespace | | | | | | freespace | | | | xxx | heap_tuple_2|heap_tuple_1| |

Allocate new 4 bytes after line pointer 1 to store line pointer 2, and store the latest heap tuple 2 before heap tuple 1
line pointer 1 points to heap tuple 1
line pointer 2 points to heap tuple 2
PG [lower] points to line pointer 2
PG? Upper points to heap tuple 2

4.2 read: it can be divided into two types: sequential read and index read

1 Sequential scan
2 index scan -- TID value of the obtained index tuple is' (block = 24, offset = 2) '

4.2.1 Sequential scan

The structure is as follows | block24 | | | | | --- | --- | --- | --- | | header_data(24byte) | pg_lsn | xxx | xxx | | xxx | pg_lower | pg_upper | xxx | | | line_pointer_1(4byte) | line_pointer_2 | xxx | | | freespace | | | | | | freespace | | | | xxx | heap_tuple_2|heap_tuple_1| |

select * from test_table;

here

Scan page of block 1, scan line pointer 1...line pointer n
Scan the page of block 2, scan line pointer 1...line pointer n
......
Scan the page of block 24, scan line pointer 1 to find the required data heap tuple1

4.2.2 index reading: B-tree index scan

select * from test_table where id=244;

here Directly through the index tuple record '(block = 24, Offset = 2)), find the page of block 24, read line pointer 2, and find the desired heap tuple 2

Reference resources

The pgstattuple module provides a variety of functions to obtain tuple level statistics.
The pageinspect module provides functions that let you view the contents of database pages from a low level
Official documents

Posted by forums on Sun, 26 Apr 2020 01:56:33 -0700

Programmer Group