Label
PostgreSQL, st_contains, st_within, spatial inclusion, spatial bound box, GiST index, spatial index structure, IO amplification, BOUND BOX amplification
background
Point-to-surface judgment, selection of points by face circle or other objects are very typical requirements in the application of GIS geometry.
GiST indexing in PostgreSQL can speed up such judgments, but is indexing enough?
In many cases, indexing is not enough, performance has not reached its peak, if you want lower latency, less CPU overhead, what other optimization means?
In fact, I've written a similar article before about the optimization of BTree index access. When the linear correlation between data storage and index order is very poor, I introduce a problem: IO enlargement when accessing:
The principles and solutions are well documented. There are similar problems and optimization methods for spatial indexing. But first you need to understand the structure of spatial index:
Understanding the Construction of GiST Index through Spatial Thought
Then you can reduce the IO of spatial scanning by aggregating space.
PostgreSQL Black Technology - Spatial Aggregate Storage
Next, an example of search is given to illustrate the optimization method of space containing search.
There are 10 million spatial object data in the table to query the spatial object covered by a polygon. This query has a feature that the polygon is a long strip of polygon, and the BOUND BOX containing this polygon is relatively large.
The way to construct this polygon
postgres=# select st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(0 0,1 0,1 2.5,6 2.5,6 4,7 4,7 5,5 5,5 3,0 3,0 0)')), 4326); st_setsrid ---------------------------- 0103000020E6100000010000000B00000000000000000000000000000000000000000000000000F03F0000000000000000000000000000F03F000000000000044000000000000018400000000000000440000000000000184000000000000010400000000000001C4000000000000010400000000000001C40000000000000144000000000000014400000000000001440000000000000144000000000000008400000000000000000000000000000084000000000000000000000000000000000 (1 row)
Optimizing Method 1 - Spatial Aggregation
1. Build tables
postgres=# create table e(id int8, pos geometry); CREATE TABLE
2. Write space test data (10 million random points covering the longitude and latitude range of +50)
postgres=# insert into e select id, st_setsrid(st_makepoint(50-random()*100, 50-random()*100), 4326) from generate_series(1,10000000) t(id); INSERT 0 10000000
3. Creating spatial index
postgres=# create index idx_e on e using gist(pos); CREATE INDEX
4. Query the object that satisfies the BOUND BOX condition of the object covered by the BOUND BOX of this polygon.
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from e where pos @ st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(0 0,1 0,1 2.5,6 2.5,6 4,7 4,7 5,5 5,5 3,0 3,0 0)')), 4326); QUERY PLAN ----------------------- Index Scan using idx_e on public.e (cost=0.42..12526.72 rows=10000 width=40) (actual time=0.091..39.449 rows=35081 loops=1) Output: id, pos Index Cond: (e.pos @ '0103000020E6100000010000000B00000000000000000000000000000000000000000000000000F03F0000000000000000000000000000F03F000000000000044000000000000018400000000000000440000000000000184000000000000010400000000000001C4000000000000010400000000000001C40000000000000144000000000000014400000000000001440000000000000144000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry) Buffers: shared hit=35323 Planning time: 0.108 ms Execution time: 41.222 ms (6 rows)
Searched 35323 data blocks and returned 35081 records.
5. Query the object contained by this polygon.
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from e where st_contains(st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(0 0,1 0,1 2.5,6 2.5,6 4,7 4,7 5,5 5,5 3,0 3,0 0)')), 4326), pos); QUERY PLAN ----------------------- Index Scan using idx_e on public.e (cost=0.42..15026.72 rows=3333 width=40) (actual time=0.077..49.015 rows=8491 loops=1) Output: id, pos Index Cond: ('0103000020E6100000010000000B00000000000000000000000000000000000000000000000000F03F0000000000000000000000000000F03F000000000000044000000000000018400000000000000440000000000000184000000000000010400000000000001C4000000000000010400000000000001C40000000000000144000000000000014400000000000001440000000000000144000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry ~ e.pos) Filter: _st_contains('0103000020E6100000010000000B00000000000000000000000000000000000000000000000000F03F0000000000000000000000000000F03F000000000000044000000000000018400000000000000440000000000000184000000000000010400000000000001C4000000000000010400000000000001C40000000000000144000000000000014400000000000001440000000000000144000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry, e.pos) Rows Removed by Filter: 26590 Buffers: shared hit=35323 Planning time: 0.085 ms Execution time: 49.460 ms (8 rows)
Searched 35323 data blocks, searched 35081 records, returned 8491 records, filtered 26590 records that did not meet the conditions.
The query differences between 5 and 4 are BOUND BOX inclusion and actual contour inclusion. The index is based on bound box. We can also learn about this principle in the following documents.
Understanding the Construction of GiST Index through Spatial Thought
We can see that there are not many records of composite conditions, but many data blocks are searched. Spatial aggregation can reduce the scanning of data blocks.
6. Create another table and adjust the order of data storage according to spatial aggregation. And establish spatial index.
postgres=# create table f(like e); CREATE TABLE postgres=# insert into f select * from e order by st_geohash(pos,15); INSERT 0 10000000 postgres=# create index idx_f on f using gist(pos); CREATE INDEX
7. After optimization:
Query the object that satisfies the BOUND BOX condition of the object covered by the BOUND BOX of this polygon. From scanning 35323 data blocks to accessing 1648 data blocks. A qualitative leap.
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from f where pos @ st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(0 0,1 0,1 2.5,6 2.5,6 4,7 4,7 5,5 5,5 3,0 3,0 0)')), 4326); QUERY PLAN ----------------------- Index Scan using idx_f on public.f (cost=0.42..12526.72 rows=10000 width=40) (actual time=0.081..9.702 rows=35081 loops=1) Output: id, pos Index Cond: (f.pos @ '0103000020E6100000010000000B00000000000000000000000000000000000000000000000000F03F0000000000000000000000000000F03F000000000000044000000000000018400000000000000440000000000000184000000000000010400000000000001C4000000000000010400000000000001C40000000000000144000000000000014400000000000001440000000000000144000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry) Buffers: shared hit=1648 Planning time: 0.096 ms Execution time: 11.404 ms (6 rows)
8. After optimization:
Query the object contained in this polygon. From scanning 35323 data blocks to accessing 1648 data blocks. A qualitative leap.
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from f where st_contains(st_setsrid(st_makepolygon(ST_GeomFromText('LINESTRING(0 0,1 0,1 2.5,6 2.5,6 4,7 4,7 5,5 5,5 3,0 3,0 0)')), 4326), pos); QUERY PLAN ----------------------- Index Scan using idx_f on public.f (cost=0.42..15026.72 rows=3333 width=40) (actual time=1.216..32.398 rows=8491 loops=1) Output: id, pos Index Cond: ('0103000020E6100000010000000B00000000000000000000000000000000000000000000000000F03F0000000000000000000000000000F03F000000000000044000000000000018400000000000000440000000000000184000000000000010400000000000001C4000000000000010400000000000001C40000000000000144000000000000014400000000000001440000000000000144000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry ~ f.pos) Filter: _st_contains('0103000020E6100000010000000B00000000000000000000000000000000000000000000000000F03F0000000000000000000000000000F03F000000000000044000000000000018400000000000000440000000000000184000000000000010400000000000001C4000000000000010400000000000001C40000000000000144000000000000014400000000000001440000000000000144000000000000008400000000000000000000000000000084000000000000000000000000000000000'::geometry, f.pos) Rows Removed by Filter: 26590 Buffers: shared hit=1648 Planning time: 0.101 ms Execution time: 32.837 ms (8 rows)
Using spatial aggregation, the number of data blocks scanned was reduced from 35323 to 1648. A qualitative leap.
Optimizing Method 2 - Spatial Split Query
The optimization method of spatial aggregation solves the problem of IO amplification. Another optimization point is related to the structure of spatial index, which is the problem of BOUND BOX amplification.
From the example in this paper, we can also see that the spatial index is actually for bound box, so when the effective area ratio is low, most invalid data may be circled, resulting in both IO and CPU enlargement, so we can solve it.
The dotted line section in the figure below contains the BOUND BOX of this strip. Currently, when using GiST index queries to satisfy the POS conditions contained in this polygon, the database will get out all the objects that fall into this BOUND BOX.
Optimizing ideas:
Divide the polygon into four BOX to completely eliminate the problem of bound box enlargement.
explain (analyze,verbose,timing,costs,buffers) select * from f where st_contains(st_setsrid(st_makebox2d(st_makepoint(0,0), st_makepoint(1,3)), 4326), pos) or st_contains(st_setsrid(st_makebox2d(st_makepoint(1,2.5), st_makepoint(5,3)), 4326), pos) or st_contains(st_setsrid(st_makebox2d(st_makepoint(5,2.5), st_makepoint(6,5)), 4326), pos) or st_contains(st_setsrid(st_makebox2d(st_makepoint(6,4), st_makepoint(7,5)), 4326), pos); explain (analyze,verbose,timing,costs,buffers) select * from f where pos @ st_setsrid(st_makebox2d(st_makepoint(0,0), st_makepoint(1,3)), 4326) or pos @ st_setsrid(st_makebox2d(st_makepoint(1,2.5), st_makepoint(5,3)), 4326) or pos @ st_setsrid(st_makebox2d(st_makepoint(5,2.5), st_makepoint(6,5)), 4326) or pos @ st_setsrid(st_makebox2d(st_makepoint(6,4), st_makepoint(7,5)), 4326);
1. After optimizing the means of combining 1 and 2:
Query the object that satisfies the BOUND BOX condition of the object covered by the BOUND BOX of this polygon. From scanning 1648 data blocks to accessing 243 data blocks. A qualitative leap.
explain (analyze,verbose,timing,costs,buffers) select * from f where pos @ st_setsrid(st_makebox2d(st_makepoint(0,0), st_makepoint(1,3)), 4326) or pos @ st_setsrid(st_makebox2d(st_makepoint(1,2.5), st_makepoint(5,3)), 4326) or pos @ st_setsrid(st_makebox2d(st_makepoint(5,2.5), st_makepoint(6,5)), 4326) or pos @ st_setsrid(st_makebox2d(st_makepoint(6,4), st_makepoint(7,5)), 4326); QUERY PLAN ----------------------- Bitmap Heap Scan on public.f (cost=10000000690.01..10000037405.46 rows=39940 width=40) (actual time=1.502..2.329 rows=8491 loops=1) Output: id, pos Recheck Cond: ((f.pos @ '0103000020E610000001000000050000000000000000000000000000000000000000000000000000000000000000000840000000000000F03F0000000000000840000000000000F03F000000000000000000000000000000000000000000000000'::geometry) OR (f.pos @ '0103000020E61000000100000005000000000000000000F03F0000000000000440000000000000F03F00000000000008400000000000001440000000000000084000000000000014400000000000000440000000000000F03F0000000000000440'::geometry) OR (f.pos @ '0103000020E610000001000000050000000000000000001440000000000000044000000000000014400000000000001440000000000000184000000000000014400000000000001840000000000000044000000000000014400000000000000440'::geometry) OR (f.pos @ '0103000020E6100000010000000500000000000000000018400000000000001040000000000000184000000000000014400000000000001C4000000000000014400000000000001C40000000000000104000000000000018400000000000001040'::geometry)) Heap Blocks: exact=119 Buffers: shared hit=243 -> BitmapOr (cost=690.01..690.01 rows=40000 width=0) (actual time=1.483..1.483 rows=0 loops=1) Buffers: shared hit=124 -> Bitmap Index Scan on idx_f (cost=0.00..162.52 rows=10000 width=0) (actual time=0.461..0.461 rows=3077 loops=1) Index Cond: (f.pos @ '0103000020E610000001000000050000000000000000000000000000000000000000000000000000000000000000000840000000000000F03F0000000000000840000000000000F03F000000000000000000000000000000000000000000000000'::geometry) Buffers: shared hit=37 -> Bitmap Index Scan on idx_f (cost=0.00..162.52 rows=10000 width=0) (actual time=0.423..0.423 rows=1991 loops=1) Index Cond: (f.pos @ '0103000020E61000000100000005000000000000000000F03F0000000000000440000000000000F03F00000000000008400000000000001440000000000000084000000000000014400000000000000440000000000000F03F0000000000000440'::geometry) Buffers: shared hit=33 -> Bitmap Index Scan on idx_f (cost=0.00..162.52 rows=10000 width=0) (actual time=0.366..0.366 rows=2435 loops=1) Index Cond: (f.pos @ '0103000020E610000001000000050000000000000000001440000000000000044000000000000014400000000000001440000000000000184000000000000014400000000000001840000000000000044000000000000014400000000000000440'::geometry) Buffers: shared hit=31 -> Bitmap Index Scan on idx_f (cost=0.00..162.52 rows=10000 width=0) (actual time=0.232..0.232 rows=988 loops=1) Index Cond: (f.pos @ '0103000020E6100000010000000500000000000000000018400000000000001040000000000000184000000000000014400000000000001C4000000000000014400000000000001C40000000000000104000000000000018400000000000001040'::geometry) Buffers: shared hit=23 Planning time: 0.104 ms Execution time: 2.751 ms (21 rows)
2. After optimizing the means of combination 1 and 2:
Query the object contained in this polygon. From scanning 1648 data blocks to accessing 243 data blocks. A qualitative leap.
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from f where st_contains(st_setsrid(st_makebox2d(st_makepoint(0,0), st_makepoint(1,3)), 4326), pos) or st_contains(st_setsrid(st_makebox2d(st_makepoint(1,2.5), st_makepoint(5,3)), 4326), pos) or st_contains(st_setsrid(st_makebox2d(st_makepoint(5,2.5), st_makepoint(6,5)), 4326), pos) or st_contains(st_setsrid(st_makebox2d(st_makepoint(6,4), st_makepoint(7,5)), 4326), pos); QUERY PLAN -------------------------------------------- Bitmap Heap Scan on public.f (cost=663.40..77378.85 rows=13327 width=40) (actual time=1.496..11.038 rows=8491 loops=1) Output: id, pos Recheck Cond: (('0103000020E610000001000000050000000000000000000000000000000000000000000000000000000000000000000840000000000000F03F0000000000000840000000000000F03F000000000000000000000000000000000000000000000000'::geometry ~ f.pos) OR ('0103000020E61000000100000005000000000000000000F03F0000000000000440000000000000F03F00000000000008400000000000001440000000000000084000000000000014400000000000000440000000000000F03F0000000000000440'::geometry ~ f.pos) OR ('0103000020E610000001000000050000000000000000001440000000000000044000000000000014400000000000001440000000000000184000000000000014400000000000001840000000000000044000000000000014400000000000000440'::geometry ~ f.pos) OR ('0103000020E6100000010000000500000000000000000018400000000000001040000000000000184000000000000014400000000000001C4000000000000014400000000000001C40000000000000104000000000000018400000000000001040'::geometry ~ f.pos)) Filter: ((('0103000020E610000001000000050000000000000000000000000000000000000000000000000000000000000000000840000000000000F03F0000000000000840000000000000F03F000000000000000000000000000000000000000000000000'::geometry ~ f.pos) AND _st_contains('0103000020E610000001000000050000000000000000000000000000000000000000000000000000000000000000000840000000000000F03F0000000000000840000000000000F03F000000000000000000000000000000000000000000000000'::geometry, f.pos)) OR (('0103000020E61000000100000005000000000000000000F03F0000000000000440000000000000F03F00000000000008400000000000001440000000000000084000000000000014400000000000000440000000000000F03F0000000000000440'::geometry ~ f.pos) AND _st_contains('0103000020E61000000100000005000000000000000000F03F0000000000000440000000000000F03F00000000000008400000000000001440000000000000084000000000000014400000000000000440000000000000F03F0000000000000440'::geometry, f.pos)) OR (('0103000020E610000001000000050000000000000000001440000000000000044000000000000014400000000000001440000000000000184000000000000014400000000000001840000000000000044000000000000014400000000000000440'::geometry ~ f.pos) AND _st_contains('0103000020E610000001000000050000000000000000001440000000000000044000000000000014400000000000001440000000000000184000000000000014400000000000001840000000000000044000000000000014400000000000000440'::geometry, f.pos)) OR (('0103000020E6100000010000000500000000000000000018400000000000001040000000000000184000000000000014400000000000001C4000000000000014400000000000001C40000000000000104000000000000018400000000000001040'::geometry ~ f.pos) AND _st_contains('0103000020E6100000010000000500000000000000000018400000000000001040000000000000184000000000000014400000000000001C4000000000000014400000000000001C40000000000000104000000000000018400000000000001040'::geometry, f.pos))) Heap Blocks: exact=119 Buffers: shared hit=243 -> BitmapOr (cost=663.40..663.40 rows=40000 width=0) (actual time=1.472..1.472 rows=0 loops=1) Buffers: shared hit=124 -> Bitmap Index Scan on idx_f (cost=0.00..162.52 rows=10000 width=0) (actual time=0.436..0.436 rows=3077 loops=1) Index Cond: ('0103000020E610000001000000050000000000000000000000000000000000000000000000000000000000000000000840000000000000F03F0000000000000840000000000000F03F000000000000000000000000000000000000000000000000'::geometry ~ f.pos) Buffers: shared hit=37 -> Bitmap Index Scan on idx_f (cost=0.00..162.52 rows=10000 width=0) (actual time=0.438..0.438 rows=1991 loops=1) Index Cond: ('0103000020E61000000100000005000000000000000000F03F0000000000000440000000000000F03F00000000000008400000000000001440000000000000084000000000000014400000000000000440000000000000F03F0000000000000440'::geometry ~ f.pos) Buffers: shared hit=33 -> Bitmap Index Scan on idx_f (cost=0.00..162.52 rows=10000 width=0) (actual time=0.365..0.365 rows=2435 loops=1) Index Cond: ('0103000020E610000001000000050000000000000000001440000000000000044000000000000014400000000000001440000000000000184000000000000014400000000000001840000000000000044000000000000014400000000000000440'::geometry ~ f.pos) Buffers: shared hit=31 -> Bitmap Index Scan on idx_f (cost=0.00..162.52 rows=10000 width=0) (actual time=0.234..0.234 rows=988 loops=1) Index Cond: ('0103000020E6100000010000000500000000000000000018400000000000001040000000000000184000000000000014400000000000001C4000000000000014400000000000001C40000000000000104000000000000018400000000000001040'::geometry ~ f.pos) Buffers: shared hit=23 Planning time: 0.163 ms Execution time: 11.497 ms (22 rows)
Optimizing method 2, splitting the long polygon into several small boxes, eliminating the large bound box, and reducing the search BLOCK to 243 again. A qualitative leap.
The combination of the two means has played a good role in the combination of two swords.
st_split object segmentation
PostGIS provides a method of object segmentation.
http://postgis.net/docs/manual-2.4/ST_Split.html
-- this creates a geometry collection consisting of the 2 halves of the polygon -- this is similar to the example we demonstrated in ST_BuildArea SELECT ST_Split(circle, line) FROM (SELECT ST_MakeLine(ST_MakePoint(10, 10),ST_MakePoint(190, 190)) As line, ST_Buffer(ST_GeomFromText('POINT(100 90)'), 50) As circle) As foo; -- result -- GEOMETRYCOLLECTION(POLYGON((150 90,149.039264020162 80.2454838991936,146.193976625564 70.8658283817455,...), POLYGON(...))) -- To convert to individual polygons, you can use ST_Dump or ST_GeometryN SELECT ST_AsText((ST_Dump(ST_Split(circle, line))).geom) As wkt FROM (SELECT ST_MakeLine(ST_MakePoint(10, 10),ST_MakePoint(190, 190)) As line, ST_Buffer(ST_GeomFromText('POINT(100 90)'), 50) As circle) As foo; -- result -- wkt --------------- POLYGON((150 90,149.039264020162 80.2454838991936,...)) POLYGON((60.1371179574584 60.1371179574584,58.4265193848728 62.2214883490198,53.8060233744357 ...))
SELECT ST_AsText(ST_Split(mline, pt)) As wktcut FROM (SELECT ST_GeomFromText('MULTILINESTRING((10 10, 190 190), (15 15, 30 30, 100 90))') As mline, ST_Point(30,30) As pt) As foo; wktcut ------ GEOMETRYCOLLECTION( LINESTRING(10 10,30 30), LINESTRING(30 30,190 190), LINESTRING(15 15,30 30), LINESTRING(30 30,100 90) )
I later wrote a document to simplify SPLIT:
PostgreSQL Spatial Cutting (st_split) Function Extension - Spatial Object Griding
st_snap
http://postgis.net/docs/manual-2.4/ST_Snap.html
Differences between @, ~and ST_Contains, ST_Within
What's the difference between @, ~and ST_Contains, ST_Within, which are all operators or functions contained in objects?
@
A @ B
Returns TRUE if A's bounding box is contained by B's.
~
Contrary to @.
A ~ B
Returns TRUE if A's bounding box contains B's.
ST_Contains
ST_Contains(A, B)
Returns true if and only if no points of B lie in the exterior of A, and at least one point of the interior of B lies in the interior of A.
ST_Within
Contrary to ST_Contains.
ST_Within(A, B)
Returns true if the geometry A is completely inside geometry B
Difference
The operations of @ and ~are not directed at geometric objects, but for boundboxes of A and B, that is, BOX consisting of points on the bottom left and upper right of the object. ST_Within and ST_Contains are for geometric objects, but from the point of view of GiST index search, it is necessary to use BOUND BOX to search first, and then use CPU to calculate and judge.
Example
A @ Polygon, return to true B @ Polygon, return to true C @ Polygon, return to true ST_Contains(Polygon, A), return false ST_Contains(Polygon, B), returns true ST_Contains(Polygon, C), return false
Summary
Two optimizable points of spatial search are as follows:
1. Spatial data is stored in disorder during storage, resulting in a lot of data blocks scanned when searching for a batch of data. (You can't feel the problem by checking.)
2. The GiST spatial index of PostGIS uses BOUND BOX as KEY, and the BOUND BOX of the object is also used for searching. Therefore, when the object is a long bar, it may cause a large number of BOUND BOX holes, enlarge the scanning range (for st_contains, st_within), and increase the cost of CPU filtering.
Optimizing means 1: space aggregation to solve the problem of IO amplification.
Optimizing method 2: SPLIT is applied to input condition (polygon of strip) to reduce the scanning range introduced by BOUND BOX amplification (for st_contains, st_within).
Data volume: 10 million.
Point-to-surface judgment (strip polygon, or spacial object covered by discrete polygon objects).
Before optimization | Optimize 1 (Spatial Aggregation) | Optimize 1,2(SPLIT polygon) |
---|---|---|
Access 35323 blocks | Access 1648 blocks | Visit 243 blocks |
Filtration 26590 | Filtration 26590 | Filter 0 |
Reference resources
Understanding the Construction of GiST Index through Spatial Thought
PostgreSQL Black Technology - Spatial Aggregate Storage
PostGIS Spatial Data Learning Suggestions
PostgreSQL Spatial Cutting (st_split) Function Extension - Spatial Object Griding
http://postgis.net/docs/manual-2.4/ST_Within.html
http://postgis.net/docs/manual-2.4/ST_Contains.html
http://postgis.net/docs/manual-2.4/ST_Geometry_Contained.html
http://postgis.net/docs/manual-2.4/ST_Geometry_Contain.html