This experiment comes from Module 2 of the Oracle developer performance course, "What are Database Statistics?"; it is a re-run and interpretation of that experiment.
First, create the two tables for the experiment, bricks and colours, and generate statistics for them:
-- Per the Database Reference, the default value of statistics_level is TYPICAL;
-- setting it to ALL additionally collects timed operating system statistics
-- and plan execution statistics.
alter session set statistics_level = all;

create table bricks (
  brick_id         integer not null primary key,
  colour_rgb_value varchar2(10) not null,
  shape            varchar2(10) not null,
  weight           integer not null
);

create table colours (
  colour_rgb_value varchar2(10) not null,
  colour_name      varchar2(10) not null
);

insert into colours values ( 'FF0000', 'red' );
insert into colours values ( '00FF00', 'green' );
insert into colours values ( '0000FF', 'blue' );

insert into bricks
  select rownum,
         case mod ( level, 3 )
           when 0 then 'FF0000'
           when 1 then '00FF00'
           when 2 then '0000FF'
         end,
         case mod ( level, 3 )
           when 0 then 'cylinder'
           when 1 then 'cube'
           when 2 then 'pyramid'
         end,
         floor ( 100 / rownum )
  from   dual
  connect by level <= 100;

insert into bricks
  select rownum + 1000,
         case mod ( level, 3 )
           when 0 then 'FF0000'
           when 1 then '00FF00'
           when 2 then '0000FF'
         end,
         case mod ( level, 3 )
           when 0 then 'cylinder'
           when 1 then 'cube'
           when 2 then 'pyramid'
         end,
         floor ( 200 / rownum )
  from   dual
  connect by level <= 200;

commit;

declare
  stats   dbms_stats.statrec;
  distcnt number;
  density number;
  nullcnt number;
  avgclen number;
begin
  dbms_stats.gather_table_stats ( null, 'colours' );
  dbms_stats.gather_table_stats ( null, 'bricks' );

  dbms_stats.set_table_stats ( null, 'bricks',  numrows => 30 );
  dbms_stats.set_table_stats ( null, 'colours', numrows => 3000 );

  dbms_stats.get_column_stats (
    null, 'colours', 'colour_rgb_value',
    distcnt => distcnt, density => density,
    nullcnt => nullcnt, avgclen => avgclen,
    srec    => stats
  );

  stats.minval := utl_raw.cast_to_raw ( '0000FF' );
  stats.maxval := utl_raw.cast_to_raw ( 'FF0000' );

  dbms_stats.set_column_stats (
    null, 'colours', 'colour_rgb_value',
    distcnt => 10, srec => stats
  );
  dbms_stats.set_column_stats (
    null, 'bricks', 'colour_rgb_value',
    distcnt => 10, srec => stats
  );
end;
/
After inserting the data, the colours table has only three rows:
select /*ansiconsole*/ * from colours;

COLOUR_RGB_VALUE    COLOUR_NAME
FF0000              red
00FF00              green
0000FF              blue
The bricks table is populated by two inserts: the first adds 100 rows with brick_id from 1 to 100; the second adds 200 rows with brick_id from 1001 to 1200. There are three colours, cycling by brick_id, and three shapes (cylinder, cube, and pyramid), also cycling by brick_id. The weight is floor(100 / rownum) in the first insert and floor(200 / rownum) in the second, so the weight distribution is very uneven.
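To see how uneven the weights are, a quick check (not part of the original lesson) is to count the rows per weight value:

-- Count rows per weight; a few small weight values account for most of the bricks
select weight, count(*) as row_count
from   bricks
group  by weight
order  by row_count desc;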
After inserting the data, gather_table_stats is called immediately to generate statistics for the tables and columns. The following are the correct statistics:
select ut.table_name, ut.num_rows,
       utcs.column_name, utcs.num_distinct,
       case utc.data_type
         when 'VARCHAR2' then utl_raw.cast_to_varchar2 ( utcs.low_value )
         when 'NUMBER'   then to_char ( utl_raw.cast_to_number ( utcs.low_value ) )
       end low_val,
       case utc.data_type
         when 'VARCHAR2' then utl_raw.cast_to_varchar2 ( utcs.high_value )
         when 'NUMBER'   then to_char ( utl_raw.cast_to_number ( utcs.high_value ) )
       end high_val
from   user_tables ut
join   user_tab_cols utc
on     ut.table_name = utc.table_name
join   user_tab_col_statistics utcs
on     ut.table_name = utcs.table_name
and    utc.column_name = utcs.column_name
and    ut.table_name in ('BRICKS', 'COLOURS')
order  by ut.table_name, utcs.column_name;

TABLE_NAME    NUM_ROWS    COLUMN_NAME         NUM_DISTINCT    LOW_VAL    HIGH_VAL
_____________ ___________ ___________________ _______________ __________ ___________
BRICKS        300         BRICK_ID            300             1          1200
BRICKS        300         COLOUR_RGB_VALUE    3               0000FF     FF0000
BRICKS        300         SHAPE               3               cube       pyramid
BRICKS        300         WEIGHT              27              1          200
COLOURS       3           COLOUR_NAME         3               blue       red
COLOURS       3           COLOUR_RGB_VALUE    3               0000FF     FF0000

6 rows selected.
To mislead the optimizer, set_table_stats and set_column_stats are then used to modify the table and column statistics. The modified statistics are as follows:
TABLE_NAME    NUM_ROWS    COLUMN_NAME         NUM_DISTINCT    LOW_VAL    HIGH_VAL
_____________ ___________ ___________________ _______________ __________ ___________
BRICKS        30          BRICK_ID            300             1          1200
BRICKS        30          COLOUR_RGB_VALUE    10              0000FF     FF0000
BRICKS        30          SHAPE               3               cube       pyramid
BRICKS        30          WEIGHT              27              1          200
COLOURS       3000        COLOUR_NAME         3               blue       red
COLOURS       3000        COLOUR_RGB_VALUE    10              0000FF     FF0000

6 rows selected.
The null passed as the first parameter of these procedures means the current schema is used.
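For illustration (not part of the original script), an equivalent gather call could name the schema explicitly:

-- USER returns the current schema, so this is equivalent to passing null
exec dbms_stats.gather_table_stats ( ownname => user, tabname => 'BRICKS' )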
The following items have been specifically modified:
- The number of rows in the bricks table is changed from 300 to 30
- The number of rows in the colours table is changed from 3 to 3000
- The colour_rgb_value column of the bricks table: num_distinct changed from 3 to 10
- The colour_rgb_value column of the colours table: num_distinct changed from 3 to 10
At this point, the statistics deviate significantly from the actual data.
Row statistics
Execute the following SQL and get the execution plan:
set pages 9999
set lines 120

select /*+ gather_plan_statistics */ c.colour_name, count (*)
from   bricks b
join   colours c
on     c.colour_rgb_value = b.colour_rgb_value
group  by c.colour_name;

select * from table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));

------------------------------------------------------------------
| Id  | Operation           | Name    | Starts | E-Rows | A-Rows |
------------------------------------------------------------------
|   0 | SELECT STATEMENT    |         |      1 |        |      3 |
|   1 |  HASH GROUP BY      |         |      1 |      3 |      3 |
|*  2 |   HASH JOIN         |         |      1 |   9000 |    300 |
|   3 |    TABLE ACCESS FULL| BRICKS  |      1 |     30 |    300 |
|   4 |    TABLE ACCESS FULL| COLOURS |      1 |   3000 |      3 |
------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("C"."COLOUR_RGB_VALUE"="B"."COLOUR_RGB_VALUE")
Let's start with the first important principle:
The optimizer prefers to start a join with the table returning the fewest rows.
That is, the optimizer prefers to start the join with the table that returns the fewest rows. In the execution plan, A-Rows is the number of rows actually returned. This plan clearly violates the principle: it starts with bricks, which returns 300 rows, and only then processes colours, which returns 3 rows. The optimizer was misled by the skewed statistics. Note that "fewest rows returned" is evaluated after the WHERE conditions are applied: a table with 100 million rows that returns only a few rows still counts as returning few rows.
E-Rows is the number of rows estimated by the optimizer, and Starts is the number of times the operation executes. The second principle: if E-Rows * Starts is close to A-Rows, the plan is likely a good one.
Therefore, we need to correct the statistics:
exec dbms_stats.gather_table_stats ( null, 'colours' ) ;
exec dbms_stats.gather_table_stats ( null, 'bricks' ) ;

SELECT ut.table_name, ut.num_rows,
       utcs.column_name, utcs.num_distinct
FROM   user_tables ut
JOIN   user_tab_col_statistics utcs
ON     ut.table_name = utcs.table_name
WHERE  ut.table_name IN ( 'COLOURS', 'BRICKS' );

TABLE_NAME    NUM_ROWS    COLUMN_NAME         NUM_DISTINCT
_____________ ___________ ___________________ _______________
BRICKS        300         BRICK_ID            300
BRICKS        300         COLOUR_RGB_VALUE    3
BRICKS        300         SHAPE               3
BRICKS        300         WEIGHT              27
COLOURS       3           COLOUR_RGB_VALUE    3
COLOURS       3           COLOUR_NAME         3

6 rows selected.
Oracle can automatically collect statistics in two cases:
- Daily maintenance window
- When the number of rows changed in a table exceeds the staleness threshold (10% by default)
This threshold can be modified. For example, the following queries the current setting and then changes the threshold for the colours table to 1%:
select dbms_stats.get_prefs ( 'STALE_PERCENT', null, 'colours' ) from dual;

exec dbms_stats.set_table_prefs ( null, 'colours', 'STALE_PERCENT', 1 );
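To check whether Oracle currently considers a table's statistics stale, one option (not part of the original lesson) is to flush the monitoring information and query user_tab_statistics:

-- Make recent DML counts visible to the dictionary views, then check the STALE_STATS flag
exec dbms_stats.flush_database_monitoring_info

select table_name, num_rows, stale_stats
from   user_tab_statistics
where  table_name in ( 'BRICKS', 'COLOURS' );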
Updating statistics may cause the optimizer to generate a new execution plan for a SQL statement. Oracle decides when to invalidate existing cursors, so plans are not necessarily recompiled immediately after statistics are refreshed. You can force immediate invalidation as follows:
exec dbms_stats.gather_table_stats ( null, 'colours', no_invalidate => false ) ;
At this point, the row statistics (num_rows and num_distinct) are correct again.
Column histograms to address data skew
Let's take a look at Data Skew.
Execute the following SQL:
SELECT /*+ gather_plan_statistics */ COUNT(*)
FROM   bricks
WHERE  weight = 1;

SELECT * FROM TABLE ( dbms_xplan.display_cursor(format => 'ROWSTATS LAST') );

----------------------------------------------------------------
| Id  | Operation          | Name   | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |      1 |        |      1 |
|   1 |  SORT AGGREGATE    |        |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL| BRICKS |      1 |     11 |    150 |
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("WEIGHT"=1)
Obviously, this is not a good execution plan: A-Rows is 150, yet E-Rows estimates only 11. The reason is that, without a histogram, the optimizer assumes the values are evenly distributed. This example demonstrates value skew:
SQL> select floor(count(*) / count(distinct weight)) from bricks;

   FLOOR(COUNT(*)/COUNT(DISTINCTWEIGHT))
________________________________________
                                      11
There is another form of data skew, related to how values are spread across their range, namely range skew. For example:
select /*+ gather_plan_statistics */ count (*)
from   bricks
where  brick_id between 0 and 100;
select * from table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));

select /*+ gather_plan_statistics */ count (*)
from   bricks
where  brick_id between 400 and 500;
select * from table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));

select /*+ gather_plan_statistics */ count (*)
from   bricks
where  brick_id between 1000 and 1100;
select * from table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));

Their execution plans are as follows:
--------------------------------------------------------------------
| Id  | Operation         | Name        | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |      1 |        |      1 |
|   1 |  SORT AGGREGATE   |             |      1 |      1 |      1 |
|*  2 |   INDEX RANGE SCAN| SYS_C008566 |      1 |     26 |    100 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("BRICK_ID">=0 AND "BRICK_ID"<=100)

...

--------------------------------------------------------------------
| Id  | Operation         | Name        | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |      1 |        |      1 |
|   1 |  SORT AGGREGATE   |             |      1 |      1 |      1 |
|*  2 |   INDEX RANGE SCAN| SYS_C008566 |      1 |     27 |      0 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("BRICK_ID">=400 AND "BRICK_ID"<=500)

...

--------------------------------------------------------------------
| Id  | Operation         | Name        | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |      1 |        |      1 |
|   1 |  SORT AGGREGATE   |             |      1 |      1 |      1 |
|*  2 |   INDEX RANGE SCAN| SYS_C008566 |      1 |     27 |    100 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("BRICK_ID">=1000 AND "BRICK_ID"<=1100)
Why are the estimates 26 or 27? Because a range of 100 is 1/12 of the 1-to-1200 span of brick_id, and there are 300 actual brick_id values. Therefore 300 / 12 = 25, which is close to 26 and 27.
select min(brick_id), max(brick_id), count(distinct(brick_id)) from bricks;

MIN(BRICK_ID) MAX(BRICK_ID) COUNT(DISTINCT(BRICK_ID))
------------- ------------- -------------------------
            1          1200                       300
The actual distribution is as follows, which is consistent with A-Rows:
with rws as (
  select level r from dual
  connect by level <= 15
)
select (r-1) * 100 as lower_limit,
       r * 100     as upper_limit,
       count ( brick_id )
from   rws
left   join bricks
on     ceil ( brick_id / 100 ) = r
group  by r
order  by r;

LOWER_LIMIT UPPER_LIMIT COUNT(BRICK_ID)
----------- ----------- ---------------
          0         100             100
        100         200               0
        200         300               0
        300         400               0
        400         500               0
        500         600               0
        600         700               0
        700         800               0
        800         900               0
        900        1000               0
       1000        1100             100
       1100        1200             100
       1200        1300               0
       1300        1400               0
       1400        1500               0

15 rows selected.
To solve these problems, you need to generate histograms on the affected columns. First confirm that there are currently no histograms: HISTOGRAM is NONE and NUM_BUCKETS equals 1:
select utcs.column_name, utcs.histogram, utcs.num_buckets
from   user_tables ut
join   user_tab_col_statistics utcs
on     ut.table_name = utcs.table_name
where  ut.table_name = 'BRICKS'
and    utcs.column_name in ( 'BRICK_ID', 'WEIGHT' );

COLUMN_NAME    HISTOGRAM    NUM_BUCKETS
______________ ____________ ______________
BRICK_ID       NONE         1
WEIGHT         NONE         1
Gathering statistics again now creates histograms on these columns automatically. Why automatically this time? Because the data itself has value and range skew, and the statements executed earlier used these columns in ways that meet certain conditions.
exec dbms_stats.gather_table_stats ( null, 'bricks', no_invalidate => false ) ;

select utcs.column_name, utcs.histogram, utcs.num_buckets
from   user_tables ut
join   user_tab_col_statistics utcs
on     ut.table_name = utcs.table_name
where  ut.table_name = 'BRICKS'
and    utcs.column_name in ( 'BRICK_ID', 'WEIGHT' );

COLUMN_NAME    HISTOGRAM    NUM_BUCKETS
______________ ____________ ______________
BRICK_ID       HYBRID       254
WEIGHT         FREQUENCY    27
These conditions are:
- The column has value skew and statements use the column in range (<, >=, etc.), LIKE, or equality conditions
- The column has range skew and the column is used in range or LIKE conditions.
- The column has a small number of distinct values (with some repeated values) and the column is used in range (<, >=, etc.), LIKE, or equality conditions
- Histograms may also be captured when incremental statistics are used, even if there is no skew. The optimizer ignores these, and they are out of scope for this tutorial.
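These conditions are evaluated against the column usage Oracle records as queries run. One way to see that usage report (an optional check, not part of the original lesson):

-- Report which columns of BRICKS have been used in equality, range, or LIKE predicates;
-- this recorded usage is what drives automatic histogram creation with SIZE AUTO
select dbms_stats.report_col_usage ( user, 'BRICKS' ) from dual;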
Histograms come in four types:
- Frequency: used when the column has a small number of distinct values (at most the number of buckets).
- Height balanced: legacy type, now obsolete.
- Hybrid: used when the column has a large number of distinct values.
- Top frequency: used when the column has a large number of distinct values, but a few of them account for most of the rows.
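To inspect the buckets of the FREQUENCY histogram just created on WEIGHT, one could query the user_tab_histograms view (an optional check, not part of the original lesson):

-- Each bucket of a frequency histogram records a distinct value and a cumulative row count
select endpoint_value  as weight,
       endpoint_number as cumulative_rows
from   user_tab_histograms
where  table_name  = 'BRICKS'
and    column_name = 'WEIGHT'
order  by endpoint_number;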
The following command disables histogram collection for all columns; the default method_opt is 'for all columns size auto':
exec dbms_stats.gather_table_stats ( null, 'bricks', method_opt => 'for all columns size 1' ) ;
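If you want this behaviour to persist across future gathers, a table preference could be set instead (a sketch, not part of the original lesson):

-- Store METHOD_OPT as a per-table preference so later gathers also skip histograms
exec dbms_stats.set_table_prefs ( null, 'bricks', 'METHOD_OPT', 'for all columns size 1' )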
Extended statistics to address correlated columns
Finally, let's look at correlated columns, which require extended statistics.
In the following example, the number of rows estimated by the optimizer is obviously inaccurate:
select /*+ gather_plan_statistics */ count (*)
from   bricks
where  colour_rgb_value = 'FF0000'
and    shape = 'cylinder';

select * from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));

----------------------------------------------------------------
| Id  | Operation          | Name   | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |      1 |        |      1 |
|   1 |  SORT AGGREGATE    |        |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL| BRICKS |      1 |     33 |     99 |
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("COLOUR_RGB_VALUE"='FF0000' AND "SHAPE"='cylinder'))
This is because the bricks table has 300 rows, 3 colours, and 3 shapes, so the optimizer assumes 9 independent colour/shape combinations and estimates 300 / 9, roughly 33 rows. In fact, the way the data was generated means each shape has exactly one colour, so the estimate is off.
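A quick way to verify this (not part of the original lesson) is to group by both columns; because of how the data was generated, only 3 of the 9 theoretical combinations actually occur:

-- Each shape maps to exactly one colour, so only 3 colour/shape combinations exist
select colour_rgb_value, shape, count(*) as row_count
from   bricks
group  by colour_rgb_value, shape
order  by colour_rgb_value, shape;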
Generate extended statistics with the following SQL:
select dbms_stats.create_extended_stats ( null, 'bricks', '(colour_rgb_value, shape)' ) from dual;

exec dbms_stats.gather_table_stats ( null, 'bricks', method_opt => 'for columns (colour_rgb_value, shape)', no_invalidate => false ) ;
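The column group created above is stored as a hidden virtual column with a system-generated name; one way to confirm it exists (an optional check, not part of the original lesson):

-- List the extensions (column groups) registered on BRICKS
select extension_name, extension
from   user_stat_extensions
where  table_name = 'BRICKS';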
This time, the estimate is accurate:
----------------------------------------------------------------
| Id  | Operation          | Name   | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |      1 |        |      1 |
|   1 |  SORT AGGREGATE    |        |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL| BRICKS |      1 |     99 |     99 |
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("COLOUR_RGB_VALUE"='FF0000' AND "SHAPE"='cylinder'))
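If the column group is no longer needed, it can be removed again; a sketch (not part of the original lesson):

-- drop_extended_stats removes the column group created by create_extended_stats
exec dbms_stats.drop_extended_stats ( null, 'bricks', '(colour_rgb_value, shape)' )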