Oracle database statistics experiment

Keywords: Oracle

This experiment comes from the Oracle developer performance course, Module 2: What are Database Statistics? It is a reproduction and interpretation of that experiment.

First, create two tables for the experiment, namely bricks and colors, and generate statistical information:

-- Per the Database Reference, the default value of statistics_level is
-- TYPICAL; setting it to ALL additionally collects timed operating system
-- statistics and plan execution statistics.
alter session set statistics_level = all;
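
-- Optionally verify the setting; v$parameter is a standard dynamic
-- performance view, so this is just a sanity check:
select value from v$parameter where name = 'statistics_level';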

create table bricks (
  brick_id         integer not null primary key,
  colour_rgb_value varchar2(10) not null,
  shape            varchar2(10) not null,
  weight           integer not null
);

create table colours (
  colour_rgb_value varchar2(10) not null,
  colour_name      varchar2(10) not null
);

insert into colours values ( 'FF0000', 'red' );
insert into colours values ( '00FF00', 'green' );
insert into colours values ( '0000FF', 'blue' );

insert into bricks
  select rownum,
         case mod ( level, 3 )
           when 0 then 'FF0000'
           when 1 then '00FF00'
           when 2 then '0000FF'
         end,
         case mod ( level, 3 )
           when 0 then 'cylinder'
           when 1 then 'cube'
           when 2 then 'pyramid'
         end,
         floor ( 100 / rownum )
  from   dual
  connect by level <= 100;
  
insert into bricks
  select rownum + 1000,
         case mod ( level, 3 )
           when 0 then 'FF0000'
           when 1 then '00FF00'
           when 2 then '0000FF'
         end,
         case mod ( level, 3 )
           when 0 then 'cylinder'
           when 1 then 'cube'
           when 2 then 'pyramid'
         end,
         floor ( 200 / rownum )
  from   dual
  connect by level <= 200;

commit;

declare
  stats    dbms_stats.statrec;
  distcnt  number;
  density  number;
  nullcnt  number;
  avgclen  number;
begin

  -- Gather correct statistics first
  dbms_stats.gather_table_stats ( null, 'colours' );
  dbms_stats.gather_table_stats ( null, 'bricks' );

  -- Overwrite the row counts with misleading values
  dbms_stats.set_table_stats ( null, 'bricks', numrows => 30 );
  dbms_stats.set_table_stats ( null, 'colours', numrows => 3000 );

  -- Read the current column statistics into PL/SQL variables
  dbms_stats.get_column_stats ( null, 'colours', 'colour_rgb_value', 
    distcnt => distcnt, 
    density => density,
    nullcnt => nullcnt, 
    avgclen => avgclen,
    srec => stats
  );

  -- Keep the real low/high values, but claim 10 distinct values
  stats.minval := utl_raw.cast_to_raw ( '0000FF' );
  stats.maxval := utl_raw.cast_to_raw ( 'FF0000' );
  dbms_stats.set_column_stats ( null, 'colours', 'colour_rgb_value', distcnt => 10, srec => stats );
  dbms_stats.set_column_stats ( null, 'bricks', 'colour_rgb_value', distcnt => 10, srec => stats );

end;
/

After inserting the data, the colours table has only three rows:

select /*ansiconsole*/ * from colours;

COLOUR_RGB_VALUE   COLOUR_NAME
FF0000             red
00FF00             green
0000FF             blue

The bricks table is populated by two inserts: the first adds 100 rows with brick_id from 1 to 100; the second adds 200 rows with brick_id from 1001 to 1200. There are 3 colours, cycling with brick_id, and likewise 3 shapes (cylinder, cube, pyramid). The weights are floor(100 / rownum) and floor(200 / rownum) respectively, so the weight distribution is very uneven.
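
A quick sanity check of this description (300 rows in total, brick_id spanning 1 to 1200):

select count (*), min ( brick_id ), max ( brick_id ) from bricks;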

After inserting the data, gather_table_stats generates statistics for the tables and their columns (the get_column_stats call merely reads the current column statistics into PL/SQL variables before they are modified). At this point the statistics are correct:

select ut.table_name, ut.num_rows, 
       utcs.column_name, utcs.num_distinct, 
       case utc.data_type
         when 'VARCHAR2' then
           utl_raw.cast_to_varchar2 ( utcs.low_value ) 
        when 'NUMBER' then
           to_char ( utl_raw.cast_to_number ( utcs.low_value ) )
       end low_val, 
       case utc.data_type
         when 'VARCHAR2' then
           utl_raw.cast_to_varchar2 ( utcs.high_value ) 
         when 'NUMBER' then
           to_char ( utl_raw.cast_to_number ( utcs.high_value ) )
       end high_val
from   user_tables ut 
join   user_tab_cols utc
on     ut.table_name = utc.table_name
join   user_tab_col_statistics utcs
on     ut.table_name = utcs.table_name
and    utc.column_name = utcs.column_name
and ut.table_name in ('BRICKS', 'COLOURS')
order  by ut.table_name, utcs.column_name;

   TABLE_NAME    NUM_ROWS         COLUMN_NAME    NUM_DISTINCT    LOW_VAL    HIGH_VAL
_____________ ___________ ___________________ _______________ __________ ___________
BRICKS                300 BRICK_ID                        300 1          1200
BRICKS                300 COLOUR_RGB_VALUE                  3 0000FF     FF0000
BRICKS                300 SHAPE                             3 cube       pyramid
BRICKS                300 WEIGHT                           27 1          200
COLOURS                 3 COLOUR_NAME                       3 blue       red
COLOURS                 3 COLOUR_RGB_VALUE                  3 0000FF     FF0000

6 rows selected.

To mislead the optimizer, set_table_stats and set_column_stats are then used to overwrite the table and column statistics. The modified statistics are as follows:

   TABLE_NAME    NUM_ROWS         COLUMN_NAME    NUM_DISTINCT    LOW_VAL    HIGH_VAL
_____________ ___________ ___________________ _______________ __________ ___________
BRICKS                 30 BRICK_ID                        300 1          1200
BRICKS                 30 COLOUR_RGB_VALUE                 10 0000FF     FF0000
BRICKS                 30 SHAPE                             3 cube       pyramid
BRICKS                 30 WEIGHT                           27 1          200
COLOURS              3000 COLOUR_NAME                       3 blue       red
COLOURS              3000 COLOUR_RGB_VALUE                 10 0000FF     FF0000

6 rows selected.

The null passed as the first (ownname) parameter in the calls above means the current schema is used.
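
Equivalently, you can pass the schema explicitly; a small sketch using the current user:

exec dbms_stats.gather_table_stats ( ownname => user, tabname => 'BRICKS' ) ;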

The following statistics were deliberately modified:

  • num_rows of the bricks table changed from 300 to 30
  • num_rows of the colours table changed from 3 to 3000
  • num_distinct of the bricks table's colour_rgb_value column changed from 3 to 10
  • num_distinct of the colours table's colour_rgb_value column changed from 3 to 10

The statistics now deviate substantially from the actual data.

Row statistics

Execute the following SQL and get the execution plan:

set pages 9999
set lines 120

select /*+ gather_plan_statistics */ c.colour_name, count (*)
from   bricks b
join   colours c
on     c.colour_rgb_value = b.colour_rgb_value
group  by c.colour_name;

select * from table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));

------------------------------------------------------------------
| Id  | Operation           | Name    | Starts | E-Rows | A-Rows |
------------------------------------------------------------------
|   0 | SELECT STATEMENT    |         |      1 |        |      3 |
|   1 |  HASH GROUP BY      |         |      1 |      3 |      3 |
|*  2 |   HASH JOIN         |         |      1 |   9000 |    300 |
|   3 |    TABLE ACCESS FULL| BRICKS  |      1 |     30 |    300 |
|   4 |    TABLE ACCESS FULL| COLOURS |      1 |   3000 |      3 |
------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("C"."COLOUR_RGB_VALUE"="B"."COLOUR_RGB_VALUE")

Let's start with the first important principle:

The optimizer prefers to start a join with the table returning the fewest rows.

In other words, the optimizer prefers to begin the join with the table that returns the fewest rows. In the plan, A-Rows is the number of rows actually returned. This plan clearly violates the principle: it starts with bricks, which returns 300 rows, and then processes colours, which returns only 3. The optimizer was misled by the skewed statistics. Note that "returns the fewest rows" is measured after the where conditions are applied: a table with 100 million rows that returns only a few rows still counts as returning few rows.

E-Rows is the optimizer's estimated row count, and Starts is the number of times the step executed. The second principle: a plan is good when E-Rows * Starts is close to A-Rows for each step. Here the hash join at step 2 estimates 9000 * 1 = 9000 rows but actually returns 300, a 30x overestimate.

Therefore, we need to correct the statistics:

exec dbms_stats.gather_table_stats ( null, 'colours' ) ;
exec dbms_stats.gather_table_stats ( null, 'bricks' ) ;

SELECT
    ut.table_name,
    ut.num_rows,
    utcs.column_name,
    utcs.num_distinct
FROM
         user_tables ut
    JOIN user_tab_col_statistics utcs ON ut.table_name = utcs.table_name
WHERE
    ut.table_name IN ( 'COLOURS', 'BRICKS' );

   TABLE_NAME    NUM_ROWS         COLUMN_NAME    NUM_DISTINCT
_____________ ___________ ___________________ _______________
BRICKS                300 BRICK_ID                        300
BRICKS                300 COLOUR_RGB_VALUE                  3
BRICKS                300 SHAPE                             3
BRICKS                300 WEIGHT                           27
COLOURS                 3 COLOUR_RGB_VALUE                  3
COLOURS                 3 COLOUR_NAME                       3

6 rows selected.

Oracle can automatically collect statistics in two cases:

  • Daily maintenance window
  • When the number of rows changed in a table exceeds a staleness threshold (10% by default)

This threshold can be changed per table. The first statement below reads the current value; the second sets it to 1% for the colours table:

select dbms_stats.get_prefs ( 'STALE_PERCENT', null, 'colours' ) from dual;
exec dbms_stats.set_table_prefs ( null, 'colours', 'STALE_PERCENT', 1 );
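
To check whether Oracle currently considers a table's statistics stale, one option (a sketch) is the STALE_STATS column of user_tab_statistics; recent DML may only show up after flushing the in-memory monitoring information:

exec dbms_stats.flush_database_monitoring_info ;

select table_name, stale_stats
from   user_tab_statistics
where  table_name in ( 'BRICKS', 'COLOURS' );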

Updating statistics may cause the optimizer to generate a new execution plan for a statement, but the optimizer decides when to invalidate existing plans, so plans are not necessarily refreshed immediately after statistics change. You can force immediate invalidation with:

exec dbms_stats.gather_table_stats ( null, 'colours', no_invalidate => false ) ;
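
To verify the effect, you can re-run the join from earlier and display its row statistics again; with accurate statistics the optimizer should now build the hash join starting from the small colours table (plan output not shown here):

select /*+ gather_plan_statistics */ c.colour_name, count (*)
from   bricks b
join   colours c
on     c.colour_rgb_value = b.colour_rgb_value
group  by c.colour_name;

select * from table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));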

At this point, the row statistics (num_rows and num_distinct) are correct again.

Column histograms to solve the problem of data skew

Let's take a look at Data Skew.

Execute the following SQL:

SELECT /*+ gather_plan_statistics */ COUNT(*)
FROM
    bricks
WHERE
    weight = 1;

SELECT
    *
FROM
    TABLE ( dbms_xplan.display_cursor(format => 'ROWSTATS LAST') );

----------------------------------------------------------------
| Id  | Operation          | Name   | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |      1 |        |      1 |
|   1 |  SORT AGGREGATE    |        |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL| BRICKS |      1 |     11 |    150 |
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("WEIGHT"=1)

Clearly this is not a good execution plan: A-Rows is 150 rows, yet E-Rows estimates only 11. The reason is that the optimizer assumes the values are evenly distributed across the column: 300 rows with 27 distinct weights gives roughly 11 rows per value. This example demonstrates value skew:

SQL> select floor(count(*) / count(distinct weight)) from bricks;

   FLOOR(COUNT(*)/COUNT(DISTINCTWEIGHT))
________________________________________
                                      11
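
The skew is easy to see by listing the most frequent weights; weight = 1 alone accounts for 150 of the 300 rows, matching A-Rows above:

select weight, count (*)
from   bricks
group  by weight
order  by count (*) desc
fetch  first 3 rows only;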

There is another form of data skew, related to how values are distributed across their range: range skew. For example:

select /*+ gather_plan_statistics */count (*) from bricks
where  brick_id between 0 and 100;

select * 
from   table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));

select /*+ gather_plan_statistics */count (*) from bricks
where  brick_id between 400 and 500;

select * 
from   table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));

select /*+ gather_plan_statistics */count (*) from bricks
where  brick_id between 1000 and 1100;

select * 
from   table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));

Their execution plans are as follows:

--------------------------------------------------------------------
| Id  | Operation         | Name        | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |      1 |        |      1 |
|   1 |  SORT AGGREGATE   |             |      1 |      1 |      1 |
|*  2 |   INDEX RANGE SCAN| SYS_C008566 |      1 |     26 |    100 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("BRICK_ID">=0 AND "BRICK_ID"<=100)

...

--------------------------------------------------------------------
| Id  | Operation         | Name        | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |      1 |        |      1 |
|   1 |  SORT AGGREGATE   |             |      1 |      1 |      1 |
|*  2 |   INDEX RANGE SCAN| SYS_C008566 |      1 |     27 |      0 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("BRICK_ID">=400 AND "BRICK_ID"<=500)

...

--------------------------------------------------------------------
| Id  | Operation         | Name        | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |      1 |        |      1 |
|   1 |  SORT AGGREGATE   |             |      1 |      1 |      1 |
|*  2 |   INDEX RANGE SCAN| SYS_C008566 |      1 |     27 |    100 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("BRICK_ID">=1000 AND "BRICK_ID"<=1100)

Why are the estimates 26 or 27? Without a histogram, the optimizer assumes brick_id is spread evenly between its low and high values. Each query covers a range of 100, which is roughly 1/12 of the full 1 to 1200 range, and there are 300 brick_ids in total, so the estimate is about 300 / 12 = 25, close to the 26 and 27 shown.

select min(brick_id), max(brick_id), count(distinct(brick_id)) from bricks;
MIN(BRICK_ID) MAX(BRICK_ID) COUNT(DISTINCT(BRICK_ID))
------------- ------------- -------------------------
            1          1200                       300

The actual distribution is as follows, which is consistent with A-Rows:

with rws as (
  select level r from dual
  connect by level <= 15
)
  select (r-1) * 100 as lower_limit, r * 100 as upper_limit, count ( brick_id ) 
  from   rws
  left   join bricks
  on     ceil ( brick_id / 100 ) = r
  group  by r
  order  by r;

LOWER_LIMIT UPPER_LIMIT COUNT(BRICK_ID)
----------- ----------- ---------------
          0         100             100
        100         200               0
        200         300               0
        300         400               0
        400         500               0
        500         600               0
        600         700               0
        700         800               0
        800         900               0
        900        1000               0
       1000        1100             100
       1100        1200             100
       1200        1300               0
       1300        1400               0
       1400        1500               0

15 rows selected. 

To solve these problems, you need to generate histograms for the affected columns. First confirm there are currently no histograms: HISTOGRAM is NONE and NUM_BUCKETS equals 1:

select utcs.column_name, utcs.histogram, utcs.num_buckets
from   user_tables ut
join   user_tab_col_statistics utcs
on     ut.table_name = utcs.table_name
where  ut.table_name = 'BRICKS'
and    utcs.column_name in ( 'BRICK_ID', 'WEIGHT' );

   COLUMN_NAME    HISTOGRAM    NUM_BUCKETS
______________ ____________ ______________
BRICK_ID       NONE                      1
WEIGHT         NONE                      1

Gathering statistics again now creates histograms on these columns automatically. Why automatically this time? Because the data has value and range skew, and the three statements executed earlier used the columns in ways that satisfy the conditions listed below:

exec dbms_stats.gather_table_stats ( null, 'bricks', no_invalidate => false ) ;

select utcs.column_name, utcs.histogram, utcs.num_buckets
from   user_tables ut
join   user_tab_col_statistics utcs
on     ut.table_name = utcs.table_name
where  ut.table_name = 'BRICKS'
and    utcs.column_name in ( 'BRICK_ID', 'WEIGHT' );

   COLUMN_NAME    HISTOGRAM    NUM_BUCKETS
______________ ____________ ______________
BRICK_ID       HYBRID                  254
WEIGHT         FREQUENCY                27

These conditions are:

  1. The column has value skew and statements use the column in range (<, >=, etc.), LIKE, or equality conditions.
  2. The column has range skew and statements use the column in range or LIKE conditions.
  3. The column has few distinct values (with some repeated values) and statements use the column in range (<, >=, etc.), LIKE, or equality conditions.
  4. Histograms may also be created when incremental statistics are in use, even if there is no skew; these are ignored by the optimizer and are out of scope for this tutorial.
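
Oracle knows which predicates have been run against which columns because it tracks column usage. A sketch of inspecting that tracking with the documented dbms_stats.report_col_usage function, which returns a CLOB describing the predicate types recorded per column:

select dbms_stats.report_col_usage ( user, 'BRICKS' ) from dual;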

Histograms come in four types:

  • Frequency: for columns with few distinct values (no more than the number of buckets).
  • Height balanced: obsolete (legacy type from older releases).
  • Hybrid: for columns with many distinct values.
  • Top frequency: for columns with many distinct values, where the most frequent values account for nearly all rows.
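
To inspect the buckets themselves, you can query user_tab_histograms (a sketch; for a frequency histogram, the difference between consecutive endpoint_number values is the row count for that endpoint's value):

select endpoint_value, endpoint_number
from   user_tab_histograms
where  table_name = 'BRICKS'
and    column_name = 'WEIGHT'
order  by endpoint_number;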

The following SQL disables histogram collection for all columns of the table (size 1 means a single bucket, i.e. no histogram); the default method_opt is 'for all columns size auto':

exec dbms_stats.gather_table_stats ( null, 'bricks', method_opt => 'for all columns size 1' ) ;
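
To make this choice persist for future gathers, instead of passing method_opt every time, one option (a sketch) is a table preference:

exec dbms_stats.set_table_prefs ( null, 'bricks', 'METHOD_OPT', 'for all columns size 1' ) ;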

Extended statistics to solve the problem of correlated columns

Finally, let's look at correlated columns, which require extended statistics.

In the following example, the number of rows estimated by the optimizer is obviously inaccurate:

select /*+ gather_plan_statistics */count (*)
from   bricks
where  colour_rgb_value = 'FF0000'
and    shape = 'cylinder';

select * 
from   table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));

----------------------------------------------------------------
| Id  | Operation          | Name   | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |      1 |        |      1 |
|   1 |  SORT AGGREGATE    |        |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL| BRICKS |      1 |     33 |     99 |
----------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(("COLOUR_RGB_VALUE"='FF0000' AND "SHAPE"='cylinder'))

This is because the bricks table has 300 rows, 3 colours, and 3 shapes, giving 9 possible colour/shape combinations. The estimate is therefore 300 / 9, about 33. In reality, by the way the data was generated, each shape appears in exactly one colour, hence the large deviation.
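
You can confirm the correlation by counting the combinations that actually occur; only 3 of the 9 possible colour/shape pairs exist in the data:

select colour_rgb_value, shape, count (*)
from   bricks
group  by colour_rgb_value, shape
order  by colour_rgb_value;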

Generate extended statistics with the following SQL:

select dbms_stats.create_extended_stats ( null, 'bricks', '(colour_rgb_value, shape)' )
from   dual;

exec dbms_stats.gather_table_stats ( null, 'bricks', method_opt => 'for columns (colour_rgb_value, shape)', no_invalidate => false ) ;
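
The extension appears as a hidden virtual column on the table; a quick way to list it is the user_stat_extensions view:

select extension_name, extension
from   user_stat_extensions
where  table_name = 'BRICKS';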

This time, re-running the query shows an accurate estimate:

----------------------------------------------------------------
| Id  | Operation          | Name   | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |      1 |        |      1 |
|   1 |  SORT AGGREGATE    |        |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL| BRICKS |      1 |     99 |     99 |
----------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(("COLOUR_RGB_VALUE"='FF0000' AND "SHAPE"='cylinder'))  
