[Hard ClickHouse] ClickHouse Advanced Optimize ClickHouse Syntax Optimize Rules

Keywords: SQL clickhouse

This article is about Learning Guide for Big Data Specialists from Zero (Full Upgrade) ClickHouse: Partial supplement.

0 ClickHouse syntax optimization rules

ClickHouse's SQL optimization rules are based on RBO(Rule Based Optimization). Here are some optimization rules

1 Prepare test tables

1) Upload official datasets

Upload visits_v1.tar and hits_v1.tar to the virtual machine and extract them to the clickhouse data path

// Unzip to clickhouse data path
sudo tar -xvf hits_v1.tar -C /var/lib/clickhouse
sudo tar -xvf visits_v1.tar -C /var/lib/clickhouse
//Modify Owned User
sudo chown -R clickhouse:clickhouse /var/lib/clickhouse/data/datasets
sudo chown -R clickhouse:clickhouse /var/lib/clickhouse/metadata/datasets

2) Restart clickhouse-server

sudo clickhouse restart

3) Execute queries

clickhouse-client --query "SELECT COUNT(*) FROM datasets.hits_v1"

clickhouse-client --query "SELECT COUNT(*) FROM datasets.visits_v1"

Note: The official tar package contains library building, table building statements, data content, which is the most convenient way without manual library building and table building.

The hits_v1 table has more than 130 fields and 8.8 million data

The visits_v1 table has more than 180 fields and more than 16 million pieces of data

2 COUNT optimization

When the count function is called, if count() or count(*) is used and there is no where condition, then

The total_rows of system.tables are used directly, for example:

EXPLAIN SELECT count()FROM datasets.hits_v1;
Union
 Expression (Projection)
     Expression (Before ORDER BY and SELECT)
         MergingAggregated
             ReadNothing (Optimized trivial count) 

Notice the Optimized trivial count, which is an optimization of count.

This optimization will not be used if the count specific column field:

EXPLAIN SELECT count(CounterID) FROM datasets.hits_v1;
Union
 Expression (Projection)
     Expression (Before ORDER BY and SELECT)
         Aggregating
             Expression (Before GROUP BY)
                 ReadFromStorage (Read from MergeTree)

3 Eliminate subquery duplicate fields

The following statement has two duplicate id fields in the subquery that will be de-duplicated:

EXPLAIN SYNTAX SELECT 
     a.UserID,
     b.VisitID,
     a.URL,
     b.UserID
     FROM
     hits_v1 AS a 
     LEFT JOIN ( 
     SELECT 
         UserID, 
         UserID as HaHa, 
         VisitID 
 FROM visits_v1) AS b 
 USING (UserID)
 limit 3;
//Return the optimization statement:
SELECT 
     UserID,
     VisitID,
     URL,
     b.UserID
FROM hits_v1 AS a
ALL LEFT JOIN 
(
 SELECT 
 UserID,
 VisitID
 FROM visits_v1
) AS b USING (UserID)
LIMIT 3

4 Predicate Push Down

When a group by has a having clause but is not decorated with cube, with rollup, or with totals, the having filter pushes down to where to filter ahead. For example, in the query below, HAVING name becomes WHERE name, which is filtered before group by:

EXPLAIN SYNTAX SELECT UserID FROM hits_v1 GROUP BY UserID HAVING UserID = 
'8585742290196126178';
//Return optimization statement
SELECT UserID
FROM hits_v1
WHERE UserID = \'8585742290196126178\'
GROUP BY UserID

Subqueries also support predicate pushdown:

EXPLAIN SYNTAX
SELECT *
FROM 
(
 SELECT UserID
 FROM visits_v1
)
WHERE UserID = '8585742290196126178'
//Return optimized statement
SELECT UserID
FROM 
(
 SELECT UserID
 FROM visits_v1
 WHERE UserID = \'8585742290196126178\' 
)
WHERE UserID = \'8585742290196126178\'

Another complex example:

EXPLAIN SYNTAX 
SELECT * FROM (
 SELECT 
 * 
 FROM 
 (
 SELECT 
 UserID 
 FROM visits_v1) 
 UNION ALL 
 SELECT 
 *
 FROM 
 (
 SELECT 
 UserID 
 FROM visits_v1)
)
WHERE UserID = '8585742290196126178'
//Return optimized statement
SELECT UserID
FROM 
(
 SELECT UserID
 FROM 
 (
 SELECT UserID
 FROM visits_v1
 WHERE UserID = \'8585742290196126178\'
 )
 WHERE UserID = \'8585742290196126178\'
 UNION ALL
 SELECT UserID
 FROM 
 (
 SELECT UserID
 FROM visits_v1
 WHERE UserID = \'8585742290196126178\'
 )
 WHERE UserID = \'8585742290196126178\' )
WHERE UserID = \'8585742290196126178\'

5 Aggregate calculation extrapolation

Calculations within aggregate functions are extrapolated, for example:

EXPLAIN SYNTAX
SELECT sum(UserID * 2)
FROM visits_v1
//Return optimized statement
SELECT sum(UserID) * 2
FROM visits_v1

6 Aggregate function elimination

If the min, max, any aggregate function is used for the aggregate key, that is, group by key, then the function is eliminated.

For example:

EXPLAIN SYNTAX
SELECT
     sum(UserID * 2),
     max(VisitID),
     max(UserID)
FROM visits_v1
GROUP BY UserID
//Return optimized statement
SELECT 
     sum(UserID) * 2,
     max(VisitID),
     UserID
FROM visits_v1
GROUP BY UserID

7 Delete duplicate order by key

For example, in the following statement, duplicate aggregate key id fields are deduplicated:

EXPLAIN SYNTAX
SELECT *
FROM visits_v1
ORDER BY
     UserID ASC,
     UserID ASC,
     VisitID ASC,
     VisitID ASC
//Returns the optimized statement:
select
    ......
FROM visits_v1
ORDER BY 
    UserID ASC,
    VisitID ASC

8 Remove duplicate limit by key

For example, in the following statement, the name field of a duplicate declaration is deduplicated:

EXPLAIN SYNTAX
SELECT *
FROM visits_v1
LIMIT 3 BY
     VisitID,
     VisitID
LIMIT 10
//Returns the optimized statement:
select
    ......
FROM visits_v1
LIMIT 3 BY VisitID
LIMIT 10

9 Delete duplicate USING Key

For example, in the following statement, duplicate key id fields are deduplicated:

EXPLAIN SYNTAX
SELECT
     a.UserID,
     a.UserID,
     b.VisitID,
     a.URL,
     b.UserID
FROM hits_v1 AS a
LEFT JOIN visits_v1 AS b USING (UserID, UserID)


//Returns the optimized statement:
SELECT 
     UserID,
     UserID,
     VisitID,
     URL,
     b.UserID
FROM hits_v1 AS a
ALL LEFT JOIN visits_v1 AS b USING (UserID)

10 Scalar Replacement

If the subquery returns only one row of data, replace it with a scalar when referenced, such as the total_disk_usage field in the following statement:

EXPLAIN SYNTAX
WITH 
 (
 SELECT sum(bytes)
 FROM system.parts
 WHERE active
 ) AS total_disk_usage
SELECT
 (sum(bytes) / total_disk_usage) * 100 AS table_disk_usage,
 table
FROM system.parts
GROUP BY table
ORDER BY table_disk_usage DESC
LIMIT 10; 

//Returns the optimized statement:
WITH CAST(0, \'UInt64\') AS total_disk_usage
SELECT 
 (sum(bytes) / total_disk_usage) * 100 AS table_disk_usage,
 table
FROM system.parts
GROUP BY table
ORDER BY table_disk_usage DESC
LIMIT 10

11 Ternary Operational Optimization

If the optimize_if_chain_to_multiif parameter is turned on, the ternary operator is replaced by the multiIf function.

For example:

EXPLAIN SYNTAX 
SELECT number = 1 ? 'hello' : (number = 2 ? 'world' : 'atguigu') 
FROM numbers(10) 
settings optimize_if_chain_to_multiif = 1;

//Returns the optimized statement:
SELECT multiIf(number = 1, \'hello\', number = 2, \'world\', \'atguigu\')
FROM numbers(10)
SETTINGS optimize_if_chain_to_multiif = 1

Posted by nicandre on Wed, 22 Sep 2021 11:12:40 -0700