This article is about Learning Guide for Big Data Specialists from Zero (Full Upgrade) ClickHouse: Partial supplement.
0 ClickHouse syntax optimization rules
ClickHouse's SQL optimization rules are based on RBO(Rule Based Optimization). Here are some optimization rules
1 Prepare test tables
1) Upload official datasets
Upload visits_v1.tar and hits_v1.tar to the virtual machine and extract them to the clickhouse data path
// Unzip to clickhouse data path sudo tar -xvf hits_v1.tar -C /var/lib/clickhouse sudo tar -xvf visits_v1.tar -C /var/lib/clickhouse //Modify Owned User sudo chown -R clickhouse:clickhouse /var/lib/clickhouse/data/datasets sudo chown -R clickhouse:clickhouse /var/lib/clickhouse/metadata/datasets
2) Restart clickhouse-server
sudo clickhouse restart
3) Execute queries
clickhouse-client --query "SELECT COUNT(*) FROM datasets.hits_v1"
clickhouse-client --query "SELECT COUNT(*) FROM datasets.visits_v1"
Note: The official tar package contains library building, table building statements, data content, which is the most convenient way without manual library building and table building.
The hits_v1 table has more than 130 fields and 8.8 million data
The visits_v1 table has more than 180 fields and more than 16 million pieces of data
2 COUNT optimization
When the count function is called, if count() or count(*) is used and there is no where condition, then
The total_rows of system.tables are used directly, for example:
EXPLAIN SELECT count()FROM datasets.hits_v1; Union Expression (Projection) Expression (Before ORDER BY and SELECT) MergingAggregated ReadNothing (Optimized trivial count)
Notice the Optimized trivial count, which is an optimization of count.
This optimization will not be used if the count specific column field:
EXPLAIN SELECT count(CounterID) FROM datasets.hits_v1; Union Expression (Projection) Expression (Before ORDER BY and SELECT) Aggregating Expression (Before GROUP BY) ReadFromStorage (Read from MergeTree)
3 Eliminate subquery duplicate fields
The following statement has two duplicate id fields in the subquery that will be de-duplicated:
EXPLAIN SYNTAX SELECT a.UserID, b.VisitID, a.URL, b.UserID FROM hits_v1 AS a LEFT JOIN ( SELECT UserID, UserID as HaHa, VisitID FROM visits_v1) AS b USING (UserID) limit 3; //Return the optimization statement: SELECT UserID, VisitID, URL, b.UserID FROM hits_v1 AS a ALL LEFT JOIN ( SELECT UserID, VisitID FROM visits_v1 ) AS b USING (UserID) LIMIT 3
4 Predicate Push Down
When a group by has a having clause but is not decorated with cube, with rollup, or with totals, the having filter pushes down to where to filter ahead. For example, in the query below, HAVING name becomes WHERE name, which is filtered before group by:
EXPLAIN SYNTAX SELECT UserID FROM hits_v1 GROUP BY UserID HAVING UserID = '8585742290196126178'; //Return optimization statement SELECT UserID FROM hits_v1 WHERE UserID = \'8585742290196126178\' GROUP BY UserID
Subqueries also support predicate pushdown:
EXPLAIN SYNTAX SELECT * FROM ( SELECT UserID FROM visits_v1 ) WHERE UserID = '8585742290196126178' //Return optimized statement SELECT UserID FROM ( SELECT UserID FROM visits_v1 WHERE UserID = \'8585742290196126178\' ) WHERE UserID = \'8585742290196126178\'
Another complex example:
EXPLAIN SYNTAX SELECT * FROM ( SELECT * FROM ( SELECT UserID FROM visits_v1) UNION ALL SELECT * FROM ( SELECT UserID FROM visits_v1) ) WHERE UserID = '8585742290196126178' //Return optimized statement SELECT UserID FROM ( SELECT UserID FROM ( SELECT UserID FROM visits_v1 WHERE UserID = \'8585742290196126178\' ) WHERE UserID = \'8585742290196126178\' UNION ALL SELECT UserID FROM ( SELECT UserID FROM visits_v1 WHERE UserID = \'8585742290196126178\' ) WHERE UserID = \'8585742290196126178\' ) WHERE UserID = \'8585742290196126178\'
5 Aggregate calculation extrapolation
Calculations within aggregate functions are extrapolated, for example:
EXPLAIN SYNTAX SELECT sum(UserID * 2) FROM visits_v1 //Return optimized statement SELECT sum(UserID) * 2 FROM visits_v1
6 Aggregate function elimination
If the min, max, any aggregate function is used for the aggregate key, that is, group by key, then the function is eliminated.
For example:
EXPLAIN SYNTAX SELECT sum(UserID * 2), max(VisitID), max(UserID) FROM visits_v1 GROUP BY UserID //Return optimized statement SELECT sum(UserID) * 2, max(VisitID), UserID FROM visits_v1 GROUP BY UserID
7 Delete duplicate order by key
For example, in the following statement, duplicate aggregate key id fields are deduplicated:
EXPLAIN SYNTAX SELECT * FROM visits_v1 ORDER BY UserID ASC, UserID ASC, VisitID ASC, VisitID ASC //Returns the optimized statement: select ...... FROM visits_v1 ORDER BY UserID ASC, VisitID ASC
8 Remove duplicate limit by key
For example, in the following statement, the name field of a duplicate declaration is deduplicated:
EXPLAIN SYNTAX SELECT * FROM visits_v1 LIMIT 3 BY VisitID, VisitID LIMIT 10 //Returns the optimized statement: select ...... FROM visits_v1 LIMIT 3 BY VisitID LIMIT 10
9 Delete duplicate USING Key
For example, in the following statement, duplicate key id fields are deduplicated:
EXPLAIN SYNTAX SELECT a.UserID, a.UserID, b.VisitID, a.URL, b.UserID FROM hits_v1 AS a LEFT JOIN visits_v1 AS b USING (UserID, UserID) //Returns the optimized statement: SELECT UserID, UserID, VisitID, URL, b.UserID FROM hits_v1 AS a ALL LEFT JOIN visits_v1 AS b USING (UserID)
10 Scalar Replacement
If the subquery returns only one row of data, replace it with a scalar when referenced, such as the total_disk_usage field in the following statement:
EXPLAIN SYNTAX WITH ( SELECT sum(bytes) FROM system.parts WHERE active ) AS total_disk_usage SELECT (sum(bytes) / total_disk_usage) * 100 AS table_disk_usage, table FROM system.parts GROUP BY table ORDER BY table_disk_usage DESC LIMIT 10; //Returns the optimized statement: WITH CAST(0, \'UInt64\') AS total_disk_usage SELECT (sum(bytes) / total_disk_usage) * 100 AS table_disk_usage, table FROM system.parts GROUP BY table ORDER BY table_disk_usage DESC LIMIT 10
11 Ternary Operational Optimization
If the optimize_if_chain_to_multiif parameter is turned on, the ternary operator is replaced by the multiIf function.
For example:
EXPLAIN SYNTAX SELECT number = 1 ? 'hello' : (number = 2 ? 'world' : 'atguigu') FROM numbers(10) settings optimize_if_chain_to_multiif = 1; //Returns the optimized statement: SELECT multiIf(number = 1, \'hello\', number = 2, \'world\', \'atguigu\') FROM numbers(10) SETTINGS optimize_if_chain_to_multiif = 1