A summary of the use of clickhouse

Keywords: Big Data SQL clickhouse bitmap

It is said that clickhouse is a columnar storage database used in olap scenarios with a large amount of data. Fortunately, it can also be used in actual scenarios. Let's talk about the simple experience of using this article.


1. Overall description

Not much about architecture, column storage, large amount of data and high performance. See official document address: https://clickhouse.com/docs/en/

For users, in addition to the general architecture, it is more about how to use it.

On the whole, the use of clickhouse basically complies with the ordinary sql specification, so basically as long as you can write sql, there is no problem with ordinary addition, deletion, modification and query. In other words, the problem is not big for business.

For example: create table; select xx from t; insert into table xx... ; alter table xx update x=xx...; (of course, this usage is a little different); alter table xx delete where x=xx... (similarly);


2. Storage engine Brief

The biggest feature of a database should be its storage engine or storage mode. This is more obvious in clickhouse. It has a super many storage engines. Whether you need them or not, there is a wide range of options.

Among them, the most commonly used or easiest to use is the MergeTree series, which is simply the storage structure of the merge tree. The query must be fast. In addition, it is also suitable for the storage of large amounts of data. So, generally choose this thing. Of course, there are many subclasses under it, which need to be changed accordingly.

For example, replicated mergetree means that there are multiple nodes storing data, which is necessary for highly available queries (queries for any node are also necessary).

AggregatingMergeTree represents the current node. It is a data fragmentation method aggregated by primary key.

For the MergeTree engine alone, if you want to have more optimized applications or special needs, you must personally read the official documents of clickhouse. It's really distressing to have too many choices.

Other storage engines, such as Log series, will be used in fewer scenarios. They can be considered when they are generally used as temporary tables. Others, such as File, can be used as a parser...

In short, it is not easy to fully understand ck's storage engine unless you use it deeply.


3. Primary key in Clickhouse

In clickhouse, it is not explicitly stated that there must be a primary key, but when creating a table, the sorting field will be used as the primary key by default.

The function of its primary key is equivalent to that of an ordinary index to a certain extent, which may be why it is not explicitly called a primary key, because it does not need to be unique but is conducive to search.

But it still has   Definition of Primary Key.


4. curd sql

We only talk about the simplest way, but in fact, a very big feature of clickhouse is that its sql is very diverse and flexible. Whether you can use it or not, it has many functions anyway. And the document is also hehe.

-- Create table, It is worth noting that it can very complex expiration policies
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
    name1 [type1] [NULL|NOT NULL] [DEFAULT|MATERIALIZED|ALIAS expr1] [compression_codec] [TTL expr1],
    name2 [type2] [NULL|NOT NULL] [DEFAULT|MATERIALIZED|ALIAS expr2] [compression_codec] [TTL expr2],
    PRIMARY KEY(expr1[, expr2,...])],
) ENGINE = MergeTree comment 'xxx'
    d + INTERVAL 1 WEEK TO VOLUME 'aaa',
    d + INTERVAL 2 WEEK TO DISK 'bbb';
-- The syntax of updating and deleting is unique enough. It is said that it is designed to make everyone use this function less. It's powerful
ALTER TABLE [db.]table UPDATE column1 = expr1 [, ...] WHERE filter_expr
ALTER TABLE [db.]table [ON CLUSTER cluster] DELETE WHERE filter_expr
-- Query has many unique uses, such as WITH
[WITH expr_list|(subquery)]
SELECT [DISTINCT [ON (column1, column2, ...)]] expr_list
[FROM [db.]table | (subquery) | table_function] [FINAL]
[SAMPLE sample_coeff]
[GLOBAL] [ANY|ALL|ASOF] [INNER|LEFT|RIGHT|FULL|CROSS] [OUTER|SEMI|ANTI] JOIN (subquery)|table (ON <expr_list>)|(USING <column_list>)
[WHERE expr]
[HAVING expr]
[ORDER BY expr_list] [WITH FILL] [FROM expr] [TO expr] [STEP expr]
[LIMIT [offset_value, ]n BY columns]
[LIMIT [n, ]m] [WITH TIES]
[UNION  ...]
[FORMAT format]
-- Data insertion can be used as exclusive insertion syntax
INSERT INTO [db.]table [(c1, c2, c3)] VALUES (v11, v12, v13), (v21, v22, v23), ...
INSERT INTO [db.]table [(c1, c2, c3)] SELECT ...
INSERT INTO insert_select_testtable (* EXCEPT(b)) Values (2, 2);
-- Execution plan query is very helpful to understand its internal mechanism. Its execution plan is very detailed, but it also looks a little scary
EXPLAIN [AST | SYNTAX | PLAN | PIPELINE] [setting = value, ...] SELECT ... [FORMAT ...]

5. Local and cluster

Although clickhouse is known as a large amount of real-time analysis database, it does not absolutely use distributed storage. Instead, two terms are constructed for selection, that is, local surface and distributed table.

As the name suggests, a local table is just a table stored on the local machine. This local table can only be used in some scenarios, such as when the node you are connected to is always the same machine. At this time, it is the same as databases such as mysql, and its storage capacity and performance are stand-alone. However, it is worth noting that when it is associated with the distribution table, there may be an unexpected result: an error.

Distributed table, that is, it can be queried on each node. This is the real distributed storage of large amount of data, and I don't care where the data is stored. As long as you can give the right results. In fact, behind the distribution tables are local tables. However, it generally requires a replica storage mechanism.

However, it is very rogue that we cannot directly create distributed tables. Instead, we must first create local tables, and then create corresponding distributed tables based on them. That is, we need to do a table building job twice to complete it.

Moreover, for data writing, you can write to local tables or distributed tables. Although the number of entrances has indeed increased, it also gives everyone a confused feeling.

-- Create original crowd table
create table loc_tab1 ON CLUSTER ck_cluster (
    xx String comment 'xx',
    version_no String comment 'version, for update'
) ENGINE = ReplicatedMergeTree
partition by xx
order by xx
-- Create distributed tables
CREATE TABLE distributed_tab2 AS loc_tab1 ENGINE = Distributed(ck_cluster, currentDatabase(), loc_tab1, rand());


6. bitmap data operation reference

Bitmap data has been applied in some scenarios, such as code value data table, because of its ultra-high performance cross difference operation ability and the ability to save storage space. However, in order to build bitmap data, more pre work is often required, and due to the data compression of bitmap, it may not be able to deal with complex scenarios, which need to be evaluated in advance.

-- Create original bitmap surface
create table loc_tab1_bitmap ON CLUSTER ck_cluster (
    xx String comment 'xx',
    uv AggregateFunction(groupBitmap, UInt64),
    version_no String comment 'version, for update'
) ENGINE = ReplicatedMergeTree
partition by xx
order by xx
-- Create distributed tables
CREATE TABLE distributed_tab2_bitmap AS loc_tab1_bitmap ENGINE = Distributed(ck_cluster, currentDatabase(), loc_tab1_bitmap, rand());
-- Insert crowd bitmap Table data, Insert data into local tables and read data into distributed tables
-- The read data source table is generally required to be a wide distribution table, and its function is only to build bitmap Data, there will be an action of deleting after use
insert into loc_tab1_bitmap
    select xx, groupBitmapState(toUInt64OrZero(uid)) as uv,version_no
    from dist_data_raw
    group by xx,version_no;
-- Read the reference and find two bitmap The intersection of data and do it in another table group by 
with intersect_tab as ( select arrayJoin(bitmapToArray(bitmapAnd(user1, user2))) as uid  from (select uv as user1, 1 as join_id from distributed_tab2_bitmap   where xx = '1') t1  inner join (select uv as user2, 1 as join_id from distributed_tab2_bitmap   where  xx = '2') t2  on t1.join_id = t2.join_id ),a1 as (select x2, toUInt64(xx) as uid from distributed_tb3)  select x2,count(1) as cnt from a1 right join intersect_tab a2 on a1.uid = a2.uid group by x2 order by cnt desc


Posted by arlabbafi on Fri, 03 Dec 2021 00:26:30 -0800